Creative Commons License
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.


Thursday, May 2, 2013

Download all the stuff from a website with wget

I want to download all the stuff from a website: http://python.genedrift.org/

BEGINNING PYTHON FOR BIOINFORMATICS

I googled it, and found this:
GNU Wget is a nice tool for downloading resources from the internet. The basic usage is wget url:
wget http://linuxreviews.org/
The power of wget is that you can download sites recursively, meaning you also get all the pages (and images and other data) linked from the front page:
wget -r http://linuxreviews.org/
But many sites do not want you to download their entire site. To prevent this, they check how browsers identify themselves. Many sites refuse the connection or send a blank page if they detect that you are not using a web browser. You might get a message like:
Sorry, but the download manager you are using to view this site is not supported. We do not support use of such download managers as flashget, go!zilla, or getright
There is a very handy -U option for sites like this. Use
-U My-browser
to tell the site you are using some commonly accepted browser:
wget -r -p -U Mozilla http://www.stupidsite.com/restricedplace.html
A web-site owner will probably get upset if you attempt to download his entire site using a simple
wget -r http://foo.bar
command. However, the web-site owner is much less likely to notice you if you limit the download transfer rate and pause between fetching files.
To reduce the risk of being manually added to a blacklist, the most important command-line options are --limit-rate= and --wait= .
To pause 20 seconds between retrievals you should add
--wait=20
and to limit the download rate use something like
--limit-rate=20K
This option defaults to bytes per second, so add K to set KB/s.
Example:
wget --wait=20 --limit-rate=20K -r -p -U Mozilla http://www.stupidsite.com/restricedplace.html
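Throttling this hard makes a large site slow to fetch, and a little arithmetic shows how slow. The sketch below gives a rough lower bound only. The file count and total size are made-up numbers, and estimate_seconds is a hypothetical helper, not part of wget.

```shell
#!/bin/sh
# Rough lower bound on the duration of a throttled wget run.
# The 500 files and 10000 KB used below are hypothetical;
# wait and rate_kb mirror --wait=20 and --limit-rate=20K.
estimate_seconds() {
    files=$1      # number of files to fetch
    total_kb=$2   # combined size of all files, in KB
    wait=$3       # seconds, as passed to --wait
    rate_kb=$4    # KB/s, as passed to --limit-rate
    # wget pauses between retrievals, so there are (files - 1) pauses,
    # plus the transfer time at the capped rate.
    echo $(( (files - 1) * wait + total_kb / rate_kb ))
}

estimate_seconds 500 10000 20 20   # prints 10480
```

That is just under three hours for 500 small files, almost all of it spent waiting, so a shorter --wait may be a fair trade on a small site.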
A very handy option that guarantees wget will not climb into the folders above the folder you want to acquire is:
--no-parent
Use this to make sure wget does not fetch more than it needs to if you just want to download the files in a folder.
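To picture what --no-parent allows, here is a minimal sketch of the rule as a prefix check: a link is followed only if it sits at or below the starting directory. The in_scope function and the example.org URLs are hypothetical, and wget's real logic handles more cases than this.

```shell
#!/bin/sh
# Simplified model of --no-parent: a candidate URL is in scope only
# when it starts with the directory the recursive download began in.
# in_scope is a hypothetical helper, not something wget provides.
in_scope() {
    start_dir=$1   # starting directory, with a trailing slash
    candidate=$2   # a link found while recursing
    case $candidate in
        "$start_dir"*) echo yes ;;
        *)             echo no ;;
    esac
}

in_scope http://example.org/docs/python/ http://example.org/docs/python/ch1.html   # prints yes
in_scope http://example.org/docs/python/ http://example.org/docs/                  # prints no
```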
Read the manual page for wget to learn more about GNU Wget. The full official manual is available here.
To install the Gnome front-end for wget click here.


And it worked well for me:) 
wget comes pre-installed on Bio-Linux.
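I did not write down the exact command I ran, so the following is only a sketch of how the options above could be combined for python.genedrift.org. The polite_mirror function and the DRY_RUN variable are hypothetical names for this example. By default it only prints the command, so nothing is downloaded until DRY_RUN is set to 0.

```shell
#!/bin/sh
# Print (dry run) or execute a throttled recursive wget for one site.
# polite_mirror and DRY_RUN are hypothetical names used for this sketch.
polite_mirror() {
    url=$1
    # -r recurse, -p fetch page requisites (images, CSS),
    # -U set the user agent, --no-parent never ascend above the start folder.
    set -- wget --wait=20 --limit-rate=20K -r -p -U Mozilla --no-parent "$url"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$@"    # show the command without touching the network
    else
        "$@"         # actually run wget
    fi
}

polite_mirror http://python.genedrift.org/
```

Run it once as is to check the command line, then again with DRY_RUN=0 to start the download.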