This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.


Wednesday, December 23, 2015

Using wget to download specific files from ftp but avoiding the directory structure

I want to download files from an FTP site, but only those whose names match a pattern. How can I do it?
Use wget! It is a very versatile command, and I just learned several tricks.
When the files are spread over many levels of folders, you want wget to search down through all of them:
-r --recursive
Turn on recursive retrieving.
-l depth --level=depth
Specify recursion maximum depth level depth. The default maximum depth is 5.
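As a minimal sketch, limiting the recursion to two levels could look like the following. The server ftp.example.org and its path are placeholders rather than a real site, and the function name is made up; wrapping the command in a function simply means nothing is downloaded until you call it.

```shell
# Placeholder URL: replace with the FTP directory you actually want.
url='ftp://ftp.example.org/pub/data/'

# Hypothetical helper: recurse from $url, but no deeper than 2 levels.
fetch_two_levels() {
    wget -r -l 2 "$url"
}

# Call it when ready:
# fetch_two_levels
```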
You can specify which files to download or reject using wildcards:
Recursive Accept/Reject Options
-A acclist --accept acclist
-R rejlist --reject rejlist
Specify comma-separated lists of file name suffixes or patterns to accept or reject. Note that if any of the wildcard characters, *, ?, [ or ], appear in an element of acclist or rejlist, it will be treated as a pattern, rather than a suffix.
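One detail that is easy to miss: the pattern should be quoted, otherwise the shell may expand the wildcard against local files before wget ever sees it. A small sketch, assuming a made-up helper name and a placeholder server:

```shell
# Hypothetical helper: fetch_matching URL ACCEPT_LIST
# ACCEPT_LIST is a comma-separated list such as '*.bed.gz,*.txt'.
fetch_matching() {
    # The quotes around "$2" keep the shell from expanding the wildcards itself.
    wget -r -A "$2" "$1"
}

# Example call (placeholder server):
# fetch_matching 'ftp://ftp.example.org/pub/data/' '*.bed.gz'
```

Swapping -A for -R in the same position would reject the matching files instead of accepting them.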
If you want to save the file under a different name:
-O file --output-document=file
The documents will not be written to the appropriate files, but all will be concatenated together and written to file. If - is used as file, documents will be printed to standard output, disabling link conversion. (Use ./- to print to a file literally named -.)
       Use of -O is not intended to mean simply "use the name file instead of the
       one in the URL;" rather, it is analogous to shell redirection: wget -O file
       http://foo is intended to work like wget -O - http://foo > file; file will be
       truncated immediately, and all downloaded content will be written there.

       For this reason, -N (for timestamp-checking) is not supported in combination
       with -O: since file is always newly created, it will always have a very new
       timestamp. A warning will be issued if this combination is used.

       Similarly, using -r or -p with -O may not work as you expect: Wget won’t just
       download the first file to file and then download the rest to their normal
       names: all downloaded content will be placed in file. This was disabled in
       version 1.11, but has been reinstated (with a warning) in 1.11.2, as there
       are some cases where this behavior can actually have some use.

       Note that a combination with -k is only permitted when downloading a single
       document, as in that case it will just convert all relative URIs to external
       ones; -k makes no sense for multiple URIs when they’re all being downloaded
       to a single file.
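Given those caveats, -O seems most useful for a single document. Here is a sketch of two such uses; the helper names are invented for illustration and the URLs you pass in are up to you.

```shell
# Hypothetical helper: save_as URL FILE
# Download one document and store it under a name of your choosing.
save_as() {
    wget -O "$2" "$1"
}

# Hypothetical helper: peek URL
# Print the first 5 lines of a remote text file without saving it.
# -q silences the progress output; "-O -" writes the document to stdout.
peek() {
    wget -q -O - "$1" | head -n 5
}

# Example calls (placeholder server):
# save_as 'ftp://ftp.example.org/pub/data/README' notes.txt
# peek 'ftp://ftp.example.org/pub/data/README'
```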
If you do not need the folder structure:
-nd --no-directories
Do not create a hierarchy of directories when retrieving recursively. With this option turned on, all files will get saved to the current directory, without clobbering (if a name shows up more than once, the filenames will get extensions .n).
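Putting the pieces so far together gives what the title of this post asks for: matching files from anywhere in the remote tree, all saved flat in the current directory. This is a sketch with a made-up helper name; the server in the example call is a placeholder.

```shell
# Hypothetical helper: fetch_flat URL ACCEPT_LIST
#   -r   recurse into subdirectories
#   -nd  put every file in the current directory (no local folder tree)
#   -A   keep only files matching the quoted pattern list
fetch_flat() {
    wget -r -nd -A "$2" "$1"
}

# Example call (placeholder server):
# fetch_flat 'ftp://ftp.example.org/pub/data/' '*.bed.gz'
```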
An alternative is to specify -nH together with a large --cut-dirs value such as --cut-dirs=10:
-nH --no-host-directories
Disable generation of host-prefixed directories. By default, invoking Wget with -r http://fly.srk.fer.hr/ will create a structure of directories beginning with fly.srk.fer.hr/. This option disables such behavior.
--cut-dirs=number
Ignore number directory components. This is useful for getting a fine-grained control over the directory where recursive retrieval will be saved.
       Take, for example, the directory at
       ftp://ftp.xemacs.org/pub/xemacs/.  If you retrieve it with -r, it
       will be saved locally under ftp.xemacs.org/pub/xemacs/.  While the
       -nH option can remove the ftp.xemacs.org/ part, you are still stuck
       with pub/xemacs.  This is where --cut-dirs comes in handy; it makes
       Wget not "see" number remote directory components.  Here are
       several examples of how --cut-dirs option works.

               No options        -> ftp.xemacs.org/pub/xemacs/
               -nH               -> pub/xemacs/
               -nH --cut-dirs=1  -> xemacs/
               -nH --cut-dirs=2  -> .

               --cut-dirs=1      -> ftp.xemacs.org/xemacs/
               ...

       If you just want to get rid of the directory structure, this option
       is similar to a combination of -nd and -P.  However, unlike -nd,
       --cut-dirs does not lose with subdirectories---for instance, with
       -nH --cut-dirs=1, a beta/ subdirectory will be placed to
       xemacs/beta, as one would expect.
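To check my own reading of that table, here is a small shell function that mimics the mapping. It is only an illustration of the table, not how wget is implemented, and the function name and argument order are my own invention.

```shell
# Hypothetical helper: local_prefix HOST REMOTE_DIR CUT [nH]
# Prints the local directory that the table above predicts
# (without the trailing slash). Variables are global for simplicity.
local_prefix() {
    host=$1 dir=$2 cut=$3 no_host=${4:-}
    # Drop the first $cut components of the remote directory.
    while [ "$cut" -gt 0 ] && [ -n "$dir" ]; do
        case $dir in
            */*) dir=${dir#*/} ;;
            *)   dir='' ;;
        esac
        cut=$((cut - 1))
    done
    # Keep the host directory unless -nH was requested.
    [ "$no_host" = nH ] || dir="$host${dir:+/$dir}"
    printf '%s\n' "${dir:-.}"
}

local_prefix ftp.xemacs.org pub/xemacs 0       # ftp.xemacs.org/pub/xemacs
local_prefix ftp.xemacs.org pub/xemacs 1 nH    # xemacs
local_prefix ftp.xemacs.org pub/xemacs 2 nH    # .

# The real command for the last row would be:
# wget -r -nH --cut-dirs=2 ftp://ftp.xemacs.org/pub/xemacs/
```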
If you want to save files to a different folder:
-P prefix --directory-prefix=prefix
Set directory prefix to prefix. The directory prefix is the directory where all other files and subdirectories will be saved to, i.e. the top of the retrieval tree. The default is . (the current directory).
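A sketch of -P combined with -nd, so that the matching files land flat in a folder of your choosing. The helper name, the folder name and the server in the example call are all placeholders.

```shell
# Hypothetical helper: fetch_into DIR URL ACCEPT_LIST
fetch_into() {
    # -P sets the directory in which the downloaded files are placed.
    wget -r -nd -P "$1" -A "$3" "$2"
}

# Example call (placeholder folder and server):
# fetch_into downloads 'ftp://ftp.example.org/pub/data/' '*.bed.gz'
```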
To resume a partially downloaded file:
-c --continue
Continue getting a partially-downloaded file. This is useful when you want to finish up a download started by a previous instance of Wget, or by another program.
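In practice this means rerunning the same download with -c added, from the directory that holds the partial file. A minimal sketch, with an invented helper name and a placeholder URL in the example call:

```shell
# Hypothetical helper: resume URL
# Run it in the directory containing the partially downloaded file.
resume() {
    wget -c "$1"
}

# Example call (placeholder server and file):
# resume 'ftp://ftp.example.org/pub/data/big_file.bam'
```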
There are many more options; just run man wget to see all of them. I am impressed by how versatile this command is!