

Following Links

When retrieving recursively, one does not wish to retrieve loads of unnecessary data. Most of the time users know exactly what they want to download, and want Wget to follow only specific links.

For example, if you wish to download the music archive from `fly.srk.fer.hr', you will not want to download all the home pages that happen to be referenced by an obscure part of the archive.

Wget possesses several mechanisms that allow you to fine-tune which links it will follow.

Spanning Hosts

Wget's recursive retrieval normally refuses to visit hosts different from the one you specified on the command line. This is a reasonable default; without it, every retrieval would have the potential to turn your Wget into a small version of Google.

However, visiting different hosts, or host spanning, is sometimes a useful option. Maybe the images are served from a different server. Maybe you're mirroring a site that consists of pages interlinked between three servers. Maybe the server has two equivalent names, and the HTML pages refer to both interchangeably.

Span to any host---`-H'
The `-H' option turns on host spanning, thus allowing Wget's recursive run to visit any host referenced by a link. Unless sufficient recursion-limiting criteria (such as a depth limit) are applied, these foreign hosts will typically link to yet more hosts, and so on until Wget ends up sucking up much more data than you intended.
Limit spanning to certain domains---`-D'
The `-D' option allows you to specify the domains that will be followed, thus limiting the recursion only to the hosts that belong to these domains. Obviously, this makes sense only in conjunction with `-H'. A typical example would be downloading the contents of `www.server.com', but allowing downloads from `images.server.com', etc.:
wget -rH -Dserver.com http://www.server.com/
You can specify more than one domain by separating them with commas, e.g. `-Ddomain1.com,domain2.com'.
Keep download off certain domains---`--exclude-domains'
If there are domains you want to exclude specifically, you can do it with `--exclude-domains', which accepts the same type of arguments as `-D', but will exclude all the listed domains. For example, if you want to download all the hosts from the `foo.edu' domain, with the exception of `sunsite.foo.edu', you can do it like this:
wget -rH -Dfoo.edu --exclude-domains sunsite.foo.edu \
    http://www.foo.edu/
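
The effect of combining `-D' with `--exclude-domains' can be previewed with a small shell sketch. This is only a simplified model of the rules described above, not Wget's own code: the names `ACCEPT', `EXCLUDE', `in_list' and `would_span' are invented for the illustration, and the sketch assumes that a host belongs to a domain when it equals the domain or ends with a dot followed by the domain. Wget's actual matching may differ in detail.

```shell
#!/bin/sh
# Hypothetical stand-ins for the -D and --exclude-domains arguments.
ACCEPT="foo.edu"
EXCLUDE="sunsite.foo.edu"

# in_list HOST LIST: succeed if HOST equals, or is a subdomain of,
# any domain in the comma-separated LIST.
in_list() {
    host=$1
    old_ifs=$IFS
    IFS=,
    for domain in $2; do
        case $host in
            "$domain" | *."$domain") IFS=$old_ifs; return 0 ;;
        esac
    done
    IFS=$old_ifs
    return 1
}

# would_span HOST: print "follow" or "skip" for a host reached with -H on.
# Excluded domains are checked first, then the accepted ones.
would_span() {
    if in_list "$1" "$EXCLUDE"; then
        echo skip
    elif in_list "$1" "$ACCEPT"; then
        echo follow
    else
        echo skip
    fi
}

would_span www.foo.edu
would_span sunsite.foo.edu
would_span www.example.org
```

Under these assumptions only `www.foo.edu' is reported as followed; `sunsite.foo.edu' is dropped by the exclusion list and `www.example.org' (a made-up host) lies outside the accepted domain.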

Types of Files

When downloading material from the web, you will often want to restrict the retrieval to only certain file types. For example, if you are interested in downloading GIFs, you will not be overjoyed to get loads of PostScript documents, and vice versa.

Wget offers two options to deal with this problem. Each option description lists a short name, a long name, and the equivalent command in `.wgetrc'.

`-A acclist'
`--accept acclist'
`accept = acclist'
The argument to the `--accept' option is a list of file suffixes or patterns that Wget will download during recursive retrieval. A suffix is the ending part of a file name, and consists of "normal" letters, e.g. `gif' or `.jpg'. A matching pattern contains shell-like wildcards, e.g. `books*' or `zelazny*196[0-9]*'. So, specifying `wget -A gif,jpg' will make Wget download only the files ending with `gif' or `jpg', i.e. GIFs and JPEGs. On the other hand, `wget -A "zelazny*196[0-9]*"' will download only files beginning with `zelazny' and containing numbers from 1960 to 1969 anywhere within. Look up the manual of your shell for a description of how pattern matching works. Of course, any number of suffixes and patterns can be combined into a comma-separated list, and given as an argument to `-A'.
`-R rejlist'
`--reject rejlist'
`reject = rejlist'
The `--reject' option works the same way as `--accept', only its logic is the reverse; Wget will download all files except the ones matching the suffixes (or patterns) in the list. So, if you want to download a whole page except for the cumbersome MPEGs and .AU files, you can use `wget -R mpg,mpeg,au'. Analogously, to download all files except the ones beginning with `bjork', use `wget -R "bjork*"'. The quotes are to prevent expansion by the shell.

The `-A' and `-R' options may be combined to achieve even better fine-tuning of which files to retrieve. E.g. `wget -A "*zelazny*" -R .ps' will download all the files having `zelazny' as a part of their name, but not the PostScript files.

Note that these two options do not affect the downloading of HTML files; Wget must load all the HTML files to know where to go at all. Recursive retrieval would make no sense otherwise.
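
Since the patterns are shell-like, the shell itself can be used to preview which file names a given pair of lists would let through. The sketch below is an approximation under stated assumptions, not Wget's implementation: `ACCEPT', `REJECT', `matches_list' and `would_download' are names invented here, an entry containing `*', `?' or `[' is treated as a pattern matched against the whole name, any other entry is treated as a suffix, and the exemption for HTML files is not modelled.

```shell
#!/bin/sh
# Hypothetical stand-ins for the -A and -R arguments.
ACCEPT='*zelazny*'
REJECT='.ps'

# matches_list NAME LIST: succeed if NAME matches any comma-separated entry.
matches_list() {
    name=$1
    old_ifs=$IFS
    IFS=,
    set -f                      # do not expand list entries against local files
    for entry in $2; do
        case $entry in
            *\** | *\?* | *\[*) pattern=$entry ;;      # wildcard: use as is
            *)                  pattern="*$entry" ;;   # plain text: suffix
        esac
        case $name in
            $pattern) set +f; IFS=$old_ifs; return 0 ;;
        esac
    done
    set +f
    IFS=$old_ifs
    return 1
}

# would_download NAME: a name must match the accept list (when one is given)
# and must not match the reject list.
would_download() {
    if [ -n "$ACCEPT" ] && ! matches_list "$1" "$ACCEPT"; then
        echo skip
    elif [ -n "$REJECT" ] && matches_list "$1" "$REJECT"; then
        echo skip
    else
        echo download
    fi
}

would_download zelazny-lord-of-light.txt
would_download zelazny-amber.ps
would_download herbert-dune.txt
```

With the lists shown, only the first of the three made-up file names is reported as downloaded: the second is a PostScript file and the third does not contain `zelazny'.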

Directory-Based Limits

Regardless of other link-following facilities, it is often useful to restrict which files are retrieved based on the directories in which they are placed. There can be many reasons for this: the home pages may be organized in a reasonable directory structure, or some directories may contain useless information, e.g. `/cgi-bin' or `/dev' directories.

Wget offers three different options to deal with this requirement. Each option description lists a short name, a long name, and the equivalent command in `.wgetrc'.

`-I list'
`--include list'
`include_directories = list'
The `-I' option accepts a comma-separated list of directories to include in the retrieval. Any other directories will simply be ignored. The directories are absolute paths. So, if you wish to download from `http://host/people/bozo/' following only links to bozo's colleagues in the `/people' directory and the bogus scripts in `/cgi-bin', you can specify:
wget -I /people,/cgi-bin http://host/people/bozo/
`-X list'
`--exclude list'
`exclude_directories = list'
The `-X' option is exactly the reverse of `-I': it takes a list of directories to exclude from the download. E.g. if you do not want Wget to download things from the `/cgi-bin' directory, specify `-X /cgi-bin' on the command line. As with `-A'/`-R', these two options can be combined to fine-tune which subdirectories are downloaded. E.g. if you want to load all the files from the `/pub' hierarchy except for `/pub/worthless', specify `-I/pub -X/pub/worthless'.
`-np'
`--no-parent'
`no_parent = on'
The simplest, and often very useful, way of limiting directories is to disallow retrieval of links that refer to the hierarchy above the starting directory, i.e. to disallow ascent to the parent directory or directories. The `--no-parent' option (short `-np') is useful in this case. Using it guarantees that you will never leave the existing hierarchy. Supposing you issue Wget with:
wget -r --no-parent http://somehost/~luzer/my-archive/
You may rest assured that none of the references to `/~his-girls-homepage/' or `/~luzer/all-my-mpegs/' will be followed. Only the archive you are interested in will be downloaded. Essentially, `--no-parent' is similar to `-I/~luzer/my-archive', only it handles redirections in a more intelligent fashion.
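
The directory rules above can be modelled in a few lines of shell. This is a rough sketch rather than Wget's own logic: `INCLUDE_DIR', `EXCLUDE_DIR', `under_dir', `dir_allowed' and `no_parent_allows' are hypothetical names, only one directory per list is handled, paths are written without a trailing slash, and redirections are ignored.

```shell
#!/bin/sh
# Hypothetical stand-ins for the -I and -X arguments.
INCLUDE_DIR=/pub
EXCLUDE_DIR=/pub/worthless

# under_dir PATH DIR: succeed if PATH is DIR itself or lies below it.
under_dir() {
    case $1 in
        "$2" | "$2"/*) return 0 ;;
    esac
    return 1
}

# dir_allowed PATH: the path must lie under the included directory (when one
# is given) and must not lie under the excluded directory.
dir_allowed() {
    if [ -n "$INCLUDE_DIR" ] && ! under_dir "$1" "$INCLUDE_DIR"; then
        echo skip
    elif [ -n "$EXCLUDE_DIR" ] && under_dir "$1" "$EXCLUDE_DIR"; then
        echo skip
    else
        echo download
    fi
}

# no_parent_allows START PATH: with --no-parent, only paths at or below the
# starting directory are considered.
no_parent_allows() {
    if under_dir "$2" "$1"; then
        echo download
    else
        echo skip
    fi
}

dir_allowed /pub/gnu
dir_allowed /pub/worthless/junk
dir_allowed /cgi-bin
no_parent_allows /~luzer/my-archive /~luzer/my-archive/1999
no_parent_allows /~luzer/my-archive /~luzer/all-my-mpegs
```

In this model the exclusion takes effect even though `/pub/worthless' also lies under the included `/pub', which matches the `-I/pub -X/pub/worthless' example given for `-X'.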

Relative Links

When `-L' is turned on, only the relative links are ever followed. Relative links are here defined as those that do not refer to the web server root. For example, these links are relative:

<a href="foo.gif">
<a href="foo/bar.gif">
<a href="../foo/bar.gif">

These links are not relative:

<a href="/foo.gif">
<a href="/foo/bar.gif">
<a href="http://www.server.com/foo/bar.gif">

Using this option guarantees that recursive retrieval will not span hosts, even without `-H'. In simple cases it also allows downloads to "just work" without having to convert links.
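
The definition of a relative link used here can be written down as a short shell test. The function name `link_kind' is made up for this sketch, and the check is deliberately crude: a link counts as not relative if it starts with `/' or looks like it starts with a scheme followed by `://'. It is meant to mirror the examples above, not to reproduce Wget's URL parser.

```shell
#!/bin/sh
# link_kind HREF: print "relative" or "not-relative" for a link target.
link_kind() {
    case $1 in
        /*)             echo not-relative ;;   # refers to the server root
        [A-Za-z]*://*)  echo not-relative ;;   # carries a scheme and a host
        *)              echo relative ;;
    esac
}

for href in foo.gif foo/bar.gif ../foo/bar.gif \
            /foo.gif /foo/bar.gif http://www.server.com/foo/bar.gif; do
    printf '%s: %s\n' "$href" "$(link_kind "$href")"
done
```

The loop classifies the six example links from this section: the first three come out as relative, the last three as not relative.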

This option is probably not very useful and might be removed in a future release.

Following FTP Links

The rules for FTP are somewhat specific, by necessity. FTP links in HTML documents are often included for purposes of reference, and it is often inconvenient to download them by default.

To have FTP links followed from HTML documents, you need to specify the `--follow-ftp' option. Once you have done that, FTP links will span hosts regardless of the `-H' setting. This is logical, as FTP links rarely point to the same host where the HTTP server resides. For similar reasons, the `-L' option has no effect on such downloads. On the other hand, domain acceptance (`-D') and suffix rules (`-A' and `-R') apply normally.

Also note that followed links to FTP directories will not be retrieved recursively further.

