

Appendices

This chapter contains some references I consider useful.

Robots

It is extremely easy to make Wget wander aimlessly around a web site, sucking up all the available data in the process. A simple `wget -r site', and you're set. Great? Not for the server admin.

While Wget is retrieving static pages, there's not much of a problem. But for Wget, there is no real difference between a static page and the most demanding CGI. For instance, a site I know has a section handled by a CGI script that converts all the Info files to HTML. The script can and does bring the machine to its knees, without providing anything useful to the downloader.

For such and similar cases, various robot exclusion schemes have been devised as a means for server administrators and document authors to protect chosen portions of their sites from the wanderings of robots.

The more popular mechanism is the Robots Exclusion Standard, or RES, written by Martijn Koster et al. in 1994. It specifies the format of a text file containing directives that instruct the robots which URL paths to avoid. To be found by the robots, the specifications must be placed in `/robots.txt' in the server root, which the robots are supposed to download and parse.
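For illustration, here is a sketch of what a `/robots.txt' might contain. The file consists of records, each naming a user agent and the path prefixes that agent should avoid; the host and paths below are made up:

# Hypothetical /robots.txt for www.example.com
User-agent: Wget
Disallow: /cgi-bin/
Disallow: /info2html/

User-agent: *
Disallow: /private/

A robot that honors the standard reads the record matching its own name (or the `*' record if none does) and skips every URL whose path begins with one of the listed `Disallow' prefixes.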

Wget supports RES when downloading recursively. So, when you issue:

wget -r http://www.server.com/

First, the index of `www.server.com' will be downloaded. If Wget finds that it wants to download more documents from that server, it will request `http://www.server.com/robots.txt' and, if it is found, use its directives for further downloads. `robots.txt' is loaded only once per server.

Until version 1.8, Wget supported the first version of the standard, written by Martijn Koster in 1994 and available at http://www.robotstxt.org/wc/norobots.html. As of version 1.8, Wget has supported the additional directives specified in the internet draft `<draft-koster-robots-00.txt>' titled "A Method for Web Robots Control". The draft, which has, as far as I know, never made it to an RFC, is available at http://www.robotstxt.org/wc/norobots-rfc.txt.

This manual no longer includes the text of the Robot Exclusion Standard.

The second, lesser-known mechanism enables the author of an individual document to specify whether they want the links from the file to be followed by a robot. This is achieved using the META tag, like this:

<meta name="robots" content="nofollow">

This is explained in some detail at http://www.robotstxt.org/wc/meta-user.html. Wget supports this method of robot exclusion in addition to the usual `/robots.txt' exclusion.

Security Considerations

When using Wget, you must be aware that it sends unencrypted passwords through the network, which may present a security problem. Here are the main issues, and some solutions.

  1. The passwords on the command line are visible using `ps'. If this is a problem, avoid putting passwords on the command line--e.g. you can use `.netrc' for this (see the sample entry after this list).
  2. When the insecure basic authentication scheme is used, passwords travel through network routers and gateways essentially in the clear (they are merely base64-encoded, which is trivial to reverse).
  3. The FTP passwords are also in no way encrypted. There is no good solution for this at the moment.
  4. Although the "normal" output of Wget tries to hide the passwords, debugging logs show them, in all forms. This problem is avoided by being careful when you send debug logs (yes, even when you send them to me).
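As a sketch of item 1 above, a `.netrc' entry for a single host looks like the following; the host name and credentials are made up, and the file should be readable only by its owner (e.g. `chmod 600 ~/.netrc'):

machine ftp.example.com
login myusername
password mysecret

With such an entry in place, Wget (like other programs that honor `.netrc') can pick up the login and password for that host without them ever appearing on the command line.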

Contributors

GNU Wget was written by Hrvoje Nikšić <hniksic@arsdigita.com>. However, its development could never have gone as far as it has, were it not for the help of many people, either with bug reports, feature proposals, patches, or letters saying "Thanks!".

Special thanks go to the following people (in no particular order):

The following people have provided patches, bug/build reports, useful suggestions, beta testing services, fan mail and all the other things that make maintenance so much fun:

Ian Abbott, Tim Adam, Adrian Aichner, Martin Baehr, Dieter Baron, Roger Beeman, Dan Berger, T. Bharath, Paul Bludov, Daniel Bodea, Mark Boyns, John Burden, Wanderlei Cavassin, Gilles Cedoc, Tim Charron, Noel Cragg, Kristijan Čonkaš, John Daily, Andrew Davison, Andrew Deryabin, Ulrich Drepper, Marc Duponcheel, Damir Džeko, Alan Eldridge, Aleksandar Erkalović, Andy Eskilsson, Christian Fraenkel, Masashi Fujita, Howard Gayle, Marcel Gerrits, Lemble Gregory, Hans Grobler, Mathieu Guillaume, Dan Harkless, Herold Heiko, Jochen Hein, Karl Heuer, HIROSE Masaaki, Gregor Hoffleit, Erik Magnus Hulthen, Richard Huveneers, Jonas Jensen, Simon Josefsson, Mario Jurić, Hack Kampbjørn, Const Kaplinsky, Goran Kezunović, Robert Kleine, KOJIMA Haime, Fila Kolodny, Alexander Kourakos, Martin Kraemer, Hrvoje Lacko, Daniel S. Lewart, Nicolás Lichtmeier, Dave Love, Alexander V. Lukyanov, Jordan Mendelson, Lin Zhe Min, Tim Mooney, Simon Munton, Charlie Negyesi, R. K. Owen, Andrew Pollock, Steve Pothier, Jan Přikryl, Marin Purgar, Csaba Ráduly, Keith Refson, Tyler Riddle, Tobias Ringstrom, Edward J. Sabol, Heinz Salzmann, Robert Schmidt, Andreas Schwab, Chris Seawood, Toomas Soome, Tage Stabell-Kulo, Sven Sternberger, Markus Strasser, John Summerfield, Szakacsits Szabolcs, Mike Thomas, Philipp Thomas, Dave Turner, Russell Vincent, Charles G Waldman, Douglas E. Wegscheid, Jasmin Zainul, Bojan Ždrnja, Kristijan Zimmer.

Apologies to all whom I accidentally left out, and many thanks to all the subscribers of the Wget mailing list.

