Additional Resources
General ROBOTS.TXT Information
Martijn Koster’s site about the Robot Exclusion Protocol.
This is the official, definitive site on robot exclusion.
Who’s Knocking on the Door?
This article by Rhoda Schueller gives a very good explanation of why a web developer would want to use robot exclusion.
Spider Spotting Chart
(Link is no Longer Available) This is a simplified chart provided by Search Engine Watch. All of this information is available from the RoboGen database.
Server-side Robot Enforcement and Fighting Spam
Note: These are long-term ideas for the war on spam, not a quick-fix filtering solution.
Robotcop, robots.txt: it’s the law
(Link is no Longer Available) Robotcop is an open-source module for web servers that helps webmasters enforce the disallow rules in the robots.txt file. If a spider reads the robots.txt file and then attempts to load any of the disallowed files, its IP address is firewalled from viewing any pages at all. This has implications for fighting the robots that harvest e-mail addresses for spamming.
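The enforcement idea described above can be sketched in a few lines of Python. This is only an illustration of the logic, not Robotcop's actual code: the class name RobotEnforcer, the handle method, and the sample rules are all hypothetical, and a real deployment would block at the firewall or web server level rather than in application code.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration; a real site would load its own file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


class RobotEnforcer:
    """Toy model: block any client that read robots.txt and then ignored it."""

    def __init__(self, robots_txt):
        self.parser = RobotFileParser()
        self.parser.parse(robots_txt.splitlines())
        self.read_rules = set()  # IPs that have fetched /robots.txt
        self.blocked = set()     # IPs refused from now on

    def handle(self, ip, user_agent, path):
        """Return True if the request should be served, False if refused."""
        if ip in self.blocked:
            return False
        if path == "/robots.txt":
            self.read_rules.add(ip)
            return True
        # A client that has seen the rules and still requests a disallowed
        # path is treated as a misbehaving robot and blocked entirely.
        if ip in self.read_rules and not self.parser.can_fetch(user_agent, path):
            self.blocked.add(ip)
            return False
        return True
```

In this sketch a client that never requested robots.txt is not blocked, which mirrors the description above: only spiders that read the rules and then break them are shut out.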
Using Apache to stop bad robots.
(Link is no Longer Available) Block the spiders that collect e-mail addresses for spamming purposes. Create a “honey pot” for crawlers and list it as disallowed in your robots.txt file. If a crawler requests it anyway, it is caught red-handed! The method is not foolproof, but it is interesting. Firewall rules and Robotcop take this idea to the next level.
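As a rough illustration of the honey-pot idea, the Python sketch below scans access-log lines for requests to a trap directory that robots.txt disallows. It is not the Apache configuration from the linked article: the /trap/ path, the find_offenders function, and the assumption that the log uses a Common Log Format style layout (client address first, request line in double quotes) are all made up for this example.

```python
# Hypothetical honey-pot directory; no legitimate page should link to it.
TRAP_PATH = "/trap/"

# The matching robots.txt entry that tells well-behaved robots to stay out.
ROBOTS_TXT = f"User-agent: *\nDisallow: {TRAP_PATH}\n"


def find_offenders(log_lines, trap_path=TRAP_PATH):
    """Return the set of client addresses that requested the trap path.

    Assumes each line starts with the client address and contains the
    request line in double quotes, e.g. "GET /trap/x.html HTTP/1.0".
    """
    offenders = set()
    for line in log_lines:
        parts = line.split('"')
        if len(parts) < 2:
            continue  # no quoted request line, skip
        request = parts[1].split()  # e.g. ['GET', '/trap/x.html', 'HTTP/1.0']
        if len(request) >= 2 and request[1].startswith(trap_path):
            offenders.add(line.split()[0])
    return offenders
```

The resulting addresses could then be fed into firewall rules. As noted above, the approach is not foolproof: a robot that never requests the trap directory is never caught this way.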
Fighting Spam with DNS
(Link is no Longer Available) Not directly related to robots.txt, but it has implications for fighting spam in general.
RoboGen Links
RoboGen Help File
The documentation for the RoboGen software is also available online.
BORIS THE SPIDER STRIKES AGAIN
(Link is no Longer Available) (Boardwatch Sept. 1999). RoboGen is mentioned on the second page of this article on web spiders by Thom Stark.
Free Webmaster Tools, Issue 11.
(Article no longer available online) The editors for this eZine selected RoboGen as the #1 web site for the week in the July 1999 issue.