Rietta: Web Apps Where Security Matters

Robots.txt Tutorial

This product is legacy software that is no longer maintained or supported by Rietta Inc. This page is preserved for historical purposes. See the listing of Rietta’s legacy software for a complete list.

Search engines use spider programs, also known as robots or crawlers, to build the indexes for their search databases. These crawlers follow links to find new and updated pages for the search engine. Before a website is indexed by a search engine’s spider, the special file named “robots.txt” is first retrieved from the website’s document root. So for example, if a search engine’s robot is about to index http://example.com, it will first fetch http://example.com/robots.txt. The format is defined by the Robot Exclusion Standard, and RoboGen is an editor for these files.
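The fetch described above is purely mechanical: the crawler keeps the page’s scheme and host and replaces the path with /robots.txt. A minimal sketch of that step, using only Python’s standard library (the helper name robots_url is ours, not part of any crawler):

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url):
    """Return the robots.txt URL a crawler fetches before indexing page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host, replace the path with /robots.txt, drop query/fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

robots_url("http://example.com/projects/index.html")  # -> "http://example.com/robots.txt"
```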

Simple Example

User-agent: googlebot
Disallow: /images/
Disallow: /projects/
Disallow: /contact.html

# This is a comment

User-agent: *
Disallow: /support/
Disallow: /contact.html
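One way to sanity-check a rule set like the one above is Python’s standard urllib.robotparser module. This sketch assumes the crawler-specific section is for googlebot and that file paths carry a leading slash, as the path-prefix rule below requires:

```python
import urllib.robotparser

rules = """\
User-agent: googlebot
Disallow: /images/
Disallow: /projects/
Disallow: /contact.html

User-agent: *
Disallow: /support/
Disallow: /contact.html
"""

rp = urllib.robotparser.RobotFileParser()
rp.modified()  # mark the rules as loaded so can_fetch() evaluates them
rp.parse(rules.splitlines())

rp.can_fetch("googlebot", "http://example.com/images/logo.png")      # blocked by its own section
rp.can_fetch("googlebot", "http://example.com/support/")             # allowed: /support/ is only in *
rp.can_fetch("SomeOtherBot", "http://example.com/support/faq.html")  # blocked by the * section
```

Note that googlebot may fetch /support/ even though the * section disallows it: a spider with its own section ignores the catch-all rules entirely.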

LDIF Tag Format and Comments

All tags are specified in LDIF format, which means that a tag is specified by a name followed by a colon (:) followed by the value. Only one tag can appear per line. Lines beginning with a pound sign (#) are comments and are ignored.
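Because the format is just one name–colon–value tag per line, parsing it takes only a few lines of code. The helper below is a hypothetical sketch, not part of RoboGen, that reduces a file to (tag, value) pairs while discarding comments and blank lines:

```python
def parse_tags(text):
    """Split robots.txt text into (tag, value) pairs, skipping comments and blanks."""
    pairs = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # strip comments and surrounding space
        if not line or ":" not in line:
            continue  # blank line, pure comment, or malformed line
        tag, value = line.split(":", 1)
        pairs.append((tag.strip().lower(), value.strip()))
    return pairs

parse_tags("# note\nUser-agent: *\nDisallow: /support/\n")
# [('user-agent', '*'), ('disallow', '/support/')]
```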

User-agent

The User-agent tag starts a rule section for a particular spider program. The special * user-agent applies to all spiders, except the spiders for which specific sections exist.

Almost all programs that access web pages, including web browsers such as Internet Explorer and Mozilla, have user-agent names, which can often be seen in the web access log. Defining sections in a robot exclusion file for web browsers and other user-agents that are not automated spider programs has no effect.
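The selection rule described above, where a spider uses its own section if one exists and falls back to * otherwise, can be sketched as follows (rules_for is an illustrative helper, not a real API):

```python
def rules_for(agent, sections):
    """Pick the Disallow rules for a crawler: a section naming the agent wins,
    otherwise the catch-all '*' section applies."""
    agent = agent.lower()
    for name, rules in sections:
        if name != "*" and name.lower() in agent:
            return rules  # crawler-specific section takes precedence
    for name, rules in sections:
        if name == "*":
            return rules  # fall back to the catch-all section
    return []  # no section applies: nothing is disallowed

sections = [
    ("googlebot", ["/images/", "/projects/"]),
    ("*", ["/support/"]),
]
rules_for("googlebot", sections)        # ['/images/', '/projects/']
rules_for("AnotherSpider/1.0", sections)  # ['/support/']
```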

Disallow

A user-agent section contains one or more disallow lines. Each line specifies a file or directory which is not to be indexed by the specified crawler program.

For example, to block the contact.html page at the site root from being included in an index, use the following:

Disallow: /contact.html

To block the contents of the images directory from being indexed use the following:

Disallow: /images/

According to the robot exclusion standard, disallow rules are treated as path prefixes. Disallowing /images therefore blocks every path that starts with /images, so both /images.html and /images/something.jpg would be excluded from any index.
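The prefix matching just described amounts to a simple starts-with test against each disallow rule. A minimal sketch (is_blocked is our own illustrative name):

```python
def is_blocked(path, disallows):
    """A Disallow value blocks every path that starts with it (prefix match)."""
    return any(path.startswith(rule) for rule in disallows)

disallows = ["/images"]
is_blocked("/images.html", disallows)           # True: shares the /images prefix
is_blocked("/images/something.jpg", disallows)  # True: inside the directory
is_blocked("/index.html", disallows)            # False: different prefix
```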

Not a Security Mechanism

It is very important to remember that robots.txt files provide absolutely no security for a web site. Spiders operated by some groups, such as spammers, will simply ignore the contents of the file. Worse, spammers and hackers will look in the robots.txt file for addresses that might otherwise have remained hidden. If you have protected content, you must use actual security mechanisms such as, but not limited to, password-protected directories. Such measures are outside the scope of the robot exclusion standard and this tutorial. Consider yourself reminded!