When a search engine robot visits a site, the first thing it looks for is a robots.txt file. Keep this file in the site's root directory.
Example: http://www.domain.com/robots.txt
This ensures that the robot can find the file and use it correctly. The file tells a robot which parts of the site it may spider. This system is called “The Robots Exclusion Standard”.
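Because the file always lives at the root, its address can be worked out from any page on the same site. The short Python sketch below illustrates this; the function name robots_url and the sample page address are made up for the example.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt address for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; robots.txt sits at the root path.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://www.domain.com/downloads/file.zip"))
# prints: http://www.domain.com/robots.txt
```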
Robots.txt Format
The format of a robots.txt file is simple. Each record consists of a “User-agent:” line followed by one or more “Disallow:” lines.
The “User-agent:” line names the robot the record applies to. It can also be used to refer to all robots at once.
Here are a few examples:
To disallow all robots from indexing a certain folder on a site, use this:
User-agent: *
Disallow: /cgi-bin/
For the User-agent line we used the wildcard “*” in place of a robot name, which tells all robots to obey this record. Once a robot reads it, the robot knows that the /cgi-bin/ folder should not be indexed at all. This includes every file and folder contained in it.
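One way to see the effect of this record is to feed it to Python's standard urllib.robotparser module and ask which addresses a robot may fetch. This is only a sketch of how a well-behaved robot might read the rules; the robot name "anybot" and the sample paths are invented for the example.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /cgi-bin/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def allowed(agent: str, path: str) -> bool:
    """Ask the parsed rules whether agent may fetch path on the site."""
    return parser.can_fetch(agent, "http://www.domain.com" + path)


print(allowed("anybot", "/cgi-bin/search.cgi"))      # False
print(allowed("anybot", "/cgi-bin/tools/form.cgi"))  # False, subfolders are covered too
print(allowed("anybot", "/index.html"))              # True
```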
Naming a specific bot is also allowed, and in most cases it is very useful to site owners who rely on doorway pages or other search engine optimization techniques. It lets a site owner tell one particular spider what to index and what not to index.
Here is an example of restricting access to the /cgi-bin/ from Google:
User-agent: googlebot
Disallow: /cgi-bin/
This time the User-agent line names googlebot instead of using the wildcard “*”. This lets the Google robot know that the record is addressed to it specifically and that it should not index this folder.
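The same kind of check shows that a record addressed to one robot leaves the others unaffected. The sketch below again uses Python's urllib.robotparser; "otherbot" is a made-up robot name standing in for any robot that is not Google's.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: googlebot
Disallow: /cgi-bin/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

url = "http://www.domain.com/cgi-bin/search.cgi"

print(parser.can_fetch("googlebot", url))  # False, the record names this robot
print(parser.can_fetch("otherbot", url))   # True, no record applies to it
```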
White Space & Comments
White space and comment lines can be used, but not every robot handles them the same way. When using a comment, it is always best to put it on a line of its own.
Example 1:
User-agent: googlebot #Google Robot
Example 2:
User-agent: googlebot
#Google Robot
Notice that in the first example the comment sits on the same line as the command, marked by a # followed by the comment text. While this is OK and will be accepted in most cases, a lot of robots may not handle it. So be sure to use Example 2 when adding comments.
In most cases, if Example 1 is used and a robot does not support it, the robot will interpret the line as “googlebot#GoogleRobot” instead of “googlebot” as originally intended.
White space here means blank space at the start of a line. Some robots tolerate it, but it is not a dependable way to comment a line out, so use a # comment on its own line instead.
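A tolerant robot might clean each line before reading it, which is why Example 1 works in many cases. The Python sketch below shows one possible way to do that; clean_line and parse_field are hypothetical helper names, not part of any real robot.

```python
def clean_line(line: str) -> str:
    """Drop a trailing '#' comment and any surrounding white space."""
    return line.split("#", 1)[0].strip()


def parse_field(line: str):
    """Return (name, value) for a command line, or None for anything else."""
    cleaned = clean_line(line)
    if ":" not in cleaned:
        return None  # blank line or a line that held only a comment
    name, value = cleaned.split(":", 1)
    return name.strip().lower(), value.strip()


print(parse_field("User-agent: googlebot #Google Robot"))
# prints: ('user-agent', 'googlebot')
print(parse_field("#Google Robot"))
# prints: None
print(parse_field("   Disallow: /cgi-bin/"))
# prints: ('disallow', '/cgi-bin/')
```

A robot that skips the cleaning step would read the whole remainder of the line as the robot name, which is the failure described above.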
Common Robot Names
Here are a few of the top robot names:
Googlebot – Google.com
Inktomi Slurp – HotBot.com
IA Archiver – Alexa
AskJeeves – AskJeeves.com
These are just a few common robots that will hit a site at any given time.
Examples
The following examples are commonly used commands for robots.txt files:
User-agent: *
Disallow:
This allows all robots to index an entire site. Notice that the “Disallow:” command is blank; this tells robots that nothing is off limits.
User-agent: *
Disallow: /
This tells all robots not to index anything on a site. The “/” in the “Disallow:” line matches the root folder and everything beneath it, so nothing is indexed.
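The difference between a blank “Disallow:” and “Disallow: /” is easy to miss, so here is a small Python sketch that compares the two with the standard urllib.robotparser module. The helper name build and the robot name "anybot" are invented for the example.

```python
from urllib.robotparser import RobotFileParser


def build(rules_text: str) -> RobotFileParser:
    """Parse robots.txt text and return the ready-to-query parser."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser


allow_all = build("User-agent: *\nDisallow:")
block_all = build("User-agent: *\nDisallow: /")

page = "http://www.domain.com/products/index.html"

print(allow_all.can_fetch("anybot", page))  # True, a blank value blocks nothing
print(block_all.can_fetch("anybot", page))  # False, "/" covers the whole site
```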
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /downloads/
Disallow: /admin.php
This tells all robots, as specified by the wildcard in the “User-agent:” line, not to index the cgi-bin, images, and downloads folders. It also blocks the admin.php file, which is located in the root directory. Every path begins with “/” because it is matched from the site root. Files and folders in subdirectories can be listed the same way.
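To check a record with several “Disallow:” lines, the rules can be run through Python's urllib.robotparser module, as in the sketch below. It assumes every path is written with a leading “/”, and the robot name "anybot" and the sample paths are made up for the example.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /downloads/
Disallow: /admin.php
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def allowed(path: str) -> bool:
    """Check a path on the example site against the rules above."""
    return parser.can_fetch("anybot", "http://www.domain.com" + path)


print(allowed("/images/logo.gif"))     # False
print(allowed("/downloads/file.zip"))  # False
print(allowed("/admin.php"))           # False
print(allowed("/blog/admin.php"))      # True, only the root-level file is listed
print(allowed("/about.html"))          # True
```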
User-agent: googlebot
Disallow: /cgi-bin/
This tells Googlebot not to index the cgi-bin folder.
Conclusion
More information on robots.txt files can be found at Robotstxt.org. Most major sites use a robots.txt file. To find out whether a site uses one, enter its URL and add /robots.txt to the end. If the file exists, it is displayed in plain text so anyone can read it. Remember that a robots.txt file is optional. It’s mainly used to tell spiders what to index and what not to index, so if everything on a site is to be indexed, the file isn’t needed.
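As a rough illustration of that last point, Python's urllib.robotparser module treats an empty set of rules as permission to fetch everything. The sketch below assumes an empty rule list stands in for a site that has no robots.txt content at all; the sample addresses are made up.

```python
from urllib.robotparser import RobotFileParser

# An empty list of lines stands in for a site with no rules.
no_rules = RobotFileParser()
no_rules.parse([])

print(no_rules.can_fetch("googlebot", "http://www.domain.com/cgi-bin/search.cgi"))  # True
print(no_rules.can_fetch("googlebot", "http://www.domain.com/index.html"))          # True
```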
