When I started building my first website, I did not think I would ever have a reason to
create a robots.txt file. After all, did I not want search engine robots to spider and
thus index every document on my site? Yet today, all my sites have a robots.txt file in
their root directory. This article explains why you might also want to include a
robots.txt file on your sites, how you can do so, and notes some common mistakes made by
new webmasters with regard to the robots.txt file.
For those new to the robots.txt file, it is merely a text file implementing
what is known as the Standard for Robot Exclusion. It is placed in the main
directory of a website and advises spiders and other robots which directories or files
they should not access. The file is purely advisory: not all spiders bother to read it,
let alone heed it. However, most, if not all, of the spiders sent by the major search
engines to index your site will read it and respect the rules it contains.
Why is a Robots.txt File Important?
What is the purpose of a robots.txt file?
- It Can Avoid Wastage of Server Resources
At the time of writing, as far as I know, many of the search engine spiders do
not bother to index the scripts on your site (such as your CGI or PHP scripts). However,
there are those that do, including one of the major players, Google.
Robots or spiders that index scripts will actually call your scripts just as a
browser would, complete with all the special characters. If your site
is like mine, where the scripts are solely meant for the use of humans and serve no
practical use for a search engine (why should a search engine need to invoke my
site-navigation script? It can just crawl the direct links), you may want to block
spiders from the directories that contain your scripts. For example, I block spiders from
my CGI-BIN directory. Hopefully, this reduces the load on the web server by removing
unnecessary script executions.
Of course there are the occasional ill-behaved robots that hit your server at high
speed. Such spiders can actually bring down your server or at the very least slow it down
for the real users who are trying to access it. If you know of any such spiders, you
might want to exclude them too. You can do this with a robots.txt file. Unfortunately
though, ill-behaved spiders often ignore robots.txt files as well.
- It Can Save Your Bandwidth
If you look at your website's web logs, you will undoubtedly find many requests for
the robots.txt file by various search engine spiders. If, like me, you have a customized
404 document (which loads each time a visitor tries to retrieve a page that does not
exist on your site) and no robots.txt file, the robot will wind up requesting that
document instead. My site has a fairly
large 404 document, with the result that the spiders wind up loading it repeatedly
throughout the day, adding to my already large bandwidth problems. In such a case, having
a small robots.txt file may save you some bandwidth (yeah, I know, it's not that
much).
Some spiders may also request files that you feel they should not. For example,
one search engine requests graphic files (".gif" files) on my sites. Since I see
little reason why I should let it index the graphics on my site, waste my bandwidth, and
possibly infringe my copyright, I ban it (and in fact all spiders) from my graphic files
directory in my robots.txt file.
- It Removes Clutter from your Web Statistics
I don't know about you, but one of the things I check from my web statistics is the
list of URLs that visitors tried to access, but met with a 404 File Not Found Error.
Often this tells me if I made a spelling error in one of the internal links on one of my
sites (yes, I know - I should have checked all links in the first place, but mistakes do
happen).
If you don't have a robots.txt file, you can be sure that /robots.txt is going to
feature in your web statistics 404 report, adding clutter and perhaps unnecessarily
distracting your attention from the real bad URLs that need your attention.
- Refusing a Robot
Sometimes you don't want a particular spider to index your site for some reason or
other. Perhaps the robot is ill-behaved and spiders your site at such a high speed that
it takes down your entire server. Or perhaps you would prefer that the images on
your site not be indexed by an image search engine. With a robots.txt file, you can
exclude certain spiders from indexing your site, provided the spider obeys the rules in
that file.
How to Set Up a Robots.txt File
Writing a robots.txt file could not be easier. It's just an ASCII text file that you
place at the root of your domain. For example, if your domain is www.yourdomain.com, you
will place the file at www.yourdomain.com/robots.txt.
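To make the "root of your domain" rule concrete, here is a minimal Python sketch that
works out where a spider would look for robots.txt, given any page on a site. The helper
name robots_txt_url and the sample page URL are my own illustrations, not part of any
standard.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt location for the site that serves page_url.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(page_url)
    # Keep the scheme and host, discard the page's own path and query string:
    # the file always sits at the root of the domain.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://www.yourdomain.com/articles/page.html"))
# prints http://www.yourdomain.com/robots.txt
```

Whatever directory a page lives in, the answer is the same file at the top of the site.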
The file basically lists the name of a spider on one line, followed by the list of
directories or files it is not allowed to access on subsequent lines, with each directory
or file on a separate line. It is possible to use the wildcard character "*" instead of
naming specific spiders. When you do so, the rules apply to all spiders. Note that
the robots.txt file is a robots exclusion file (with emphasis on the "exclusion"): the
original standard provides no way to tell spiders to include a file or directory.
Take the following robots.txt file for example:
User-agent: *
Disallow: /cgi-bin/
The above two lines, when inserted into a robots.txt file, inform all robots (since
the wildcard asterisk "*" character was used) that they are not allowed to access
anything in the cgi-bin directory and its descendants. That is, they are not allowed to
access /cgi-bin/whatever.cgi or even a file or script in a subdirectory of cgi-bin, such
as /cgi-bin/anything/whichever.cgi.
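If you want to check how a well-behaved robot might read the two-line file above, Python's
standard urllib.robotparser module can evaluate the rules. This is only a sketch of one
parser's interpretation; the robot name "ExampleBot" and the helper allowed() are
made up for the example, and real spiders may behave differently.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /cgi-bin/
"""


def allowed(rules: str, agent: str, path: str) -> bool:
    """Hypothetical helper: may `agent` fetch `path` under these rules?"""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, path)


# "ExampleBot" is an invented robot name; the "*" record applies to it.
for path in ("/index.html",
             "/cgi-bin/whatever.cgi",
             "/cgi-bin/anything/whichever.cgi"):
    print(path, allowed(RULES, "ExampleBot", path))
```

Under this parser, /index.html is allowed, while both paths under /cgi-bin/ are refused.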
If you have a particular robot in mind, such as the Google image search robot, which
collects images on your site for the Google Image search engine, you may include lines
like the following:
User-agent: Googlebot-Image
Disallow: /
This means that the Google image search robot, "Googlebot-Image", should not try to
access any file in the root directory "/" and all its subdirectories. This effectively
means that it is banned from getting any file from your entire website.
You can have multiple Disallow lines for each user agent (i.e., for each spider). Here
is an example of a longer robots.txt file:
User-agent: *
Disallow: /images/
Disallow: /cgi-bin/

User-agent: Googlebot-Image
Disallow: /
The first block disallows all spiders from the images directory and the
cgi-bin directory. The second block disallows the Googlebot-Image spider from every
directory.
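The effect of having one block for everybody and another for a named robot can be
sketched with Python's urllib.robotparser. As before, "ExampleBot" is an invented
robot name and allowed() is a hypothetical helper; the output shows how this one
parser reads the file, not how every spider will.

```python
from urllib.robotparser import RobotFileParser

# Two records, separated by a blank line.
RULES = """\
User-agent: *
Disallow: /images/
Disallow: /cgi-bin/

User-agent: Googlebot-Image
Disallow: /
"""


def allowed(rules: str, agent: str, path: str) -> bool:
    """Hypothetical helper: may `agent` fetch `path` under these rules?"""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, path)


# An ordinary robot falls under the "*" record.
print(allowed(RULES, "ExampleBot", "/index.html"))         # True
print(allowed(RULES, "ExampleBot", "/images/logo.gif"))    # False
# The named robot matches its own record and is shut out entirely.
print(allowed(RULES, "Googlebot-Image", "/index.html"))    # False
```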
It is possible to exclude a spider from indexing a particular file. For example, if
you don't want Google's image search robot to index a particular picture, say,
mymugshot.jpg, you can add the following:
User-agent: Googlebot-Image
Disallow: /images/mymugshot.jpg
Remember to add the trailing slash ("/") if you are indicating a directory. If you
simply add
User-agent: *
Disallow: /privatedata
the robots will be disallowed from accessing privatedata.html and
privatedataandstuff.html, as well as the directory tree beginning from /privatedata/ (and
so on). In other words, there is an implied wildcard character following whatever you
list in the Disallow line.
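The difference the trailing slash makes is easy to see side by side. The sketch below
again uses Python's urllib.robotparser as a stand-in for a rule-abiding robot; the robot
name "ExampleBot", the helper allowed() and the file name /privatedata/file.html are
assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

NO_SLASH = "User-agent: *\nDisallow: /privatedata\n"
WITH_SLASH = "User-agent: *\nDisallow: /privatedata/\n"

PATHS = ("/privatedata.html",
         "/privatedataandstuff.html",
         "/privatedata/file.html")


def allowed(rules: str, agent: str, path: str) -> bool:
    """Hypothetical helper: may `agent` fetch `path` under these rules?"""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, path)


for path in PATHS:
    # Without the slash the rule is a plain prefix, so all three are blocked.
    # With the slash only the file inside the directory is blocked.
    print(path,
          allowed(NO_SLASH, "ExampleBot", path),
          allowed(WITH_SLASH, "ExampleBot", path))
```

With this parser, the first form blocks all three paths, while the second form blocks only
/privatedata/file.html.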
Where Do You Get the Name of the Robots?
If you have a particular spider in mind that you want to block, you have to find out
its name. The best way to do this is to check the website of the search engine.
Respectable engines will usually have a page somewhere that gives you details on how you
can prevent their spiders from accessing certain files or directories.
Common Mistakes in Robots.txt
Here are some mistakes commonly made by those new to writing robots.txt rules.
- It's Not Guaranteed to Work
As mentioned earlier, although the robots.txt format is listed in a document called "A
Standard for Robots Exclusion", not all spiders and robots actually bother to heed it.
Listing something in your robots.txt is no guarantee that it will be excluded. If you
really need to protect something, you should use a .htaccess file to password-protect the
directory (if you are running your site on an Apache server).
- Don't List Your Secret Directories
Anyone can access your robots file, not just robots. For example, typing
http://www.google.com/robots.txt will get you Google's own robots.txt file. I notice that
some new webmasters seem to think that they can list their secret directories in their
robots.txt file to prevent that directory from being accessed. Far from it. Listing a
directory in a robots.txt file often attracts attention to the directory. In fact, some
spiders (like certain spammers' email harvesting robots) make it a point to check the
robots.txt for excluded directories to spider.
- Only One Directory/File per Disallow line
Don't try to be smart and put multiple directories on your Disallow line. This will
probably not work the way you think, since the Robots Exclusion Standard only provides
for one directory per Disallow statement.
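To illustrate the last mistake, here is a small sketch of what can happen when two
directories share a Disallow line. It uses Python's urllib.robotparser, which treats
everything after "Disallow:" as a single path; other robots may handle such a line
differently, so take this as one possible outcome. "ExampleBot" and allowed() are
invented for the example.

```python
from urllib.robotparser import RobotFileParser

WRONG = "User-agent: *\nDisallow: /images/ /cgi-bin/\n"
RIGHT = "User-agent: *\nDisallow: /images/\nDisallow: /cgi-bin/\n"

PATHS = ("/images/logo.gif", "/cgi-bin/script.cgi")


def allowed(rules: str, agent: str, path: str) -> bool:
    """Hypothetical helper: may `agent` fetch `path` under these rules?"""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, path)


for path in PATHS:
    # In the combined form the rule is one odd path containing a space,
    # so neither real directory matches it and nothing is blocked.
    print(path,
          allowed(WRONG, "ExampleBot", path),
          allowed(RIGHT, "ExampleBot", path))
```

Here the combined line blocks neither directory, while the one-per-line version blocks both.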
It's Worth It
Even if you want all your directories to be accessed by spiders, a simple robots file
with the following may be useful:
User-agent: *
Disallow:
With no file or directory listed in the Disallow line, you're implying that every
directory on your site may be accessed. At the very least, this file will save you a few
bytes of bandwidth each time a spider visits your site (or more if your 404 file is
large); and it will also remove robots.txt from your web statistics bad referral links
report.
by Christopher Heng