Everything You Need to Know about Robots.txt


Invented in 1994 by Martijn Koster, the robots.txt protocol, also known as the Robots Exclusion Protocol or the Robot Exclusion Standard, is used to tell search engine crawlers and other web robots which parts of an otherwise publicly accessible website they should not access or index.  Search engines use robots.txt to better categorise and archive websites, and webmasters use it as a way to proofread source code.  While robots.txt can be very helpful, it is a de facto standard reached by consensus, so it relies on search engines voluntarily choosing to obey its instructions.  A web crawling robot can be written to ignore your robots.txt file completely, which can lead to security issues for your website, including the following:

  • Pages listed in the robots.txt file can draw attention to parts of your website you would rather keep private
  • Email addresses can be harvested and used by spammers
  • Anyone can see what parts of your site you are trying to hide, since robots.txt is a publicly available file (see the example below)
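To illustrate that last point, a robots.txt file that tries to hide an administrative area actually advertises its location to anyone who reads the file (the /admin/ path here is purely illustrative):

    User-agent: *
    Disallow: /admin/

Anyone who requests yoursite.com/robots.txt can read this rule and then visit /admin/ directly, because the file only asks crawlers to stay away; it does not enforce anything.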

Ways to Create a robots.txt File

A robots.txt file can be created by any application that is capable of producing a plain text file in the .txt format.  For Microsoft Windows users, this can be done using either Notepad or WordPad.  Macintosh users can use TextEdit to create robots.txt, and users on a Unix or Linux system can use vi or emacs.  Once you have created a robots.txt file, it should be placed in the top-level (root) directory of your web server, so that it can be found at yoursite.com/robots.txt.
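As a starting point, here is a minimal robots.txt file; the directory name is only a placeholder, so substitute whatever you actually want to block:

    # Apply the rules below to all crawlers
    User-agent: *
    # Keep crawlers out of one directory; a Disallow line with no path
    # would instead allow access to everything
    Disallow: /private/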

Why is robots.txt So Important?

The biggest advantage is that robots.txt can help you avoid wasting precious server resources.  When a web crawler crawls your site to index it, any applications that you have, such as a contact form or search box, are requested just as they would be by a browser with a person behind it, which can put stress on your server.  A way around this is to use robots.txt to tell crawlers to ignore the directory containing your scripts, since there is no need for your scripts to be indexed by a search engine (see the example below).
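A record like the following keeps compliant crawlers out of a script directory; /cgi-bin/ is just a common convention, so use whichever directory your scripts actually live in:

    User-agent: *
    Disallow: /cgi-bin/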

You can also use a robots.txt file to cut down on bandwidth usage, since you can stop web crawlers from constantly requesting files that consume a lot of bandwidth, such as images.  By using robots.txt to restrict access to your image directory, that directory and every file within it will be off limits to compliant crawlers, as the next example shows.
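Multiple Disallow lines can sit under a single User-agent record, so the script and image rules can be combined in one file (again, /images/ is a placeholder for your actual image directory):

    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /images/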

Pingler robots.txt Checker

By using web-based tools, such as the Pingler robots.txt checker, you can view the data listed in the robots.txt file of any website.  This can be an incredibly useful SEO tool, since you can use it to verify what search engines will see on your own site as well as check the robots.txt files of other websites to see what they have listed.
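If you would rather run the same kind of check from a script instead of a web tool, Python's standard library ships with a robots.txt parser.  The sketch below is not Pingler's tool; the site URL, path, and user-agent string are arbitrary examples:

    # Minimal sketch using Python's built-in robots.txt parser;
    # the URLs and user-agent string are placeholder examples.
    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("https://www.example.com/robots.txt")
    rp.read()  # download and parse the file

    # Ask whether a crawler identifying itself as "Googlebot"
    # is allowed to fetch a given page on that site
    print(rp.can_fetch("Googlebot", "https://www.example.com/private/page.html"))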

Tags: robots.txt