Learning The Basic Terminology Of Robots.txt Files
By Patrick Hare
Article Date: 2009-06-23
On many occasions customers come to us with the complaint that they can't be found. They either had rankings on all search engines and suddenly disappeared, or never were seen in the first place. Believing that they are the victims of a ban in the search engines, they come to us for search engine optimization advice. In many cases, the culprit is found in the robots.txt file, in the form of the classic:
User-agent: *
Disallow: /
(Special Note: Using this command will make your site disappear from the search engines!) The forward slash after Disallow tells the engines to ignore all files. The solution to this problem is to delete the forward slash, which tells search engines that everything is fair game. If you use Google Webmaster Tools, you will be told that the robots file prevents the indexing of your site. Many times a webmaster will upload this file accidentally, or forget to take it down when a dev site goes live. The command effectively tells every honest search engine spider to stop reading your site and go away. Note that unethical spiders that scrape for phone numbers, email addresses, and content will not even bother to look at your robots.txt file, unless they are programmed to look for the files you don't want found. If you are looking to block spiders run by dishonest people on the internet, the robots.txt file is probably not going to help you, so you should look to server-level exclusions.
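One way to see the difference that single forward slash makes is to run both versions of the file through a parser. The sketch below uses Python's standard-library urllib.robotparser; the helper name is_allowed, the "Googlebot" agent name, and the example.com URL are assumptions chosen for illustration, and real search engines may treat edge cases differently.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Return True if a well-behaved spider named `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# The "classic" mistake: the slash blocks the whole site.
BLOCK_ALL = "User-agent: *\nDisallow: /"
# The fix: an empty Disallow means nothing is off limits.
ALLOW_ALL = "User-agent: *\nDisallow:"

URL = "http://www.example.com/index.html"  # hypothetical page
print(is_allowed(BLOCK_ALL, URL))  # False
print(is_allowed(ALLOW_ALL, URL))  # True
```

A check like this only covers spiders that honor the file; as noted above, scrapers that ignore robots.txt are unaffected by either version.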
Depending on the complexity of your site, the robots.txt file can be modified to support your SEO initiatives. If you have a series of pages in a shopping cart, forum, or section that you want to exclude, you can disallow a specific directory:
Disallow: /Example
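It can help to know that a Disallow path is matched as a case-sensitive prefix, so it may cover more (or less) than the directory you had in mind. The following is a minimal sketch using Python's standard-library urllib.robotparser; the User-agent line is added because a Disallow rule only takes effect under one, and the paths and the helper name blocked are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /Example
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def blocked(path: str) -> bool:
    """Return True if the rule above keeps a compliant spider off `path`."""
    return not parser.can_fetch("Googlebot", "http://www.example.com" + path)


# Hypothetical paths: the rule matches anything that starts with /Example.
for path in ["/Example/cart.html", "/Examples.html",
             "/example/cart.html", "/about.html"]:
    print(path, "blocked" if blocked(path) else "allowed")
```

With this parser, /Examples.html is caught by the prefix match while the lowercase /example/cart.html is not, so it is worth checking the exact spelling of the directory before relying on the rule.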
If you have multiple directories, you would just add them to the list:
User-agent: *
Disallow: /Example
Disallow: /secret_plans
Disallow: /things_we_do_not_want_the_world_to_know
Alternatively, you can use a newer wildcard format that disallows pages with certain phrases or string segments in them. If you wanted to disallow all the pages with a session ID in them, you could use a command that says:
Disallow: /*sessionid
Keep in mind that this will effectively shut search engines out of these pages, so you should ensure that your string is long enough that it does not accidentally blind the engines to pages that you want found. The wildcard robots disallow is ideal for people who may have bought sites and then found out that the site was a parked domain with thousands of "junk" pages installed by a previous owner. Even if those pages are no longer on your site, it can take months for Google to notice that they no longer exist. By excluding them in your robots file, the removal of those cached pages can take less time.
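To get a feel for how much a wildcard rule can catch, you can test candidate strings against a list of your own URLs before uploading the file. Python's standard urllib.robotparser does plain prefix matching and does not appear to interpret the * wildcard, so the sketch below translates the rule into a regular expression instead. The function name is_blocked and the sample paths are hypothetical, and this is only an approximation of how an engine might read the rule.

```python
import re


def is_blocked(rule: str, path: str) -> bool:
    """Approximate wildcard matching: * stands for any run of characters,
    and the rule must match from the start of the path."""
    pattern = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.match(pattern, path) is not None


# Hypothetical URLs from a site with session IDs in some links.
paths = [
    "/cart/view.php?sessionid=abc123",
    "/videos/intro.html",
    "/products/shoes.html",
]

for rule in ["/*sessionid", "/*id"]:
    hits = [p for p in paths if is_blocked(rule, p)]
    print(rule, "->", hits)
```

In this sample, the longer string only catches the session-ID URL, while the short "/*id" rule also catches /videos/intro.html because "videos" happens to contain "id", which is the kind of accidental exclusion a longer string avoids.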
In the past, people have disallowed the /images directory but normally we don't recommend this. Image and universal search features on search engines allow for your images to get indexed, and this leads to traffic. One of our clients made a substantial number of sales based on image search, so excluding this directory should be done with some thought.
If you want to exclude certain search engines, or direct them away from certain directories, it is easy to set up separate exclusion protocols in the file. For instance, excluding Yahoo! (which uses the "Slurp" robot) from seeing a directory would be done this way:
User-agent: slurp
Disallow: /Example
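A quick way to confirm that a rule like this applies to one robot and not the others is to ask a parser on behalf of each one. This sketch again relies on Python's standard-library urllib.robotparser; the agent names are passed as short tokens ("Slurp", "Googlebot") and the URL is a made-up example, so treat it as an illustration and not as a statement of how Yahoo! itself evaluates the file.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: slurp
Disallow: /Example
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

URL = "http://www.example.com/Example/page.html"  # hypothetical page

# Only the robot named in the User-agent line is turned away.
slurp_allowed = parser.can_fetch("Slurp", URL)
googlebot_allowed = parser.can_fetch("Googlebot", URL)

print("Slurp:", slurp_allowed)          # Slurp: False
print("Googlebot:", googlebot_allowed)  # Googlebot: True
```

Because no "User-agent: *" section exists in this file, robots that are not named in it fall through with no restrictions.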
If you want to see a list of all the useragents (a useragent is essentially the name of the robot) you can exclude, this site has a nice database.
Finally, you may want to tell the search engines about your XML Sitemap, if you haven't already submitted it through Webmaster Tools. Doing this is easy, since all you have to do is add the command:
Sitemap: http://www.example.com/sitemap.xml
This line goes at the bottom of the robots.txt file.
For most people with a normal site, the whole robots.txt file should look like this:
User-agent: *
Disallow:
Sitemap: http://www.example.com/sitemap.xml
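If you would like to double-check that a file like this both leaves the site open and advertises the sitemap, a short script can read it back. The sketch below assumes Python 3.8 or later, where urllib.robotparser provides site_maps(); the page URL and the "Googlebot" agent name are illustrative assumptions.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow:
Sitemap: http://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# An empty Disallow leaves every page open to compliant spiders.
everything_allowed = parser.can_fetch(
    "Googlebot", "http://www.example.com/any/page.html"
)

# site_maps() returns the listed Sitemap URLs, or None if there are none.
sitemaps = parser.site_maps()

print(everything_allowed)  # True
print(sitemaps)            # ['http://www.example.com/sitemap.xml']
```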
There are quite a few great online resources that will guide you through tips and tricks regarding the robots.txt file. Smaller sites only need a basic robots.txt file, which only needs to be modified if search engines have trouble finding pages, or crawl too deep and need to be excluded, slowed down, or properly directed. Larger sites will want to look into protocols that can keep the search engines on the right pages, prevent duplicate content issues, and even keep unnecessary files from getting added to search results. Even though the robots.txt file gets overlooked by many webmasters, we have seen that Google and other engines may look at it many times a day, so any big changes to your site should at least include a review of the robots file. Finally, if your site vanishes from all three major search engines at the same time, the robots.txt file should be the first place you look, before you start shopping for a new webmaster.
[Note: For more advanced information on the Robots Exclusion Standard, Wikipedia has some good information on this topic.]
About the Author:
Patrick Hare has been managing online and offline marketing projects since 1999. From 2005 to present, he has been with Scottsdale Arizona's Web.com Search Agency (formerly Submitawebsite). Patrick provides Search Engine Optimization and Marketing advice to in-house customers and Web.com Jacksonville’s web design group.