Friday, February 10, 2012

Creating a robots.txt file


Website owners who wish to instruct search engines on which pages to crawl and index on their websites can use a robots.txt file to do so. It works like this:
a search engine robot wants to visit a website URL, say http://www.yourdomain.com/welcome.html. Before it does so, it first checks http://www.yourdomain.com/robots.txt to see if there are specific directives to follow. Let's say it finds the following code within the robots.txt file:

User-agent: *
Disallow: /


The "User-agent: *" means this is a directive for all robots. The * symbol means all. The "Disallow: /" tells the robot that it should not visit any pages on the site. There are two important considerations when using /robots.txt:
Robots can ignore your /robots.txt. Especially rogue robots that scan the web for security vulnerabilities to infect sites with malware, and email address harvesters used by spammers will pay no attention.
The robots.txt file is available to the public to view. Anyone can see what sections of your server you don't want robots to use so don't try to hide information within the file.
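
For example, here is a minimal sketch of how a well-behaved crawler applies the "block everything" rules above before fetching a page. It uses Python's standard urllib.robotparser module; the crawler name "MyCrawler" and the yourdomain.com URL are placeholders for illustration:

# Sketch: a polite crawler checks the robots.txt rules before fetching a page.
from urllib import robotparser

rules = """\
User-agent: *
Disallow: /
"""

parser = robotparser.RobotFileParser()
parser.parse(rules.splitlines())

# Every page is disallowed, so a robot that honors the file will not fetch this URL.
print(parser.can_fetch("MyCrawler", "http://www.yourdomain.com/welcome.html"))  # prints False
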
Creating a robots.txt file
Your robots.txt file should be at the top level of the document root for your domain. In our server's configuration this is the public_html folder in your account. If your domain is "yourdomain.com", bots will look for the file at http://yourdomain.com/robots.txt, so it is imperative to place the file correctly. If you have add-on domains and want to use a robots.txt file for those as well, you will need to place a robots.txt file in the folder you specified as the root of each add-on domain.
The robots.txt file is a basic text file with one or more records. Let's go over the basics. You will need a Disallow line for every URL prefix you want to exclude, and you cannot have blank lines within a record, since blank lines are used to separate records.
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~test/

In the example above we have told ALL robots (remember, the * means all) not to crawl three directories on the site (cgi-bin, tmp, ~test). You can exclude whatever directories you wish, depending on how your website is structured. Anything you do not explicitly exclude is understood to be crawlable, and the bot has permission to visit it.
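
To see how these prefix rules behave, here is a small sketch (again using Python's urllib.robotparser, with a hypothetical crawler name) showing that a listed directory is blocked while an unlisted page stays crawlable:

from urllib import robotparser

# The three-directory example from above, parsed in memory.
rules = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~test/
"""

parser = robotparser.RobotFileParser()
parser.parse(rules.splitlines())

# /tmp/ is listed, so anything under it is off-limits.
print(parser.can_fetch("MyCrawler", "http://www.yourdomain.com/tmp/cache.html"))  # prints False
# /about.html is not listed anywhere, so the bot is free to crawl it.
print(parser.can_fetch("MyCrawler", "http://www.yourdomain.com/about.html"))  # prints True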

To exclude ALL bots from crawling the ENTIRE server:
User-agent: *
Disallow: /

To allow ALL bots to crawl the ENTIRE server:
User-agent: *
Disallow:

To exclude A SINGLE bot from crawling the ENTIRE server:
User-agent: BadBot
Disallow: /

To allow A SINGLE bot to crawl the ENTIRE server:
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /

To exclude ALL bots from crawling the ENTIRE server except for one file:
This can be tricky, since the original robots.txt standard has no 'Allow' directive (though some major search engines support one as an extension). The workaround is to place all the files you do not want crawled into one folder and leave the file you do want crawled in the directory above it. So if we placed all the files we didn't want crawled in a folder called MISC, we'd write the robots.txt rule like this:
User-agent: *
Disallow: /MISC

Or you can exclude each individual file like this:
User-agent: *
Disallow: /MISC/junk.html
Disallow: /MISC/family.html
Disallow: /MISC/home.html

To create a Crawl Delay for the ENTIRE server:
An alternative to blocking a search engine is to ask its robots not to crawl your site as quickly as they normally would. This is known as a crawl delay. It's not part of the official robots.txt standard, but it is an extension that many popular search engines honor. This example specifies that robots crawling your site should make no more than one request every 12 seconds:
User-agent: *
Crawl-delay: 12
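
A robot that honors this extension can read the value back before scheduling its requests. Here is a sketch using Python's urllib.robotparser (the crawl_delay() method requires Python 3.6 or newer; the crawler name is a placeholder):

from urllib import robotparser

rules = """\
User-agent: *
Crawl-delay: 12
"""

parser = robotparser.RobotFileParser()
parser.parse(rules.splitlines())

# A polite crawler would wait this many seconds between requests.
print(parser.crawl_delay("MyCrawler"))  # prints 12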
