How to control search engines and web crawlers using robots.txt

  • August 15, 2017

You can specify which sections of your site you would like search engines and web crawlers to index, and which sections they should ignore. To do this, create your directives in a plain-text file named robots.txt and place that file in your public_html document root directory.
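For example, assuming a typical cPanel-style account layout (the username and domain below are only placeholders), the file would be stored at:

/home/username/public_html/robots.txt

and crawlers would request it at:

http://example.com/robots.txt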

Using robots.txt directives

The directives used in a robots.txt file are straightforward and easy to understand. The most commonly used directives are User-agent, Disallow, and Crawl-delay. Here are some examples:

To exclude all robots from the entire server
User-agent: *
Disallow: /

To allow all robots complete access
User-agent: *
Disallow:

(or just create an empty "/robots.txt" file, or don't use one at all)

To exclude all robots from part of the server
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/

To exclude a single robot
User-agent: BadBot
Disallow: /

To allow a single robot
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /

To exclude all files except one
This is a bit awkward, as the original robots.txt standard has no "Allow" field (although most major crawlers now support one, as shown below). The easy way is to put all files to be disallowed into a separate directory, say "stuff", and leave the one file in the level above this directory:
User-agent: *
Disallow: /~joe/stuff/

Alternatively, you can explicitly disallow each page you want to keep out:
User-agent: *
Disallow: /~joe/junk.html
Disallow: /~joe/foo.html
Disallow: /~joe/bar.html
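If the crawlers you care about support the Allow directive (Google, Bing, and anything following the current Robots Exclusion Protocol standard, RFC 9309, do), the same goal can be reached more directly. This is only a sketch, and the paths are placeholders:

User-agent: *
Allow: /~joe/index.html
Disallow: /~joe/

Because the Allow rule is the longer, more specific match for /~joe/index.html, compliant crawlers fetch that page while skipping everything else under /~joe/.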

List of user-agents

The robots.txt website (robotstxt.org) maintains a database of known web robots and the user-agent names you can use in the directives above; see the robots database on their site for details.

Why would you want to prevent search engines from indexing your content?

Well, first of all, the website might be private and you do not want its files to appear in search results. Secondly, you might need to protect your site from unwanted bots, which can significantly increase your server's resource consumption by flooding it with requests.
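If server load is the main concern, the Crawl-delay directive mentioned earlier can ask well-behaved crawlers to pause between requests. Support varies: Bing and Yandex honor it, while Google ignores it, and the 10-second value below is only an illustration:

User-agent: *
Crawl-delay: 10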

