How to control search engines and web crawlers using robots.txt

  • August 15, 2017

You can specify which sections of your site you would like search engines and web crawlers to index, and which sections they should ignore. To do this, create your directives in a plain-text file named robots.txt and place that file in your public_html document root directory.
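For example, assuming a typical cPanel-style account layout (the username and domain below are only placeholders), the file would be stored at:

/home/username/public_html/robots.txt

and crawlers would request it at:

http://example.com/robots.txt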

Using robots.txt directives

The directives used in a robots.txt file are straightforward and easy to understand. The most commonly used directives are User-agent, Disallow, and Crawl-delay. Here are some examples:

To exclude all robots from the entire server
User-agent: *
Disallow: /

To allow all robots complete access
User-agent: *
Disallow:

(or just create an empty "/robots.txt" file, or don't use one at all)

To exclude all robots from part of the server
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/

To exclude a single robot
User-agent: BadBot
Disallow: /

To allow a single robot
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /

To exclude all files except one
This is a bit awkward, as the original robots.txt standard has no "Allow" field (although most major crawlers now support one, as shown below). The easy way is to put all files to be disallowed into a separate directory, say "stuff", and leave the one file in the level above this directory:
User-agent: *
Disallow: /~joe/stuff/

Alternatively, you can explicitly disallow each page you want to keep out:
User-agent: *
Disallow: /~joe/junk.html
Disallow: /~joe/foo.html
Disallow: /~joe/bar.html
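If the crawlers you care about support the Allow directive (Google, Bing, and anything following the current Robots Exclusion Protocol standard, RFC 9309, do), the same goal can be reached more directly. This is only a sketch, and the paths are placeholders:

User-agent: *
Allow: /~joe/index.html
Disallow: /~joe/

Because the Allow rule is the longer, more specific match for /~joe/index.html, compliant crawlers fetch that page while skipping everything else under /~joe/.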

List of user-agents

The robots.txt website (robotstxt.org) maintains a database of known web robots and the user-agent names you can use in the directives above; see the robots database on their site for details.

Why would you want to prevent search engines from indexing your content?

Well, first of all, the website might be private and you do not want its files to appear in search results. Secondly, you might need to protect your site from unwanted bots, which can significantly increase your server's resource consumption by flooding it with requests.
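If server load is the main concern, the Crawl-delay directive mentioned earlier can ask well-behaved crawlers to pause between requests. Support varies: Bing and Yandex honor it, while Google ignores it, and the 10-second value below is only an illustration:

User-agent: *
Crawl-delay: 10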

