You can specify which sections of your site you would like search
engines and web crawlers to index, and which sections they should
ignore. To do this, you create directives in a robots.txt file, and
place the robots.txt file in your public_html document root directory.
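For example, on a typical cPanel-style hosting account, the file and its public URL would look something like the following (the username and domain are placeholders; your actual paths depend on your host):
/home/example_user/public_html/robots.txt
https://example.com/robots.txt
Crawlers always request the file from the root of the domain, so it should not be placed in a subdirectory.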
Using robots.txt directives
The directives used in a robots.txt file are straightforward and easy to understand. The most commonly used directives are User-agent, Disallow, and Crawl-delay. Here are some examples, including a Crawl-delay example at the end of the list:
To exclude all robots from the entire server
User-agent: *
Disallow: /
To allow all robots complete access
User-agent: *
Disallow:
(or just create an empty "/robots.txt" file, or don't use one at all)
To exclude all robots from part of the server
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
To exclude a single robot
User-agent: BadBot
Disallow: /
To allow a single robot
User-agent: Google
Disallow:

User-agent: *
Disallow: /
To exclude all files except one
This is currently a bit awkward, as the original robots.txt standard does not define an "Allow" field. The easy way is to put all of the files to be disallowed into a separate directory, say "stuff", and leave the one file in the level above this directory:

User-agent: *
Disallow: /~joe/stuff/

Alternatively, you can explicitly disallow each page that you want to exclude:
User-agent: *
Disallow: /~joe/junk.html
Disallow: /~joe/foo.html
Disallow: /~joe/bar.html
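To slow down all crawlers
The Crawl-delay directive asks a crawler to wait the specified number of seconds between successive requests. Support varies by crawler (Googlebot, for example, ignores this directive), and the 10-second value below is only an illustration:
User-agent: *
Crawl-delay: 10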
List of user-agents
The robots.txt website (www.robotstxt.org) maintains a list of known user-agents that you can use in the directives above. Visit the site to learn more.
Why would someone want to prevent search engines from indexing their content?
First, the website might be private, and you may not want its files to appear in search engine results. Second, you might need to protect your site from unwanted bots that flood it with requests and significantly increase your server's resource consumption.