Best Robots.txt Guide to Get Higher Index Rate

You may have heard of this very small file without realizing how powerful its instructions are: a single wrong rule can block all search engine crawler bots, which can push your content out of the index and drag your site to the bottom of the search results.
It is wonderful when search engines visit your site frequently and keep indexing your newly published content, bringing you fresh organic traffic. But there are some areas of your site that crawler bots should stay out of, either to prevent duplicate content issues or because they hold sensitive data that you don’t want shown to anyone or indexed.
Robots.txt is the simplest way to tell search engine crawler bots where they are allowed to go and where they are not.

What is Robots.txt File?

Robots.txt is a text file created by webmasters to guide search engine crawler bots (Google, Yahoo, Bing, Ask, AOL, Baidu, and Yandex) on how to crawl and index their pages. It’s a very simple text file placed in the root folder (directory) of your site.
It uses the Robots Exclusion Protocol (also known as the Robots Exclusion Standard), a set of simple command lines that websites use to communicate with web crawlers and other web robots.

Hint: The robots.txt file should be reachable at “https://example.com/robots.txt”


Should You Have a Robots.txt?

If you want search engines to crawl and index your whole website, then you don’t need a robots.txt file at all. Or, as Google’s documentation said: “You only need a robots.txt file if your site includes content that you don’t want Google or other search engines to index.”

But there are many good reasons for using a robots.txt file:

You need to hide some content from search engines
You have sensitive data that you don’t want shown to the world or indexed
You have a download page for products or applications and you don’t want Google to find it
You have redirect rules created by a WordPress plugin and need to hide these redirect pages from bots
Your site is live but still in development, and you don’t want crawler bots to find or index it yet
You have two versions of your site (one for browsing and one for printing), and you need to exclude the printing version from crawling

As you can see, a robots.txt file can carry a long list of instructions for how bots access your site; it all depends on your needs.

Limitations and Instructions

As mentioned previously, adding wrong commands to the robots.txt file can badly hurt how your site is indexed. So you have to learn the basics of robots.txt files to guide search engine bots correctly.

User-agent:

This is the name of the robot that the rules below it apply to.

User-agent: *
It means that all robots from all search engines should follow the directives below.

User-agent: Googlebot
It means that the directives below apply only to Google’s main crawler.

Other Google user agents
User-agent: Googlebot-Image
User-agent: AdsBot-Google
User-agent: Googlebot-Mobile
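
How a robot picks the group of rules meant for it can be sketched with Python’s standard urllib.robotparser module. This is only an illustration under stated assumptions: the rule set, the “/private/” path and the “OtherBot” name are hypothetical, and Python’s parser matches agent names more loosely than Google does, so treat the result as a rough local check.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: one group for Googlebot, one for every other robot.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /
"""

def can_crawl(agent, path, robots_txt=ROBOTS_TXT):
    """Return True if `agent` may fetch `path` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # example.com is a placeholder host; only the path matters here.
    return parser.can_fetch(agent, "https://example.com" + path)

if __name__ == "__main__":
    checks = [
        ("Googlebot", "/page.html"),
        ("Googlebot", "/private/notes.html"),
        ("OtherBot", "/page.html"),
    ]
    for agent, path in checks:
        print(agent, path, can_crawl(agent, path))
```

Googlebot is matched by its own group, so only “/private/” is closed to it, while any other robot falls back to the “*” group and is blocked everywhere.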

Disallow:

Any path that follows the “Disallow” command will not be crawled or accessed by the robots the rule applies to.
So you have to be very careful with this command, because a wrong value can do a lot of harm. At the same time, it’s helpful for excluding certain folders on your site from the index.

For example: you have a folder called “log” and you don’t want it to be seen or found by any robots

User-agent: *
Disallow: /log

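These two lines tell all robots (User-agent: *) NOT to access or index the “log” folder (Disallow: /log). Note that the rule matches by path prefix, so “/log” also covers paths such as “/login”; write “/log/” if you want to match only the folder.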
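
The effect of these two lines can be checked locally with Python’s standard urllib.robotparser module. This is a minimal sketch, not a replacement for a search engine’s own tester; the file names under “/log” and the “/blog/” and “/login” paths are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# The two lines from the example above.
RULES = """\
User-agent: *
Disallow: /log
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

def is_allowed(path, agent="*"):
    """Return True if `agent` may fetch `path` (example.com is a placeholder)."""
    return parser.can_fetch(agent, "https://example.com" + path)

if __name__ == "__main__":
    for path in ["/log/access.txt", "/login", "/blog/post.html"]:
        print(path, is_allowed(path))
```

Because matching is done on the start of the path, “/login” is blocked together with the “log” folder, while “/blog/post.html” is not affected.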

Allow:

Everything that comes after the “Allow” command can be crawled and discovered by the robots the rule applies to.
Let’s extend the example above for the “log” folder: there is also an image file called “round.png” inside the “photos” folder that you use to display something in your layout.

User-agent: *
Disallow: /log
Allow: /photos/round.png

The three lines above give the following instructions:

I am talking to all robots (User-agent: *)
Do not access or index the “log” folder (Disallow: /log)
Access, index and display the “round.png” image (Allow: /photos/round.png)
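
“Allow” is most useful as an exception to a broader “Disallow”. The sketch below uses a hypothetical variation of the example in which the whole “photos” folder is blocked except “round.png”; the “other.png” file is also made up. The “Allow” line is written first on purpose: Python’s urllib.robotparser applies the first rule that matches, while Google documents picking the most specific rule, and with this ordering both approaches agree.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical variation: block the folder, but keep one image reachable.
RULES = """\
User-agent: *
Allow: /photos/round.png
Disallow: /photos/
Disallow: /log
"""

def allowed_paths(paths, robots_txt=RULES, agent="*"):
    """Return the subset of `paths` that `agent` may fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [
        p for p in paths
        if parser.can_fetch(agent, "https://example.com" + p)
    ]

if __name__ == "__main__":
    print(allowed_paths(["/photos/round.png", "/photos/other.png", "/log/a.txt"]))
```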

See, these are very simple commands, and they are flexible enough to control access to your site however you need.

Sitemap:

It’s common practice to add a link to your XML sitemap (not the HTML one) at the end of your robots.txt file so that search engines discover it quickly, as follows:

Sitemap: https://YourDomain.com/sitemap.xml
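
To confirm that a “Sitemap” line is written in a form that parsers pick up, one option is the site_maps() method of Python’s standard urllib.robotparser (available from Python 3.8). This is a small sketch that reuses the placeholder URL from above; it only reads the line and does not check that the sitemap itself exists.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /log

Sitemap: https://YourDomain.com/sitemap.xml
"""

def listed_sitemaps(robots_txt):
    """Return the sitemap URLs declared in the robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap line is present.
    return parser.site_maps() or []

if __name__ == "__main__":
    print(listed_sitemaps(RULES))
```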

How to Create a Robots.txt File?

You don’t need much experience to make a robots.txt file. It’s a very simple text file that you can create with any plain text editor, such as Windows Notepad, and then upload to the root folder of your site (the same directory as the .htaccess file).
Hint: don’t forget to test whether your robots.txt file lets crawlers reach your pages, using the robots.txt Tester tool in Google Webmaster Tools.
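
If you prefer to generate the file rather than type it by hand, a few lines of Python are enough. This is a minimal sketch: the write_robots_txt helper is a hypothetical name, the rules are the ones used earlier in this guide, and uploading the result to your server is left to whatever tool you already use.

```python
from pathlib import Path

# Rules taken from the examples earlier in this guide.
RULES = [
    "User-agent: *",
    "Disallow: /log",
    "Allow: /photos/round.png",
    "",
    "Sitemap: https://YourDomain.com/sitemap.xml",
]

def write_robots_txt(site_root, lines=RULES):
    """Write a plain-text, lowercase-named robots.txt into `site_root`."""
    target = Path(site_root) / "robots.txt"
    target.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return target

if __name__ == "__main__":
    # Writes into the current directory; point it at your site root instead.
    print(write_robots_txt("."))
```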


What are the Best Instructions for a Robots.txt File?

There are no instructions that fit every webmaster’s needs, because each site has its own theme, layout, web server setup, plugin rules, etc.
If you browse the robots.txt file of each site you visit, you will find many variations, each tailored to what that site needs. Likewise, you should write down the instructions you need before you upload the robots.txt file to your server.
Here are some basic instructions:

Must be named robots.txt, not Robots.TXT
Robots.txt must be saved as a plain text file.
Must be placed in the root of your domain (highest-level directory of your site).
WordPress users often disallow the “cgi-bin”, “wp-admin” and “trackback” folders, as follows:

User-agent: *
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /trackback/

Google Panda 4 Update and blocking your Resources (CSS & JS)

In the past, it was common practice to disallow the resources folder “/wp-content/”, which contains all your images, stylesheets, and JavaScript, in order to save bandwidth or for other reasons. This is a mistake.
After the Google Panda 4 update, many webmasters were hit because they blocked Googlebot from rendering their websites correctly, which led to blocked-resource errors when Google fetched their pages.

Hint: Google requires that all JavaScript and CSS files responsible for your site’s layout are not blocked.
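
Before uploading new rules, you can run a rough local check that your layout files stay reachable, using Python’s standard urllib.robotparser. This is a sketch under assumptions: the asset paths are hypothetical, the rules are the WordPress example from this guide, and the parser is not Google’s, so the Webmaster Tools check is still the one that counts.

```python
from urllib.robotparser import RobotFileParser

# The WordPress rules shown earlier in this guide.
WORDPRESS_RULES = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /trackback/
"""

# Hypothetical layout files; replace with the CSS and JS your theme loads.
ASSETS = [
    "/wp-content/themes/example/style.css",
    "/wp-content/themes/example/main.js",
]

def blocked_assets(robots_txt, paths, agent="Googlebot"):
    """Return the paths that `agent` would NOT be allowed to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [
        p for p in paths
        if not parser.can_fetch(agent, "https://example.com" + p)
    ]

if __name__ == "__main__":
    print("safe rules:", blocked_assets(WORDPRESS_RULES, ASSETS))
    risky = WORDPRESS_RULES + "Disallow: /wp-content/\n"
    print("risky rules:", blocked_assets(risky, ASSETS))
```

An empty list means none of the listed files is blocked; adding “Disallow: /wp-content/” makes both of them show up.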


How to Check if Googlebot Has Access to Render Your Site Correctly?

On the Google Webmaster Tools home page, choose the site you need to check.
Expand the Crawl heading in the left dashboard, and select the “Fetch as Google” tool.
Click the red “FETCH AND RENDER” button.
Wait for the fetch process to complete and show the black check mark (“Right” sign).
Once it has completed, click the green check mark (“Right” sign).
You will see two separate windows (one showing how Googlebot sees the page and one showing how a visitor sees it).
Check for any blocked-resource errors or any difference in layout between the two windows, then fix them as soon as possible to get a higher index rate.


Robots.txt Best Practices

Check the crawl errors in Google Webmaster Tools, because some of them can be fixed easily by adding a single line to the robots.txt file.
Be careful while writing your robots.txt file, because a single mistake may make your site invisible to search engines.
Check the robots.txt file after each plugin you install, because some plugins add rules that may conflict with your own.
It’s highly recommended to make sure that Googlebot can access any resource file that meaningfully contributes to your site’s visible content or its layout.
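
The “single mistake” mentioned above is often just one character: “Disallow: /” blocks the whole site, while an empty “Disallow:” blocks nothing. A small guard like the one below, sketched with Python’s standard urllib.robotparser, can be run after each change or plugin install. The blocks_everything helper is a hypothetical name, and the check only covers the home page for one user agent.

```python
from urllib.robotparser import RobotFileParser

def blocks_everything(robots_txt, agent="*"):
    """Return True if `agent` is not even allowed to fetch the home page."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # example.com is a placeholder host; only the "/" path matters.
    return not parser.can_fetch(agent, "https://example.com/")

if __name__ == "__main__":
    site_wide_block = "User-agent: *\nDisallow: /\n"
    no_block = "User-agent: *\nDisallow:\n"
    print("Disallow: /  ->", blocks_everything(site_wide_block))
    print("Disallow:    ->", blocks_everything(no_block))
```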