Quite often you have multiple versions of your website: development, testing, staging and production. To prevent search engines from indexing the non-production versions of your web application, you enable username and password authentication, right? It is the obvious thing to do: not only does it stop search engines from indexing your development work, it also prevents unauthorised access to features that are not live yet.

There will come a day when your client wants to test the Facebook or Twitter sharing function on the staging website, so off you go asking your IT administrator to "temporarily" disable the security. And there it is: the link to the World Wide Web that invites the search engines to index your staging content.

You can of course remind yourself to ask the IT administrator to re-enable the security, but people forget, and these things do slip through with unwanted consequences: diluted search engine rankings, duplicate content issues, and even real users making e-commerce purchases on staging!

Search Google for sites whose hostnames contain "dev", "test" or "staging" and you will find plenty of indexed non-production websites.

Google Webmaster Tools can help you remove unwanted content from search results.

So how do you prevent search engines from indexing your non-production websites?

By using a robots.txt file. This file will tell the search engine crawlers what they can or cannot crawl on your website. The goal is to allow the search bots to crawl everything in your production website and nothing in your non-production websites.

Create two plain (ANSI-encoded) text files named robots.test.txt and robots.live.txt. Do not use Visual Studio to create the text files, as it will save them with UTF-8 encoding and cause problems down the line.
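If you prefer to create the files from a script, here is a minimal sketch in Python 3 (the script is my addition, not part of the original walkthrough). Writing with the ascii codec guarantees no byte order mark is emitted and fails loudly if a non-ASCII character slips in; the contents match the two files shown below:

files = {
    "robots.test.txt": "#DO NOT INDEX ANYTHING ON THIS WEBSITE\nUser-agent: *\nDisallow: /\n",
    "robots.live.txt": "#INDEX EVERYTHING YOU CAN FIND ON THIS WEBSITE\nUser-agent: *\nDisallow:\n",
}

for name, body in files.items():
    # encoding="ascii" cannot produce a BOM; newline="\n" keeps line endings predictable
    with open(name, "w", encoding="ascii", newline="\n") as f:
        f.write(body)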


Edit robots.test.txt and add the following code:

#DO NOT INDEX ANYTHING ON THIS WEBSITE
User-agent: *
Disallow: /

Edit robots.live.txt and add the following code:

#INDEX EVERYTHING YOU CAN FIND ON THIS WEBSITE
User-agent: *
Disallow:

Did you spot the subtle difference between allowing and disallowing crawler access? "Disallow: /" blocks everything, while "Disallow:" (without the forward slash) allows access to all directories. Get this the wrong way round and your whole production website will drop out of the search engine index! Take note.
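You can convince yourself of the difference with Python 3's built-in urllib.robotparser; a quick sketch (the URLs are hypothetical) evaluates both rule sets:

import urllib.robotparser

def allowed(rules, url):
    # Parse an in-memory robots.txt and ask whether a generic crawler ("*")
    # may fetch the given URL.
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(rules.splitlines())
    return rp.can_fetch("*", url)

block_all = "User-agent: *\nDisallow: /"   # robots.test.txt
allow_all = "User-agent: *\nDisallow:"     # robots.live.txt

print(allowed(block_all, "http://staging.example.com/products"))  # False
print(allowed(allow_all, "http://www.example.com/products"))      # True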

Next, copy the files into the root of your Visual Studio project so that /robots.live.txt and /robots.test.txt resolve from the site root.


At this point it's probably worth checking that you have the IIS URL Rewrite module installed on the server.
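One way to check from a script, and this is an assumption about a default installation rather than an official detection method: the URL Rewrite installer drops rewrite.dll into the IIS inetsrv directory, so testing for that file is a reasonable indicator.

import os

# Assumed default location of the URL Rewrite module's DLL (hypothetical check)
rewrite_dll = os.path.expandvars(r"%windir%\System32\inetsrv\rewrite.dll")
print("URL Rewrite appears to be installed" if os.path.exists(rewrite_dll)
      else "rewrite.dll not found - install the IIS URL Rewrite module first")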

Add the following URL Rewrite rules to the <system.webServer> section of your web.config:

<rewrite>
  <rules>
    <!-- Production hostname: serve the permissive robots file -->
    <rule name="Rewrite LIVE robots.txt" enabled="true" stopProcessing="true">
      <match url="^robots\.txt$" />
      <action type="Rewrite" url="/robots.live.txt" />
      <conditions>
        <add input="{HTTP_HOST}" pattern="^(www\.)?tekcent\.com" />
      </conditions>
    </rule>
    <!-- Any other hostname (note negate="true"): serve the blocking robots file -->
    <rule name="Rewrite TEST robots.txt" enabled="true" stopProcessing="true">
      <match url="^robots\.txt$" />
      <action type="Rewrite" url="/robots.test.txt" />
      <conditions>
        <add input="{HTTP_HOST}" pattern="^(www\.)?tekcent\.com" negate="true" />
      </conditions>
    </rule>
  </rules>
</rewrite>
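Note the design of the second rule: because of negate="true" it acts as a catch-all, so any hostname that is not the production domain, including environments you add later, automatically serves the blocking robots file. The configuration fails safe: if you forget about a new environment, crawlers are blocked rather than invited in.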

After deploying the changes, we can test each version of the website by requesting /robots.txt on its hostname:
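A quick smoke test with Python 3's urllib (the staging hostname below is hypothetical) fetches /robots.txt from each environment and prints the first line, which is enough to see which file is being served:

import urllib.request

hosts = [
    "http://www.tekcent.com",      # production: expect robots.live.txt
    "http://staging.example.com",  # non-production (hypothetical): expect robots.test.txt
]

for host in hosts:
    with urllib.request.urlopen(host + "/robots.txt") as resp:
        first_line = resp.read().decode("ascii").splitlines()[0]
        print(host, "->", first_line)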


Finally, this useful tool can check the validity of your robots.txt files.
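If you would rather script that check as well, urllib.robotparser can fetch and evaluate a deployed robots.txt directly (hostname hypothetical):

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://staging.example.com/robots.txt")  # hypothetical staging host
rp.read()  # fetches and parses the live file
print(rp.can_fetch("*", "http://staging.example.com/"))  # expect False on staging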

Monday June 17, 2013, By Anton Pham

