As a webmaster, you will almost certainly need to know at some point how to prevent search engines like Google from indexing a particular page on your site. Preventing a page from being indexed and displayed in the search engine results pages (SERPs) is not that difficult. The common solutions, however, come with varying degrees of effectiveness.
Why would you want to prevent Google from indexing a page?
One might ask, “Why would I want to prevent Google from indexing my URL?” That’s a great question. As hard as it is these days to get pages indexed and keep them indexed, it seems to almost defy logic to want to keep certain pages on your site out of the index. Believe me, if you stay in this business long enough you will eventually need to do so, and it’s important that you know how to do it effectively.
There are lots of reasons why a webmaster might want to prevent Google from indexing a page:
- The page might contain sensitive information that the webmaster does not want to be searchable in the Google index.
- The page might contain duplicate content like that of a print friendly page. Print friendly pages are often missing site navigation links and can provide a bad user experience when clicked on in the SERPs.
- Maybe the page is only going to be up for a short period of time and the webmaster does not want to deal with the 404 errors it will produce when they take the page down nor with creating the 301 redirects required to fix the 404s.
- Perhaps the webmaster wants to allow users to test an entirely new beta version of their site under a /beta folder on their existing site without the beta version being indexed, because it’s going to move later.
The list of reasons goes on and on.
Ways webmasters attempt to prevent indexing
Webmasters use lots of different techniques to prevent certain URLs from being indexed and displayed in the SERPs. Some are effective. Some are not. The methods used tend to boil down to the following:
- rel="nofollow"
- robots.txt
- meta robots
I have listed the above methods to prevent indexing at Google in order of effectiveness, from least to most effective. Let’s explore each in more detail.
Rel="nofollow" to prevent indexing
Contrary to popular belief, using the rel="nofollow" attribute when linking to a page from other pages on your site will NOT prevent the target page from being indexed by Google. This approach is very ineffective for a couple of key reasons.
While it may keep Google from discovering the target page through the nofollowed links themselves, nothing guarantees that every link to the page will carry a rel="nofollow" attribute. It only takes a single followed link to get the page indexed by mistake.
There is a chance that at some point in the future you may forget and put up a followed link to the page on your own site. Even if you always remember the rel="nofollow" attribute when linking to the page, you cannot prevent other webmasters from linking to the page with followed links. So never count on rel="nofollow" to prevent indexing by Google or any other search engine.
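To illustrate how easily a followed link slips through, here is a minimal Python sketch that scans a chunk of HTML for links to a target page that lack rel="nofollow". The class name, function name, sample markup, and the /private/report.html path are all hypothetical and chosen only for illustration. It can audit only your own pages, not links on other people's sites.

```python
from html.parser import HTMLParser


class FollowedLinkFinder(HTMLParser):
    """Collects links to a target path that lack rel="nofollow"."""

    def __init__(self, target_path):
        super().__init__()
        self.target_path = target_path
        self.followed = []  # hrefs that point at the target without nofollow

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        href = attr_map.get("href") or ""
        # rel can hold several space-separated tokens, e.g. "nofollow noopener"
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        if href.endswith(self.target_path) and "nofollow" not in rel_tokens:
            self.followed.append(href)


def find_followed_links(html, target_path):
    finder = FollowedLinkFinder(target_path)
    finder.feed(html)
    return finder.followed


# Hypothetical page fragment: one link was nofollowed, one was forgotten.
sample = """
<a href="/private/report.html" rel="nofollow">Report</a>
<a href="/private/report.html">Report (footer link)</a>
"""

print(find_followed_links(sample, "/private/report.html"))
# prints ['/private/report.html']
```

The single forgotten footer link is enough for a crawler to follow, which is why this method cannot be relied on.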
Robots.txt to prevent indexing
While disallowing a URL using your robots.txt file will prevent Google from crawling and indexing the contents of a page, it will NOT necessarily prevent Google from displaying the URL in their search results. Google does sometimes show URLs in their SERPs that are not indexed. At first glance it may appear in these cases that Google is ignoring the robots.txt file. From Google’s perspective, however, they are still obeying robots.txt.
Google sometimes shows pages that are blocked by robots.txt in their SERPs when they know about the blocked URL because of inbound links from other sites and feel that the page would be a good result for the user based on the link text from those other inbound links. Even though Google might not know the contents of the page, with enough inbound links with link text matching the user’s search phrase they can often say with confidence that the URL is relevant to the user’s search. Though they might show these blocked URLs in the SERPs, Google feels they are still obeying robots.txt since they are not crawling or indexing the content of the page.
When Google displays a URL blocked by robots.txt, the snippet is typically missing from the SERP listing. Only a constructed title and the URL appear in most cases. If the URL has a DMOZ (Open Directory Project) entry, however, then Google may include the DMOZ description as the snippet as well.
So if you are trying to prevent Google from indexing your URL and do not want it shown in the SERPs, using robots.txt is not a very effective approach either.
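As a rough illustration of what a robots.txt rule does and does not control, the sketch below uses Python's standard urllib.robotparser module to evaluate a Disallow rule. The rule for the /beta/ folder, the www.example.com host, and the helper name is_crawl_allowed are assumptions made for the example. Python's parser only approximates how a real crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks crawling of everything under /beta/.
robots_txt = """\
User-agent: *
Disallow: /beta/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())


def is_crawl_allowed(url, user_agent="Googlebot"):
    """True if the parsed robots.txt lets this user agent fetch the URL."""
    return parser.can_fetch(user_agent, url)


print(is_crawl_allowed("https://www.example.com/beta/index.html"))  # prints False
print(is_crawl_allowed("https://www.example.com/about.html"))       # prints True
```

Note that the answer is only about fetching. Nothing in robots.txt tells a search engine whether it may list a URL it has learned about from inbound links, which is the gap described above.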
Meta robots to prevent indexing
If you do not want your page indexed and would also like to prevent it from showing up in Google’s search results then using the meta robots tag is the most effective method of achieving both of these goals. Simply include the following meta robots tag within the head of your web page:
<meta name="robots" content="noindex">
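And do not block the URL using robots.txt. Let the spiders crawl the page in question, because they can only obey the noindex directive if they are able to fetch the page and read the tag. This will prevent the URL from being indexed and prevent it from being shown in Google’s SERPs.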
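If you want to confirm that a page really carries the tag, a small check along these lines may help. It is a sketch using Python's standard html.parser module. The class name MetaRobotsParser, the function name has_noindex, and the sample page are hypothetical, and the check only looks at the HTML you pass in, not at what Google has actually crawled.

```python
from html.parser import HTMLParser


class MetaRobotsParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        if (attr_map.get("name") or "").lower() != "robots":
            return
        # content may list several comma-separated directives
        content = attr_map.get("content") or ""
        for token in content.split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def has_noindex(html):
    parser = MetaRobotsParser()
    parser.feed(html)
    return "noindex" in parser.directives


# Hypothetical page used only to exercise the check.
page = '<html><head><meta name="robots" content="noindex"></head><body></body></html>'
print(has_noindex(page))  # prints True
```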
What if your page is already in Google’s index or showing in the SERPs?
If your page has already been indexed by Google and/or is showing in the SERPs, you can have it removed by doing the following:
- Add the above meta robots tag to prevent it from being re-indexed.
- Verify the URL is not blocked by robots.txt.
- Submit a URL removal request via Google’s Webmaster Tools.
Submitting a URL removal request will cause Google to drop the URL from their index. Adding the meta robots tag with noindex and making sure the URL is not blocked by robots.txt will prevent the URL from being re-indexed and/or displayed in the SERPs in the future.
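The first two steps can be treated as a pre-flight check before you submit the removal request. The sketch below is only a toy decision helper. The function name removal_blockers and its two boolean inputs are hypothetical, and you would determine their values yourself by inspecting the page and your robots.txt file.

```python
def removal_blockers(has_noindex, blocked_by_robots):
    """Return the fixes still needed before requesting removal.

    An empty list means the first two steps are satisfied.
    """
    problems = []
    if not has_noindex:
        problems.append('add <meta name="robots" content="noindex"> to the page')
    if blocked_by_robots:
        problems.append("unblock the URL in robots.txt so the noindex tag can be crawled")
    return problems


# Page has the tag but is still disallowed in robots.txt: one fix remains.
print(removal_blockers(has_noindex=True, blocked_by_robots=True))
# Tag present and URL crawlable: nothing left to fix before the request.
print(removal_blockers(has_noindex=True, blocked_by_robots=False))  # prints []
```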