URL Canonicalization

by Canonical SEO on August 20, 2009

Before I get into explaining what a canonical URL is and why URL canonicalization is important, I think it is imperative to understand that Google and other search engines rank URLs.  They do NOT rank web sites.  They do NOT rank web pages.  They rank URLs.  In other words, search engines treat each unique URL in their index as a different web page when they rank the URLs for particular keyword phrases. That being said…

What is a canonical URL?

Every page on your web site should be referred to using one and ONLY one URL.  This single URL for a page is called the canonical (preferred) URL.  There you have it!  Easy… breezy!  Sounds simple enough.

What is URL Canonicalization then?

URL canonicalization is the process of deciding on rules for determining the canonical URL for each and every page on your site (i.e. your site’s canonicalization policy) and then implementing code on your site to enforce those rules (i.e. canonicalization enforcement).

WTH should I care about URL canonicalization?

The short answer is that not having a canonicalization policy and code in place for canonicalization policy enforcement site-wide can drastically affect your rankings in the SERPs… in a VERY negative way.

The majority of webmasters with sites on the web are not very knowledgeable about SEO and suffer from canonicalization issues because they do not understand this most fundamental concept of the canonical URL. Hell, half of the people marketing themselves as SEO professionals on random forums on the web don’t even understand it or don’t realize the full extent to which such issues can wreck a site if left unchecked!

Coming up with a canonicalization policy and putting canonicalization enforcement code in place for that policy should be the first order of business that any webmaster or SEO should tackle for a new or existing site.

Why implement URL canonicalization?

Not implementing rules to enforce canonical URLs on your site will lead to issues including duplicate content and split link juice/split page rank.  This is not optimal from an SEO perspective. And if you’re an SEO then you do get paid to “optimize”, right?  Let me explain…

For example, you may have a page on your site – an index.html default document – that lives in a folder off the root of your web called somefolder. That page may be accessible using many different URLs such as:

http://example.com/somefolder
http://example.com/somefolder/
http://example.com/somefolder/index.html
http://www.example.com/somefolder
http://www.example.com/somefolder/
http://www.example.com/somefolder/index.html

Each of the above URLs is typically seen by the search engines as if it were a different page on your site and is represented as such in their database or index.  Each unique URL above is ranked separately based on its own merits – its own on-page and off-page ranking factors.

It’s easy to see that the lack of a canonical URL will cause duplicate content issues since each URL renders the exact same content.  Each URL in the search engine’s index will have the same content associated with it.  Not good… especially at Google!  At Google one of those URLs will be seen as the originator of the content and the other URLs will have their content flagged as duplicate, making it harder for the URLs flagged as duplicate to rank.  As you can see this is not optimal for your rankings, yet it is not the biggest problem caused by the lack of a canonicalization policy and its corresponding enforcement.

Even worse than duplicate content issues, the lack of URL canonicalization also leads to split link juice/split page rank issues.  Because there is no single canonical URL for the page in the above example, the non-canonical URLs are essentially taking credit for inbound links to your page that should instead be focused on a single canonical URL.

For example, if 20 sites link to the page above using http://example.com/somefolder/, an additional 20 sites link to the same page using http://example.com/somefolder/index.html, an additional 20 sites link to the same page using http://www.example.com/somefolder, and yet an additional 20 sites link to the same page using http://www.example.com/somefolder/index.html, then the search engines will see this as 4 different pages in their index with each page having 20 inbound links.  This is NOT desirable.  In fact it’s an SEO nightmare!

It would be much more advantageous from an SEO perspective for the search engines to view the above scenario as a single URL with 80 inbound links.  Remember… inbound links are THE most important ranking factor in any reputable search engine ranking algorithm.
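To make the arithmetic concrete, here is a small Python sketch of the scenario above. It only tallies the link counts from the example; treating link credit as a plain sum is a simplification for illustration, since real ranking algorithms weigh links in far more complicated ways.

```python
from collections import Counter

# Inbound link counts per URL, taken from the example above.
inbound_links = Counter({
    "http://example.com/somefolder/": 20,
    "http://example.com/somefolder/index.html": 20,
    "http://www.example.com/somefolder": 20,
    "http://www.example.com/somefolder/index.html": 20,
})

# Without canonicalization: each URL is judged on its own links only.
best_single_url = max(inbound_links.values())

# With canonicalization: every variant is credited to one canonical URL.
consolidated = sum(inbound_links.values())

print(best_single_url)  # 20
print(consolidated)     # 80
```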

What URL canonicalization rules should I implement?

That depends on your particular site.  The first step to implementing URL canonicalization is to decide on your URL canonicalization policy – the rules for determining the canonical URL of any page on your site.  Once you decide on the rules, you can then put measures in place to enforce them across all pages on the site.

So what are these decisions you need to make?  What “rules” should make up your URL canonicalization policy?  Well, as you saw in the scenario above, there are lots of things that can cause canonical issues.  So you need a plan for dealing with as many as possible (hopefully all of them that your particular site might encounter).  Decisions like the following must be made:

  • Whether to use the www or the non-www version of every URL on the site?
  • Whether to show the name of default documents in URLs or hide the default document name?
  • If hiding default document names in the URL then whether to show the trailing slash (‘/’) on names of folders containing default documents or hide the trailing slash on the folder name?

It doesn’t really matter which answers you pick.  What is important is that you pick answers to these questions and then implement rules on your site to enforce them… and I cannot stress this enough… SITE WIDE!

Too many sites implement canonicalization enforcement through server-side scripting for their home page only and simply call it a day.  Or they take the easy way out and implement rules to force all URLs to either the www or non-www version of the URL and don’t deal with other important canonicalization issues caused by things like default documents, query string parameters, etc.

I typically choose to use the www version of my domain name. I always hide default document names, and always show the trailing slash for the folder name when running a default document.

So in the example above, I would select http://www.example.com/somefolder/ as my canonical URL.  But that is just my preference.  You may like http://example.com/somefolder/index.html as your canonical URL.  And this is perfectly fine.  The important thing is that you think about it ahead of time, come up with a canonicalization policy, and implement rules to enforce your decisions whatever they may be… and do it site-wide!
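A policy like this is easier to reason about once it is written down as a function. The Python sketch below is one possible way to express the three rules I chose (www, hidden default document, trailing slash on folders). The names `canonicalize` and `DEFAULT_DOCUMENTS` are made up for illustration, and the "no file extension means it is a folder" test is an assumption that will not hold on every site.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: index.html is the only default document name on this site.
DEFAULT_DOCUMENTS = {"index.html"}


def canonicalize(url: str) -> str:
    """Return the canonical form of url under the example policy."""
    parts = urlsplit(url)

    # Rule 1: always use the www version of the host name.
    host = parts.netloc.lower()
    if not host.startswith("www."):
        host = "www." + host

    # Split the path into everything before the last "/" and the last segment.
    path = parts.path or "/"
    head, _, last = path.rpartition("/")

    if last in DEFAULT_DOCUMENTS:
        # Rule 2: hide the default document name, keeping the folder's slash.
        path = head + "/"
    elif last and "." not in last:
        # Rule 3: add the trailing slash to a folder name.
        # Heuristic (assumption): a last segment without a dot is a folder.
        path = path + "/"

    return urlunsplit((parts.scheme, host, path, parts.query, ""))


variants = [
    "http://example.com/somefolder",
    "http://example.com/somefolder/",
    "http://example.com/somefolder/index.html",
    "http://www.example.com/somefolder",
    "http://www.example.com/somefolder/",
    "http://www.example.com/somefolder/index.html",
]

# All six variants collapse to a single canonical URL.
print({canonicalize(u) for u in variants})
# {'http://www.example.com/somefolder/'}
```

If you prefer a different policy (non-www, or visible default documents), only the three rule branches change; the point is that one function decides the canonical URL for every page.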

The decisions above are the most common things you need to consider, but there are other decisions, as I alluded to previously, that you may need to include in your canonicalization policy and enforcement plan as well.  Those might include things like which protocol can be used to access a particular page if you support SSL (i.e. which pages on your site should allow access using HTTP vs. HTTPS, where applicable) or how you will deal with query string parameters like “?aid=92831” on the end of the URLs below.

http://www.example.com/somefolder/?aid=92831
http://www.example.com/somefolder/?aid=93727
http://www.example.com/somefolder/?aid=93727&sort=asc
http://www.example.com/somefolder/?sort=asc&aid=93727
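As a rough illustration of what a query string rule could look like, the Python sketch below drops parameters that are treated as tracking-only and sorts whatever remains, so that parameter order no longer creates new URLs. Treating `aid` as a tracking code is purely an assumption for this example, and `TRACKING_PARAMS` and `canonical_query` are hypothetical names; this is one possible approach, not a recommendation for every site.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: "aid" carries no content meaning and can be dropped.
TRACKING_PARAMS = {"aid"}


def canonical_query(url: str) -> str:
    """Drop tracking parameters and put the remaining ones in sorted order."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    kept = sorted((key, value) for key, value in pairs
                  if key not in TRACKING_PARAMS)
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


urls = [
    "http://www.example.com/somefolder/?aid=92831",
    "http://www.example.com/somefolder/?aid=93727",
    "http://www.example.com/somefolder/?aid=93727&sort=asc",
    "http://www.example.com/somefolder/?sort=asc&aid=93727",
]

for url in urls:
    print(canonical_query(url))
# http://www.example.com/somefolder/
# http://www.example.com/somefolder/
# http://www.example.com/somefolder/?sort=asc
# http://www.example.com/somefolder/?sort=asc
```

Under these assumptions the four URLs collapse to two. Whether `sort=asc` deserves its own URL is a separate policy decision.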

Query string parameters can be the death of a site from an SEO perspective, and you definitely need a strategy to deal with them from a canonicalization perspective if you use them in your URLs.  Keep your eyes peeled for my upcoming post on a solution for dealing with tracking codes in query string parameters.  But for now, back to the topic at hand…

What’s the ideal method for enforcing a canonical URL policy?

The ideal method of enforcing a URL canonicalization policy is to use 301 Moved Permanently redirects.  Learn it…  Know it… Live it!  I know… Some yahoo out there is going to say, “You can just use the new <link rel="canonical"> element to specify a canonical URL!”  Don’t listen to that guy.  Just trust me when I say that it is a bad idea to do so for several reasons, not the least of which is that the new <link rel="canonical"> element was not created as a good solution for canonical issues.  Instead it was created as a “last resort” kind of solution for those sites with no better alternative.  Using 301 redirects is without a doubt the absolute BEST way to implement URL canonicalization. But that debate will have to be the subject of another post!

So here’s what you want to do…

In the above example, I explained that my preference is to always use the www version of all URLs, hide default document names, and include the trailing slash on folder names when running default documents.  To enforce these rules for creating any canonical URL on my site, I would implement 301 Moved Permanently redirects to handle the 5 non-canonical URLs below:

http://example.com/somefolder
http://example.com/somefolder/
http://example.com/somefolder/index.html
http://www.example.com/somefolder
http://www.example.com/somefolder/index.html

I would implement enforcement code such that any page requests for the 5 non-canonical URLs above will get 301 redirected to the canonical URL (in my example, http://www.example.com/somefolder/).

So now if the above 6 URLs (the 5 non-canonical URLs plus the canonical URL) have 20 inbound links each, the 301 redirects will transfer credit to the canonical URL (http://www.example.com/somefolder/) for all inbound links to the non-canonical URLs as well as transferring credit for the link text for all of those inbound links.  The search engines will now see one URL (the canonical URL http://www.example.com/somefolder/) with 120 inbound links rather than 6 different URLs with 20 inbound links each.

Go ahead… You can say it!  “That is freakin’ HUGE!”

Not only can this have dramatic effects on your rankings, but it also makes it so that the canonical URL always shows in the address bar of the user’s browser… even when they click on a link which uses a non-canonical URL. This means going forward when webmasters copy URLs from their browser address bar to create links on their sites to your pages, they will always get the canonical URL.

Remember… don’t just implement 301s for this specific folder. You want to generalize the rules in your canonicalization policy and in the code used to enforce the rules such that if someone requests http://example.com/someotherfolder/index.html, the enforcement code knows to 301 redirect the request to http://www.example.com/someotherfolder/.  In other words, you don’t want to be adding redirect rules every time you add a new URL or a new folder with a default document.
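To show what "generalized" enforcement might look like in application code, here is a minimal Python sketch written as WSGI middleware. It is an illustration under assumptions, not a drop-in solution: `canonicalize`, `canonical_redirect`, `hello_app`, and `probe` are hypothetical names, index.html is assumed to be the only default document, and a path segment without a dot is assumed to be a folder. The rules are applied to whatever URL is requested, so no per-folder redirect entries are needed.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_DOCUMENTS = {"index.html"}  # assumption: the only default document


def canonicalize(url):
    """Example policy: www host, no default document, trailing slash."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if not host.startswith("www."):
        host = "www." + host
    head, _, last = (parts.path or "/").rpartition("/")
    if last in DEFAULT_DOCUMENTS:
        path = head + "/"
    elif last and "." not in last:  # heuristic: no extension means a folder
        path = head + "/" + last + "/"
    else:
        path = head + "/" + last
    return urlunsplit((parts.scheme, host, path, parts.query, ""))


def canonical_redirect(app):
    """Wrap a WSGI app so every non-canonical request gets a 301."""
    def middleware(environ, start_response):
        # Rebuild the requested URL from the WSGI environment.
        requested = "{}://{}{}".format(
            environ["wsgi.url_scheme"],
            environ["HTTP_HOST"],
            environ.get("PATH_INFO") or "/",
        )
        if environ.get("QUERY_STRING"):
            requested += "?" + environ["QUERY_STRING"]

        target = canonicalize(requested)
        if target != requested:
            start_response("301 Moved Permanently",
                           [("Location", target), ("Content-Length", "0")])
            return [b""]
        return app(environ, start_response)
    return middleware


def hello_app(environ, start_response):
    """Stand-in for the real site."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello"]


def probe(app, host, path):
    """Call the app with a fake request; return (status, Location header)."""
    seen = {}

    def start_response(status, headers):
        seen["status"] = status
        seen["location"] = dict(headers).get("Location")

    environ = {"wsgi.url_scheme": "http", "HTTP_HOST": host,
               "PATH_INFO": path, "QUERY_STRING": ""}
    b"".join(app(environ, start_response))
    return seen["status"], seen["location"]


site = canonical_redirect(hello_app)
print(probe(site, "example.com", "/someotherfolder/index.html"))
# ('301 Moved Permanently', 'http://www.example.com/someotherfolder/')
print(probe(site, "www.example.com", "/someotherfolder/"))
# ('200 OK', None)
```

Each non-canonical request is answered with a single redirect straight to the canonical URL, rather than a chain of hops (first to www, then to the trailing slash, and so on).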

How do you implement 301 redirects for URL canonicalization enforcement?

Hopefully your site is hosted on an Apache web server.  There are lots of reasons that most web sites that exist on the web are hosted on Apache. Apache comes with a plethora of tools that you can use to implement 301 redirects.  But the most common and probably the most robust is Mod Rewrite (the mod_rewrite module). You can find tons of documentation online and even at your local bookstore on Mod Rewrite.  It’s a bit intimidating at first, but once you get a grip on using regular expressions and the flow of control within Mod Rewrite when evaluating rules and conditions in the .htaccess files, it becomes easy and its power is limitless.

If, however, your site is hosted on Microsoft’s Internet Information Services (IIS) then I would strongly suggest that you purchase ISAPI Rewrite from helicontech.com. It is basically Mod Rewrite for IIS.  The configuration files and related syntax are 99+% compatible with Apache’s Mod Rewrite.  If your site is being hosted on IIS version 7.x or higher then IIS has capabilities similar to Mod Rewrite built into the web server, but personally I would still recommend ISAPI Rewrite over IIS 7.x’s built-in utilities for implementing 301 redirects.  It’s been around a LOT longer, is proven to work, and is 99+% compatible with Mod Rewrite so there is tons of documentation for it.

I hope you found the post useful.  Stay tuned for more!

4 comments

Rug Pads August 20, 2009 at 11:14 am

Great post explaining the first rule of SEO; canonicalization. I’ve seen the issues first hand on a site having different URLs for the same page kill search rankings. The worst is when you have URL query strings with parameters that contain some sort of user or channel identifier. That makes every URL appear different to Google. No way to rank in that scenario. I know for that site they have implemented a canonicalization policy but I have never done it personally; just knew about the situation.

I have a small site that sells rug pads that is hosted at GoDaddy. Using the Mod Rewrite sounds a little intimidating for small sites with no IT staff. Maybe you can blog about getting Mod Rewrite set up and how to use it sometime.

Craig

Petsvetshop October 12, 2009 at 6:22 pm

Great article, thanks. I have only just recently become aware of canonicalization and your article has made the issues (and fixes) much more clear to me – thanks again

shawn December 16, 2009 at 2:21 pm

Thanks for the insight. The information you gave me on Digital Point is invaluable. I’m new to SEO and now I feel a good bit more equipped to make the necessary changes needed to get a more accurate PR.

Diana Caswell January 19, 2010 at 2:51 pm

Thanks for the post. It’s clear and concise, and it covers an issue many don’t even know exists.
