Preserving Permalinks

Fri, April 9, 2010, 07:22 AM under Blogging

One of the things that gets me on a rant is websites that break permalinks. If you have posted something somewhere and there is a public URL pointing to it, that URL should never ever return a 404. You are breaking all websites that ever linked to you and you are breaking all search engine links to your content (that others will try to follow). It is a pet peeve of mine.

So when I had to move my blog, obviously I would preserve the root URL (www.danielmoth.com/Blog/), but I also wanted to preserve every URL my blog has generated over the years. To be clear, the focus here is on the URL formatting, not the content migration, which I'll talk about in my next post. In this post, I'll describe my solution first and then what it solves.

1. The IIS7 Rewrite Module and web.config

There are a few ways you can map an old URL to a new one (so when requests to the old URL come in, they get redirected to the new one). The new blog engine I use (dasBlog) has built-in functionality to do that (Scott refers to it here). Instead, the way I chose to address the issue was to use the IIS7 rewrite module.

The IIS7 rewrite module allows redirecting URLs based on pattern matching, regular expressions and, of course, hardcoded full URLs for things that don't fall into any pattern. You can configure it visually from IIS Manager using a handy dialog that allows testing patterns against input URLs. Here is what mine looked like after configuring a few rules:

[Screenshot: URL Rewrite rules in IIS Manager]

To learn more about this technology check out this video, the reference page and this overview blog post; all 3 pages have a collection of related resources at the bottom worth checking out too.

All the visual configuration ends up in a web.config file at the root folder of your website. If you are on a shared hosting service, probably the only way you can use the Rewrite Module is by directly editing the web.config file. What I did was create the rules locally using the GUI, and then upload the generated web.config file to my live site. You can view my web.config here. Next, I'll describe the URLs I had to map and how each mapping manifested itself in the web.config file.
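For orientation, rewrite rules sit under the system.webServer section of web.config. The following is a minimal sketch of that shape, not a copy of the linked file; the comments mark where the rules and the map named in the following sections would go.

```xml
<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <system.webServer>
    <rewrite>
      <rewriteMaps>
        <!-- Lookup tables for one-off redirects, e.g. a map named "StaticRedirects". -->
      </rewriteMaps>
      <rules>
        <!-- One <rule> element per redirect pattern; rules are evaluated in order. -->
      </rules>
    </rewrite>
  </system.webServer>
</configuration>
```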

2. Monthly Archives

Observe the difference between the way the two blog engines generate this type of URL:

  • Blogger: /Blog/2004_07_01_mothblog_archive.html
  • dasBlog: /Blog/default,month,2004-07.aspx

In my web.config file, the rule that deals with this is the one named "monthlyarchive_redirect".
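A rule of this kind might look roughly like the sketch below. It is an illustration rather than the contents of the linked file: the pattern is adapted from the regular expression shown in section 6, and it assumes the web.config sits at the site root (so the matched URL starts with "Blog/" and has no leading slash) and that a permanent (301) redirect is wanted.

```xml
<!-- Sketch only: goes inside <rewrite><rules> in a web.config at the site root. -->
<rule name="monthlyarchive_redirect" stopProcessing="true">
  <!-- {R:1} = year, {R:2} = month; the day part of the old URL is matched but discarded. -->
  <match url="^Blog/([0-9]{4})_([0-9]{2})_[0-9]{2}_mothblog_archive\.html$" />
  <action type="Redirect" url="/Blog/default,month,{R:1}-{R:2}.aspx" redirectType="Permanent" />
</rule>
```

With this in place, a request for /Blog/2004_07_01_mothblog_archive.html would be redirected to /Blog/default,month,2004-07.aspx.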

3. Categories

Observe the difference between the way the two blog engines generate this type of URL:

  • Blogger: /Blog/labels/Personal.html
  • dasBlog: /Blog/CategoryView,category,Personal.aspx

In my web.config file the rule that deals with this is the one named "category_redirect".
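As a hedged sketch (again assuming a web.config at the site root and a permanent redirect, and adapting the pattern from the C# regular expression in section 6), such a rule could be written as:

```xml
<!-- Sketch only: goes inside <rewrite><rules> in a web.config at the site root. -->
<rule name="category_redirect" stopProcessing="true">
  <!-- {R:1} = the label name; the character class also allows "_", "%", spaces and hyphens. -->
  <match url="^Blog/labels/([_0-9a-z% -]+)\.html$" />
  <action type="Redirect" url="/Blog/CategoryView,category,{R:1}.aspx" redirectType="Permanent" />
</rule>
```

The match is case-insensitive by default, so "Personal" is captured even though the pattern only lists lowercase letters.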

4. Posts

Observe the difference between the way the two blog engines generate this type of URL:

  • Blogger: /Blog/2004/07/hello-world.html
  • dasBlog: /Blog/Hello-World.aspx

In my web.config file the rule that deals with this is the one named "post_redirect".
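A sketch of what a rule like this could look like follows. It is illustrative only, assumes a web.config at the site root and a permanent redirect, and reuses the pattern from the C# code in section 6.

```xml
<!-- Sketch only: goes inside <rewrite><rules> in a web.config at the site root. -->
<rule name="post_redirect" stopProcessing="true">
  <!-- The year and month are matched but dropped; {R:1} = the hyphenated post title. -->
  <match url="^Blog/[0-9]{4}/[0-9]{2}/([0-9a-z-]+)\.html$" />
  <action type="Redirect" url="/Blog/{R:1}.aspx" redirectType="Permanent" />
</rule>
```

This sends /Blog/2004/07/hello-world.html to /Blog/hello-world.aspx, which relies on the new engine resolving post URLs regardless of letter case.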

Note: I decided to use dasBlog URLs that do not include the date info (see the description of my Appearance settings). If the URLs included the date info, they would have to include the day part, which Blogger did not generate. That would make it impossible to redirect correctly and still have a single permalink for blog posts moving forward. An implication of this decision is that no two blog posts can have the same title. The tool I will describe in my next post (inelegantly) deals with duplicates, but not with triplicates or higher.

5. Unhandled by a generic rule

Unfortunately, the two blog engines use different rules for generating URLs for blog posts. Most of the time the conversion is as simple as the example of the previous section, where a post titled "Hello World" generates a URL with the words separated by hyphens. Sometimes that is not the case, for example:

  • /Blog/2006/05/medc-wrap-up.html
  • /Blog/MEDC-Wrapup.aspx

or

  • /Blog/2005/01/best-of-moth-2004.html
  • /Blog/Best-Of-The-Moth-2004.aspx

or

  • /Blog/2004/11/more-windows-mobile-2005-details.html
  • /Blog/More-Windows-Mobile-2005-Details-Emerge.aspx

In short, Blogger does not add words from the title to the URL beyond ~39 characters, it drops some words when generating the URL (e.g. a, an, on, the), and it preserves hyphens that appear in the title. For this reason, we need to detect these cases and explicitly list them for redirects (no regular expression can help here because the full set of rules is not listed anywhere).

In my web.config file the rule that deals with this is the one named "Redirect rule1 for FullRedirects" combined with the rewriteMap named "StaticRedirects".

Note: The tool I describe in my next post will detect all the URLs that need to be explicitly redirected and will list them in a file ready for you to copy them to your web.config rewriteMap.
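To illustrate how a rewriteMap and its rule fit together, here is a minimal sketch using the three example URLs above. It is not a copy of the linked file: the condition and action follow the shape that IIS Manager typically generates for map-based redirects, and the permanent redirect type is an assumption.

```xml
<!-- Sketch only: both elements go inside <rewrite> in the web.config. -->
<rewriteMaps>
  <rewriteMap name="StaticRedirects">
    <!-- key = old Blogger path, value = new dasBlog path -->
    <add key="/Blog/2006/05/medc-wrap-up.html" value="/Blog/MEDC-Wrapup.aspx" />
    <add key="/Blog/2005/01/best-of-moth-2004.html" value="/Blog/Best-Of-The-Moth-2004.aspx" />
    <add key="/Blog/2004/11/more-windows-mobile-2005-details.html" value="/Blog/More-Windows-Mobile-2005-Details-Emerge.aspx" />
  </rewriteMap>
</rewriteMaps>
<rules>
  <rule name="Redirect rule1 for FullRedirects" stopProcessing="true">
    <match url=".*" />
    <conditions>
      <!-- Looks up the requested path in the map; the condition matches only if an entry exists. -->
      <add input="{StaticRedirects:{REQUEST_URI}}" pattern="(.+)" />
    </conditions>
    <action type="Redirect" url="{C:1}" appendQueryString="false" redirectType="Permanent" />
  </rule>
</rules>
```

Because rules are evaluated in order and this one stops processing on a match, it would need to appear before the generic post rule so that the explicit entries win.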

6. C# code doing the same as the web.config

I wrote some naive code that does the same thing as the web.config: given a string it will return a new string converted according to the 3 rules above. It does not take into account the 4th case where an explicit hard-coded conversion is needed (the tool I present in the next post does take that into account).

  // Requires: using System.Text.RegularExpressions;

  // Patterns for the three generic rules; the literal dot before "html" is escaped.
  static string REGEX_post_redirect           = @"[0-9]{4}/[0-9]{2}/([0-9a-z-]+)\.html";
  static string REGEX_category_redirect       = @"labels/([_0-9a-z% -]+)\.html";
  static string REGEX_monthlyarchive_redirect = @"([0-9]{4})_([0-9]{2})_[0-9]{2}_mothblog_archive\.html";

  // Returns the dasBlog-style URL for a Blogger-style URL, or an empty string if no rule matches.
  static string Redirect(string oldUrl)
  {
    GroupCollection g;
    if (RunRegExOnIt(oldUrl, REGEX_post_redirect, 2, out g))
      return string.Concat(g[1].Value, ".aspx");

    if (RunRegExOnIt(oldUrl, REGEX_category_redirect, 2, out g))
      return string.Concat("CategoryView,category,", g[1].Value, ".aspx");

    if (RunRegExOnIt(oldUrl, REGEX_monthlyarchive_redirect, 3, out g))
      return string.Concat("default,month,", g[1].Value, "-", g[2].Value, ".aspx");

    return string.Empty;
  }

  // Matches toRegEx against pattern; groupCount includes group 0 (the whole match).
  static bool RunRegExOnIt(string toRegEx, string pattern, int groupCount, out GroupCollection g)
  {
    g = null;
    if (pattern.Length == 0)
      return false;

    Match m = Regex.Match(toRegEx, pattern, RegexOptions.IgnoreCase);
    if (!m.Success)
      return false;

    g = m.Groups;
    return (g.Count == groupCount);
  }