Hosting personal web pages without the tilde

One of the things I dealt with this week was eliminating the tilde ("~") from URLs as part of the migration to our new personal web hosting service. Searching around, I found little information about how to do this, though all the tools are there if you know where to look. In an attempt to save others some time, here's what I did...

h2. The problem

Right now we have URLs that look like "http://www.sub.dom.edu/~userid/" and we want them to lose that tilde when we deploy our new system. But since the URLs with the tilde have been around for many years, we also need to ensure that those links keep working for our users.

The "~" serves a purpose in Apache. It signifies that the following string is a username, and that Apache should use the mod_userdir module, if enabled, to find out where that user lives in the filesystem and serve their content. It also preserves the root namespace ("/"), so people making pages for the top of the site don't have to worry about their directories clashing with usernames. mod_userdir is handy in general because it looks up a user's home directory (and thus their web content) for you, which makes the sysadmin's life easier. Thus, "getting rid of the tilde" isn't as easy as one would think or hope.

h2. Which way to skin the cat?

We have a few options to get where we want to be:

* Rewrite the ~ to some other string, like "people", so it looks more natural but still mostly preserves the "/" namespace (e.g., www.sub.dom.edu/people/userid). This makes for longer URLs and was therefore not really an option for us.

* Rewrite the ~ away, so everything lives in the "/" namespace (e.g., www.sub.dom.edu/userid). Since we have actual content at the root of our site, this really isn't an option, because eventually we'd see a namespace clash between the users and the content designers.

* Get rid of the tilde entirely and instead give everyone their own personalized vhost (e.g., userid.sub.dom.edu). This keeps all the namespaces clean, looks quite personal for the user, and makes maintaining the older ~ based links for backwards compatibility pretty easy. We chose this option.

h2. The DNS setup

The first trick is making the DNS name "userid.sub.dom.edu" resolve to the address of our webserver. The easy way to do this is with a "wildcard DNS entry":http://en.wikipedia.org/wiki/Wildcard_DNS_entry . With this, every name under "sub.dom.edu" will resolve to our webserver, including our desired "userid.sub.dom.edu" domain names. We'll use the domain name to figure out which user we want, and return a "not found" error for everything else. We used an entry similar to this:

p=. @*.sub.dom.edu. in cname name.of.server.edu.@

h2. The Apache setup for the wildcard domains

The VirtualHost entry for Apache looks like pretty much any other entry, except that we want to use two extra options. The first is "ServerAlias", which allows us to have additional names for our VirtualHost. We used the same wildcard we used in our DNS setup above, "*.sub.dom.edu."

The next option we need is "UseCanonicalName", which configures how Apache figures out its own name (see "the doc":http://httpd.apache.org/docs/2.0/mod/core.html#usecanonicalname for more info). We set this to "Off", which boils down to "use whatever the browser gave us in the HTTP Host: header", namely "user.sub.dom.edu." This also commits us to supporting only clients that send a Host: header, which became mandatory in HTTP/1.1.

The rest of this entry should be fairly standard. Obviously we've turned off the use of mod_userdir because we're reimplementing its functionality sans-tilde, and we've set up a Directory stanza to give permissions to the user data.
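
To make the description above concrete, here is a minimal sketch of what such a VirtualHost could look like. It is not our actual entry: the @*:80@ binding, the ServerName and the @/home/*/public_html@ path are assumptions made for illustration, and the access-control lines use the Apache 2.0/2.2 syntax to match the docs linked in this post.

bc. <VirtualHost *:80>
    # Hypothetical primary name; the wildcard alias is what catches userid.sub.dom.edu
    ServerName users.sub.dom.edu
    ServerAlias *.sub.dom.edu
    # Build the server's own name from the Host: header instead of ServerName
    UseCanonicalName Off
    # Turn off tilde lookups (this directive is only accepted when mod_userdir is loaded)
    UserDir disabled
    # Assumed location of user content; adjust to wherever your users really live
    <Directory /home/*/public_html>
        Options Indexes SymLinksIfOwnerMatch
        AllowOverride None
        Order allow,deny
        Allow from all
    </Directory>
</VirtualHost>

On its own this entry only answers for the wildcard names; it does not yet know which directory belongs to which user.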

h2. Finding the users

At this point, we should have a VirtualHost which answers for our wildcard DNS names. But how do we map this domain to the user and their data? We'll need to add some more configuration to our VirtualHost in order to do this.

Apache discusses this issue in their "documentation about mass hosting":http://httpd.apache.org/docs/2.0/vhosts/mass.html . In essence there are two methods: "mod_vhost_alias":http://httpd.apache.org/docs/2.0/mod/mod_vhost_alias.html and "mod_rewrite":http://httpd.apache.org/docs/2.0/mod/mod_rewrite.html . Which method is right for you depends on the complexity of your current environment.

mod_vhost_alias depends on you being able to map a user to a filesystem path based solely on the information contained in the hostname. You can't call out to a program or make any standard library calls to find a user's home directory. If you've laid out your filesystem and mountpoints ahead of time to account for this, then you're all set: mod_vhost_alias will do most of the heavy lifting of mapping hosts to user directories, and everything will work with one or two lines of fairly simple configuration. See the mod_vhost_alias docs for some examples.

Of course, our filesystem is not laid out in a manner that will work for mod_vhost_alias, so we'll have to use mod_rewrite to accomplish this goal.

Our problem is that we need to know where in the filesystem a user lives. We don't have this information encoded in the hostname because of the way the system mounts the user data, and since we're using Linux we don't have union or junction mounts to flatten the filesystem into something mod_vhost_alias can work with. Thus, we need to use mod_rewrite's "RewriteMap":http://httpd.apache.org/docs/2.0/mod/mod_rewrite.html#rewritemap feature to do the mapping, using a script called "find-a-user" that knows which users live where in the filesystem. Our script is highly site-specific so I won't post it here; see the sample in the RewriteMap doc for what one looks like and how to write a "prg" type map.
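
For sites where that flat layout does exist, the mod_vhost_alias approach might be as short as the following sketch. The @/home@ prefix is an assumption for illustration; it is not how our filesystem is laid out.

bc. # Requires mod_vhost_alias; /home/<userid>/public_html is an assumed flat layout
UseCanonicalName Off
# %1 expands to the first dot-separated part of the hostname,
# so userid.sub.dom.edu would be served from /home/userid/public_html
VirtualDocumentRoot /home/%1/public_html

These lines would go inside the wildcard VirtualHost in place of any rewrite rules.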
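
If your user-to-path mapping is static, a plain-text map may be enough and no program is needed. The sketch below is a simpler stand-in, not our find-a-user script; the map file path and the entries in it are hypothetical.

bc. # Hypothetical map file /etc/httpd/conf/user-homes.txt, one "key value" pair per line:
#   alice  /mounts/vol1/alice
#   bob    /mounts/vol7/bob
# A txt: map is looked up with the same ${users:key} syntax as a prg: map
RewriteMap users txt:/etc/httpd/conf/user-homes.txt

A key that is missing from the file expands to an empty string, so unknown users end up at a path that does not exist and get a 404.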

The config looks similar to this:

# Enable mod_rewrite
RewriteEngine on

# Debug anyone?
RewriteLog "logs/rewrite.log"
RewriteLogLevel 1

# First, lowercase the hostname please.
# DNS is not case sensitive, but our filesystem is
RewriteMap lowercase int:tolower

# now find a user using the find-a-user script by appending the hostname to the request
# and then splitting the first part of the host out, passing it through the find-a-user
# script and then using the result to find the correct mount and appending the actual
# RHS of the URI to get to a real file. Sounds simple, right?
RewriteMap users prg:/usr/local/bin/find-a-user
# Apply the lowercase map so mixed-case hostnames still match
RewriteCond ${lowercase:%{HTTP_HOST}} ^[^.]+\.sub\.dom\.edu$
RewriteRule ^(.+) ${lowercase:%{SERVER_NAME}}$1 [C]
RewriteRule ^([^.]+)\.sub\.dom\.edu(.*) ${users:$1}/public_html$2

Wheeee! Now we have a virtual host that answers for any domain and will attempt to use the first part of the domain name to map it to a user directory. If there's no such user, people will get a 404 error code, else they'll get the content created by the user.

h2. What about the old links?

Obviously, our setup above does not make any attempt to preserve the links to the old system, which use the ~ style URLs. This is because we're switching domain names at the same time we switch to the new system. Therefore, we can use the old domain to simply redirect to the new style user.sub.dom.edu domains. Something like the following in a VirtualHost for the old domain should ensure that all of the old ~ style URLs get redirected correctly.

p=. @RedirectMatch 301 ^/~([^/]+)/?(.*)$ http://$1.sub.dom.edu/$2@

Since we're also moving domains, we added a number of other Redirects to handle our site-specific pages, documentation, etc.
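
Put together, the VirtualHost for the old domain could look roughly like the sketch below. The old hostname and the @/docs@ redirect are hypothetical placeholders rather than our real entries, and the username pattern here stops at the first slash so that the rest of the path is carried over intact.

bc. <VirtualHost *:80>
    # Hypothetical name standing in for the old domain
    ServerName www.old.dom.edu
    # /~userid/some/page.html is sent to http://userid.sub.dom.edu/some/page.html
    RedirectMatch 301 ^/~([^/]+)/?(.*)$ http://$1.sub.dom.edu/$2
    # Hypothetical site-specific page that moved; Redirect matches by path prefix
    Redirect 301 /docs http://www.sub.dom.edu/docs
</VirtualHost>

Using 301 tells browsers and search engines that the move is permanent.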

h2. Hope you found this useful!

My main motivation for writing this was that there seemed to be little documentation about how to do this. Even the Apache docs, from which much of this was culled, didn't seem quite clear and didn't pop up when searching with Google. My hope is that this writeup will save someone time and trouble later.

If you've solved this problem another way, please leave a comment about how you did it!