Channel: Web Site Configuration – Slicksurface – Tech, Design & SEO Blog

Use Apache’s .htaccess To Accomplish Cool And Useful Tasks


One of the reasons Apache is such a popular web server is that it's almost infinitely expandable and flexible. There are some incredibly powerful things you can do with Apache's config settings, and .htaccess is the most common way to modify those settings.

First, a little background. .htaccess is the name of the file. Yes, it starts with a dot, which means that on Unix-based systems (like Linux and Apple's OS X) the file is hidden by default. However, web authoring programs like Dreamweaver show it anyway because they know its power and importance.

You place the .htaccess file in a directory, and the rules in it affect that directory as well as all of its subdirectories. It should be mentioned that .htaccess is a little inefficient, because Apache has to look for and re-read the file on every request. If you have access to your site's virtual host file, you can do pretty much the same things there more efficiently.
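
For comparison, here's a minimal sketch of what the virtual host approach might look like. The document root path is a made-up example and the Require line assumes Apache 2.4, so treat this as an illustration rather than our actual config.

```apache
# Hypothetical virtual host file - the paths are assumptions for illustration
<VirtualHost *:80>
    ServerName www.slicksurface.com
    DocumentRoot /var/www/slicksurface

    <Directory /var/www/slicksurface>
        # Tell Apache not to look for .htaccess files at all
        AllowOverride None
        # Apache 2.4 syntax for letting visitors in
        Require all granted

        # Rules that would otherwise live in .htaccess go here
        RewriteEngine On
    </Directory>
</VirtualHost>
```

The trade-off is that changes to this file only take effect after Apache is reloaded, while changes to .htaccess take effect on the next request.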

There are far too many things that are possible with .htaccess to discuss all of them here, so we'll just touch on some of the more common and useful things you can do.

One thing we should mention is that .htaccess is often used in combination with mod_rewrite. Apache is a modular web server. There are probably 20-30 common modules that are used with Apache and mod_rewrite is one of them. mod_rewrite lets you change URLs and take actions based on URLs. That may sound confusing, but it will make more sense in a moment. When you use mod_rewrite you need to have a line that reads:

RewriteEngine On

I'll be putting that in each of the examples, but when you use the code from this page you only need to have that line once in your .htaccess file - before the first mod_rewrite command.
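
If you're not sure mod_rewrite is installed on your host, one common defensive pattern is to wrap the rules in an IfModule block. This is just a sketch: the rule inside is a placeholder with made-up file names, not something from our site.

```apache
# Only run these directives when mod_rewrite is loaded
<IfModule mod_rewrite.c>
    # Needed once, before the first rewrite rule
    RewriteEngine On

    # Placeholder rule (hypothetical file names): redirect old-page.htm to /new-page.htm
    RewriteRule ^old-page\.htm$ /new-page.htm [R=301,L]
</IfModule>
```

Without the module the block is simply skipped, instead of the whole site throwing a server error because of an unknown directive.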

The other thing I want to mention is that mod_rewrite is based on "regular expressions". If you really want to get into using mod_rewrite I suggest getting a book on regular expressions. You want a book that's concise and to the point, since regular expressions can get pretty complicated, but the basics are pretty straightforward.

Enforcing a canonical domain name

As you may know, you can often leave the www. out of a URL and the URL will work fine. The problem is that if you let people get to your site however they want, and don't enforce either always having www or never having www, the search engines may think www and non-www are two different sites. That's theoretically possible, and it was true in the early days of the World Wide Web. The problem with the search engines thinking you have two sites instead of one is that the authority of your site will be split and you'll have duplicate content issues. In fact the search engine may even think one site is stealing content from the other site. The bottom line is that the search engines get confused, and it's never good to get search engines confused.

The way to fix this with .htaccess is to have a statement that looks like this...

RewriteEngine On
RewriteCond %{HTTP_HOST} !^www\.slicksurface\.com$
RewriteRule ^(.*)$ http://www.slicksurface.com/$1 [R=301,L]

Notice there is a RewriteCond line and a RewriteRule line. There will always be a RewriteRule line with mod_rewrite, and there can be zero or many RewriteCond lines. Think of RewriteCond as defining conditions that must be met before executing the rule.

You may know from scripting languages like JavaScript that ! means "not". Add to that the regular expression syntax, where ^ means "starts with" and $ means "ends with", and the RewriteCond line is saying: if the HTTP_HOST is NOT exactly www.slicksurface.com. The backslashes are there because the periods need to be escaped (an unescaped period matches any character). And you should know (or at least guess) that HTTP_HOST is the Apache server variable that holds the host name the visitor requested.

So it's going to do something if the host name is not www.slicksurface.com - that 'something' is defined in the RewriteRule line. (.*) is the way in regular expressions that you grab a bunch of characters. So ^(.*)$ is saying everything from the start to the end. But realize that RewriteRule acts only on the part of the URL that's below your directory. So if you put the .htaccess file in the directory served as http://www.slicksurface.com/ then it works on anything that comes after that in the URL.

OK, so it's grabbed everything in the URL (not including the host name), and the second part of the RewriteRule line uses $1 to tack that onto the end of the www.slicksurface.com address. The third part (the portion in brackets) tells Apache that you want to do a 301 (permanent) redirect (R=301), and that no further rewrite rules should be processed for this request (L, for "last").

So try it out... Click on this link and see where you go...

http://slicksurface.com/blog/

Notice it looks like www. gets added to the URL, which is basically true. Strictly speaking nothing was "added": your browser was quickly redirected to the URL with the www in it.
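
If you'd rather standardize on the version without www, the same idea works in reverse. This is a sketch of how that might look, not something we run on this site. Since the path a RewriteRule sees in .htaccess has no leading slash, the substitution supplies the slash before $1.

```apache
RewriteEngine On
# Only act when the visitor asked for the www host name (NC = ignore case)
RewriteCond %{HTTP_HOST} ^www\.slicksurface\.com$ [NC]
# Permanently redirect the same path to the bare domain
RewriteRule ^(.*)$ http://slicksurface.com/$1 [R=301,L]
```

Whichever version you pick, the important part is to pick one and redirect the other to it.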

Page Moved Redirects

There are times when you want to move a page from one URL to another. You try to minimize doing these types of things, but sometimes they're just unavoidable. When you do need to move a page, you want to put a redirect in place so that search engine spiders and people following links from other sites know the page has moved and can still find what they're looking for.

A while back we migrated from using Blogger to using WordPress. This meant that all the URLs changed. We could write a general rule that covered many instances (see below), but some URLs didn't work with the rule. In those cases we had to have a simple RewriteRule to handle the rewrites. Here's an example of one of them...

RewriteEngine On
RewriteRule ^2007/04/4d-backup-improves-creation-of-log\.html$ http://www.slicksurface.com/2007-04/4d-backup-improves-creation-of-log-files [NC,R=301,L]

So, if it's in a .htaccess file in the /blog/ directory, that will take the URL http://www.slicksurface.com/blog/2007/04/4d-backup-improves-creation-of-log.html and redirect it to http://www.slicksurface.com/2007-04/4d-backup-improves-creation-of-log-files - notice that I'm doing a permanent 301 redirect. And lastly, the NC means that the rule is not case sensitive.

There are a few ways to do redirects with Apache, but that's how you'd do it with mod_rewrite.
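
One of those other ways is mod_alias, which has a plain Redirect directive. Here's a sketch of the same redirect, assuming the old page lived under /blog/ as described above. Unlike RewriteRule in .htaccess, Redirect wants the full URL path starting with a slash, and it doesn't use regular expressions at all.

```apache
# mod_alias version: old URL path first, then the full new URL
Redirect 301 /blog/2007/04/4d-backup-improves-creation-of-log.html http://www.slicksurface.com/2007-04/4d-backup-improves-creation-of-log-files
```

For a handful of one-off moves this is often easier to read. mod_rewrite earns its keep once you need conditions or patterns.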

Rules To Redirect Pages Based On A Pattern

When we migrated from Blogger to WordPress the directory structure went from /blog/YYYY/MM/file.html to /blog/YYYY-MM/file - that's a pretty easy pattern to migrate and we did so with the following mod_rewrite rule:

RewriteEngine On
RewriteRule ^(\d{4})/(\d{2})/(.+)\.html$   http://www.slicksurface.com/$1-$2/$3  [NC,R=301,L]

Let's look at this closely. Even if the regular expression code is beyond your current level of expertise, you can see that it's looking for something 4 and then something 2, and you'd be right if you guessed that's the 4 digit year followed by the 2 digit month. In other words, (\d{4}) gets a 4 digit numeric string and (\d{2}) a 2 digit numeric string. (.+) is much like the (.*) we saw above, except (.+) requires at least one character, where (.*) also matches when there's nothing there. By putting those things in parentheses, we can pull them out in the URL we want to redirect to. They simply go in order: $1, $2, $3...
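
To make the capture groups concrete, here's the same rule again with a worked trace in the comments. The file name some-post.html is made up for the illustration.

```apache
RewriteEngine On
# For a made-up request path of 2007/04/some-post.html:
#   (\d{4})  captures 2007       -> $1
#   (\d{2})  captures 04         -> $2
#   (.+)     captures some-post  -> $3
#   \.html$  matches the literal .html at the end
# Result: a 301 redirect to http://www.slicksurface.com/2007-04/some-post
RewriteRule ^(\d{4})/(\d{2})/(.+)\.html$ http://www.slicksurface.com/$1-$2/$3 [NC,R=301,L]
```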

Making The URL Different Than The File Name & Location

Rewriting the URL is the reason mod_rewrite is named what it is. Let's say you have some files on disk but you don't want their folder structure and file names to be the actual URL. An example is the MeSH medical thesaurus we put up here on slicksurface.com. It has over a hundred thousand files that needed to be organized in folders, but we didn't want those folders to be part of the URL, so we use mod_rewrite to accomplish our goal.

Let's take an example... We have the page on "Fungi" which has the following URL:

http://www.slicksurface.com/medical-thesaurus/descriptor/D005658/fungi.htm

But there's no actual document named fungi.htm in a folder named D005658. Instead the real file is at:

http://www.slicksurface.com/medical-thesaurus/descriptor/8/D005658.htm

What I did was distribute the files into 10 directories based on the last digit of their ID. Then when I write files that refer to those files I add on the title of the page as a fake file name.

Here's the mod_rewrite syntax:

RewriteEngine On
RewriteRule ^descriptor/D(\d{5})(\d)/   descriptor/$2/D$1$2.htm [L]

So what that did was serve one file when another file was requested. Notice that the virtual file name isn't used at all, so the following URL would work just as well...

http://www.slicksurface.com/medical-thesaurus/descriptor/D005658/foo.htm

Another thing you might notice is no $ was used - we just defined the beginning of the pattern and that was enough.
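
If you wanted to be stricter about what comes after the ID, you could anchor the end of the pattern as well. This is a sketch of one way to do it, not the rule we actually run, and the trace in the comments uses the Fungi URL from above.

```apache
RewriteEngine On
# For the request path descriptor/D005658/fungi.htm:
#   (\d{5})      captures 00565 -> $1
#   (\d)         captures 8     -> $2
#   [^/]+\.htm$  requires a single file name ending in .htm
# Result: descriptor/8/D005658.htm is served
RewriteRule ^descriptor/D(\d{5})(\d)/[^/]+\.htm$ descriptor/$2/D$1$2.htm [L]
```

With this version foo.htm still works, but a request with nothing after the ID, or with extra folders after it, no longer matches.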

Having One Template Control Everything In A Directory

Another example similar to the one we just covered is having a single template control all the URLs for a directory. For example, let's say you have a file named index.php that takes a GET parameter of 'id'. The URL might look something like this if you were calling the template directly:

http://www.slicksurface.com/test/index.php?id=1234

But you don't want to have it look like that, you want the URLs to look like this:

http://www.slicksurface.com/test/1234.htm

All the numbered pages don't have to actually exist - all those URLs can be passed onto the template using something like the following:

RewriteEngine On
RewriteRule ^(.+)\.htm$ index.php?id=$1 [L]

Using a strategy like that can be very powerful. It lets you have one PHP template that serves thousands of URLs, all of which are more search engine friendly than index.php?id=1234. You can do something similar with text, but it's more complicated to handle things like spaces and special characters.
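
If the ids are always numeric, a tighter version might look like the following. This is a sketch under that assumption, and the page=2 parameter in the comment is made up.

```apache
RewriteEngine On
# Only all-digit names are rewritten, so 1234.htm matches but about.htm doesn't
# QSA keeps any query string the visitor sent: 1234.htm?page=2 becomes index.php?id=1234&page=2
RewriteRule ^(\d+)\.htm$ index.php?id=$1 [QSA,L]
```

Restricting the pattern to digits means real .htm files in the same directory keep working, and the template never receives an id it wasn't designed for.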

Serving Pages From Other Sites As URLs On Your Site

.htaccess can be used with more than mod_rewrite. Another use is with mod_proxy. mod_proxy pulls pages from other sites and can show them as pages on your site, though there are some issues.

Let's take the following page on Dan Wong's site - it's the home page for the advanced web site design course.

http://www.dan-wong.com/advanced-web-design.htm

Here's a mod_rewrite rule that uses the P flag to specify that it's a mod_proxy situation, not a redirect.

RewriteEngine On
RewriteRule ^test/adv-3650$ http://www.dan-wong.com/advanced-web-design.htm [P,L]

With that you can see the content of the page at the following URL:

http://www.slicksurface.com/test/adv-3650

But notice that the page doesn't look right. That's because it's not being served from the correct site, so all of the relative links to stylesheets and images are broken. If he had started all of the references with http://www.dan-wong.com/... then the page would actually look correct.

This is useful when you have a web application server that is responsible for some, but not all, of the files on your website. In that case you can proxy the web application server, perhaps through a secure firewall. Because it's part of an overall system, where the web application server may actually think it is the entire site, the pages will render correctly.

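You don't have to use mod_rewrite to use mod_proxy. Be aware, though, that the ProxyPass directive isn't allowed in .htaccess files; it has to go in the main server config or the virtual host file. Here is an example of a mod_proxy statement that you might find there:

ProxyPass /customer/ http://127.0.0.1:8080/customer/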

That will pass all of the URLs from the customer directory onto a web application server that's responding to the 8080 port on the same machine as the web site.
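
ProxyPass is usually paired with ProxyPassReverse, which fixes up the Location header when the application server sends a redirect. Here's a sketch of the pair as it might appear in the virtual host file (ProxyPass isn't accepted inside .htaccess). It assumes mod_proxy and mod_proxy_http are both loaded.

```apache
# Forward requests under /customer/ to the application server on port 8080
ProxyPass        /customer/ http://127.0.0.1:8080/customer/
# Rewrite redirects coming back so visitors never see the 127.0.0.1 address
ProxyPassReverse /customer/ http://127.0.0.1:8080/customer/
```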

Excluding Directories From Being Controlled by WordPress

WordPress controls everything in the directory it's installed in. If you want to have say a resources directory that WordPress doesn't control, or a robots.txt file that WordPress doesn't control, then you can use something like this:

RewriteEngine On
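RewriteCond %{REQUEST_URI} !^/resources/
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . index.php [L]

The bottom three lines are standard for WordPress, but the top two exclude the resources directory and the robots.txt file. Note that REQUEST_URI always starts with a slash, so the patterns do too. These assume WordPress is installed at the root of the site; if it lives in a subdirectory such as /blog/, that prefix needs to be part of the patterns.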
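
Another way to get the same effect is a pass-through rule placed ahead of the WordPress rules. This is a sketch that assumes the .htaccess file sits in the WordPress directory and that resources/ and robots.txt live directly inside it.

```apache
RewriteEngine On
# "-" means leave the URL alone; L stops here so the WordPress rule below never sees it
RewriteRule ^(resources/|robots\.txt$) - [L]

# Standard WordPress rules
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . index.php [L]
```

Because RewriteRule patterns in .htaccess are relative to the directory, this version has no leading slash and doesn't care whether WordPress is at the root or in a subdirectory.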

Making HTM Files Execute PHP

Sometimes you want to hide the fact that a PHP file is really a PHP file, or you just like your URLs to end with .htm rather than .php. To make it so you can put PHP in .htm files, you put the following in the .htaccess file (the exact type name depends on how PHP is installed on your host):

AddType application/x-httpd-php .htm
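
On hosts where the AddType line has no effect, a handler-based version is worth a try. This is a sketch that assumes PHP runs as an Apache module (mod_php); hosts that run PHP another way use a different handler name, so check with your hosting provider.

```apache
# Assumes mod_php; the handler name is different on other PHP setups
<FilesMatch "\.htm$">
    # Run every file ending in .htm through PHP
    SetHandler application/x-httpd-php
</FilesMatch>
```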

Wrap Up

So there's a lot you can do with .htaccess files. Chances are if you're hitting a wall and need to do something special, there's a way to do it with an .htaccess file.

