Cover V14, i03

Save Bandwidth and Increase Performance with Cache-Control Response Headers

Jeffrey Fulmer

Most Web sites contain static elements that are shared by several pages. Each page on a template-driven site will likely contain common elements such as style sheets, JavaScript files, and images. As a browser parses HTML, it looks for the items required to construct the page. If each page view requires the browser to download the same elements again, a lot of bandwidth will be wasted in just a few short clicks.

The problem is compounded during periods of peak activity. Flash crowds keep Web systems administrators awake at night. If our hypothetical browser is one of thousands pulling down information on breaking news or following a Slashdot link, the administrator will want to conserve all possible bandwidth. For this reason, the HTTP protocol includes cache directives, and contemporary browsers cache documents to reduce bandwidth consumption. These mechanisms provide a means for document reuse.

Overview

When a browser downloads resources, it may store, or "cache," them in memory or on disk. If a stored element is required a second time, the browser can simply pull it from its cache. This saves time and bandwidth, because documents are downloaded once and used repeatedly. Our tireless administrator may get some sleep after all. Of course, nothing is free, and this solution has its problems. For one thing, most humans can't afford the resources necessary to support an infinite cache. For another, documents change. The browser overcomes the first limitation by expiring elements from its cache. It overcomes the second by asking the Web server whether its copy is up to date.

The HTTP protocol provides several directives to facilitate document caching. When a user clicks a link, the browser or its proxy server may store elements for reuse. During this transaction, the Web server may affix an expiration date to each document. This designates its "freshness" period. Just as you may safely use refrigerated sour cream before its expiration date, a browser may use cached elements without contacting the Web server as long as those elements remain "fresh." If a cached document has passed its expiration date, that is, if it has become "stale," then the browser must ask the Web server whether it may still be used. This is known as "cache revalidation."
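
The decision a cache makes for each request can be sketched in a few lines of Python. This is only an illustration of the fresh/stale logic described above, not how any particular browser implements it; the entry layout (`stored_at` and `lifetime`, both in seconds) and the function name are assumptions made for the example.

```python
def cache_action(entry, now):
    """Decide what to do for a requested document.

    entry is None when nothing is cached; otherwise it is a dict with
    'stored_at' (time the copy was stored) and 'lifetime' (freshness
    period), both in seconds. These field names are hypothetical.
    """
    if entry is None:
        return "fetch"            # nothing cached: download the document
    age = now - entry["stored_at"]
    if age < entry["lifetime"]:
        return "use-cache"        # still fresh: no contact with the server
    return "revalidate"           # stale: ask the server before reusing it


entry = {"stored_at": 1000, "lifetime": 7200}
print(cache_action(entry, 2000))   # well inside the freshness period
print(cache_action(entry, 9000))   # past expiration
```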

If-Modified-Since Header

The most frequently used mechanism to revalidate a cached document is the If-Modified-Since header. If the browser has a stale document in its cache, it may request that document on the condition that it has been modified since the date attached to the cached copy. In this situation, the Web server has two options -- it may tell the browser to use the copy from its cache, or it may deliver a new document. To illustrate this interaction between a browser and server, let's examine an HTTP header exchange.

In the first scenario, the Web server sends HTTP code 304 to indicate that the cached copy is still valid. The browser may use the one it has. Here's the transaction:

GET /images/limey.gif HTTP/1.1
Host: www.joedog.org:80
Cookie:
Accept: */*
Accept-Encoding: *
User-Agent: Mozilla/5.0 (X11; U; AIX 4.2; en-US; rv:1.4) Gecko/20031128
If-Modified-Since: Tue, 01 Jun 2004 05:10:39 GMT
Connection: close

HTTP/1.1 304 Not Modified
Date: Thu, 09 Sep 2004 13:11:52 GMT
Server: Apache/1.3.29 (Unix) PHP/4.3.6 mod_ssl/2.8.16 OpenSSL/0.9.7d
Connection: close
ETag: "f13b-51bd-40b21ef5"

In a second scenario, the server sends HTTP code 200 along with a new copy of the document:

GET /images/limey.gif HTTP/1.1
Host: www.joedog.org:80
Cookie:
Accept: */*
Accept-Encoding: *
User-Agent: Mozilla/5.0 (X11; U; AIX 4.2; en-US; rv:1.4) Gecko/20031128
If-Modified-Since: Mon, 24 May 2004 16:12:36 GMT
Connection: close

HTTP/1.1 200 OK
Date: Thu, 09 Sep 2004 13:15:08 GMT
Server: Apache/1.3.29 (Unix) PHP/4.3.6 mod_ssl/2.8.16 OpenSSL/0.9.7d
Last-Modified: Mon, 24 May 2004 16:12:37 GMT
ETag: "f13b-51bd-40b21ef5"
Accept-Ranges: bytes
Content-Length: 20925
Connection: close
Content-Type: image/gif

In both situations, our browser opened a connection to the Web server. But in the first request, the only thing exchanged was HTTP header information. In the second case, the browser pulled down a 21K file. It doesn't take much imagination to see how HTTP cache control can save bandwidth, especially on a Web site with a lot of common templates.
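
The server's side of this exchange boils down to a date comparison. The following Python sketch shows the idea using the dates from the second transaction; it is a simplification under stated assumptions (a real server such as Apache also considers ETags and other conditional headers), and the function name is made up for the example.

```python
from email.utils import parsedate_to_datetime


def conditional_status(if_modified_since, last_modified):
    """Return 304 if the client's copy is current, otherwise 200.

    Both arguments are HTTP date strings; if_modified_since may be None
    when the client sent no conditional header.
    """
    if if_modified_since is None:
        return 200
    client_time = parsedate_to_datetime(if_modified_since)
    file_time = parsedate_to_datetime(last_modified)
    # The file is resent only if it changed after the client's date.
    return 304 if file_time <= client_time else 200


# Dates from the second transaction: the file changed one second later.
print(conditional_status("Mon, 24 May 2004 16:12:36 GMT",
                         "Mon, 24 May 2004 16:12:37 GMT"))  # prints 200
```

Passing a client date that is on or after the file's modification time yields 304, which corresponds to the header-only exchange in the first transaction.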

As human beings, we understand our Web sites better than a browser or a Web server. We know which elements will be modified and which ones will not. Some elements aren't going to change between now and the end of time (e.g., spacer.gif). It would be nice if we could provide some input to influence cache control.

Cache-Control Header

Fortunately, the HTTP protocol provides mechanisms that allow us to specify document expirations. The Cache-Control header can be used to set the maximum age, in seconds, of a cached document. This is the elapsed time from when the response was generated until it can no longer be served from cache without revalidation. For example, this directive tells the cache to expire the document two hours (7200 seconds) after it was generated:

Cache-Control: max-age=7200

The Expires header allows us to set an absolute expiration date. Once that moment has passed, the document is considered stale. Here is another example:

Expires: Mon, 13 Sep 2004 16:00:00 GMT

Of the two, Cache-Control is preferable. The Expires header depends on clock synchronization between the server and the client. The proliferation of appliance clocks that blink "12:00" should suggest a problem with that dependency.
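
To make the relationship between the two headers concrete, here is a minimal Python sketch of how a cache might derive a freshness lifetime from a response. It assumes the headers arrive as a plain dictionary and that max-age takes precedence over Expires when both are present; the function name is hypothetical.

```python
import re
from email.utils import parsedate_to_datetime


def freshness_lifetime(headers):
    """Return the freshness lifetime in seconds, or None if unspecified."""
    # Prefer the relative max-age value: it needs no synchronized clocks.
    match = re.search(r"max-age=(\d+)", headers.get("Cache-Control", ""))
    if match:
        return int(match.group(1))
    # Fall back to the absolute Expires date, measured against the
    # server's own Date header rather than the local clock.
    if "Expires" in headers and "Date" in headers:
        expires = parsedate_to_datetime(headers["Expires"])
        date = parsedate_to_datetime(headers["Date"])
        return max(0, int((expires - date).total_seconds()))
    return None


print(freshness_lifetime({
    "Date": "Mon, 13 Sep 2004 14:00:00 GMT",
    "Expires": "Mon, 13 Sep 2004 16:00:00 GMT",
}))  # prints 7200
```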

Expires Module

The Apache Web server provides several mechanisms that allow us to explicitly set cache directives. The Expires module, a.k.a. mod_expires, is bundled with Apache 1.2 and higher (see the sidebar "Adding Modules" for more information). It allows the administrator to set the Cache-Control and Expires HTTP headers. Expirations may be set according to either a file's modification time or the last time it was accessed by the client. We can configure documents to expire immediately or well into the distant future.

The module contains two directives for setting expirations. ExpiresDefault sets the expiration time for an entire server configuration, a virtual host or a directory. ExpiresByType allows you to set expirations by MIME type (e.g., expire the cache for every jpeg in /images/maps one hour from now). The syntax for both directives looks like this:

ExpiresDefault "<base> [plus] {<num> <type>}*"
ExpiresByType type/encoding "<base> [plus] {<num> <type>}*"

Now let's consider the key components in the directives above: base, plus, num, and type. <base> is a reference time. It may be set to "access," "now" (which is equivalent to "access"), or "modification." "Now" refers to the time of the client's access, and "modification" refers to the file's modification time. The second component is an optional keyword. [plus] makes the configuration easier to understand; "now plus time" makes more sense than "now time."

The final two components are coupled together. <num> is an integer and <type> describes it. For example, "now plus 1 day" expires 24 hours after access, whereas "now plus 0 seconds" expires immediately. The module supports the following types: years, months, weeks, days, hours, minutes, and seconds.
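
To show how such a specification breaks down, the following Python sketch converts one into a base keyword and a number of seconds. It is not the module's own parser; the function name is made up, and treating a month as 30 days and a year as 365 days is an approximation assumed for the example.

```python
# Seconds per unit; month and year lengths are assumed approximations.
SECONDS = {
    "year": 365 * 86400, "month": 30 * 86400, "week": 7 * 86400,
    "day": 86400, "hour": 3600, "minute": 60, "second": 1,
}


def parse_expiry(spec):
    """Split '<base> [plus] {<num> <type>}*' into (base, total_seconds)."""
    tokens = spec.lower().split()
    if not tokens or tokens[0] not in ("now", "access", "modification"):
        raise ValueError("unknown base in %r" % spec)
    base, rest = tokens[0], tokens[1:]
    if rest and rest[0] == "plus":      # the keyword is optional
        rest = rest[1:]
    if not rest or len(rest) % 2 != 0:
        raise ValueError("expected <num> <type> pairs in %r" % spec)
    total = 0
    for num, unit in zip(rest[0::2], rest[1::2]):
        unit = unit.rstrip("s")         # accept "day" and "days" alike
        if unit not in SECONDS:
            raise ValueError("unknown type %r" % unit)
        total += int(num) * SECONDS[unit]
    return base, total


print(parse_expiry("now plus 1 day"))       # ('now', 86400)
print(parse_expiry("now plus 0 seconds"))   # ('now', 0)
```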

Let's consider some sample configurations:

<Directory "/data/www/public_html/images/">
  <IfModule mod_expires.c>
    ExpiresActive On
    ExpiresDefault "now plus 2 weeks"
  </IfModule>
</Directory>

In the preceding example, we introduced the ExpiresActive directive. It takes a single argument that is either "on" or "off." This enables or disables the module. The configuration is applied at directory level. We used the ExpiresDefault directive to expire everything in the directory two weeks after it's accessed.

We could also expire items at different times and by different types with the ExpiresByType directive:

<Directory "/data/www/public_html/common/">
  <IfModule mod_expires.c>
    ExpiresActive On
    ExpiresByType image/png  "now plus 24 hours"
    ExpiresByType image/gif  "now plus 0 minutes"
    ExpiresByType text/css   "now plus 2 hours"
  </IfModule>
</Directory>
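
As noted earlier, the module sets both the Expires and Cache-Control headers. The Python sketch below shows roughly what a response would carry for an expiration of "now plus 2 hours"; it is an illustration of the arithmetic only, with a made-up function name and an arbitrary access time, not output captured from Apache.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime


def expiry_headers(access_time, lifetime_seconds):
    """Build the two expiration headers for a given access time (UTC)."""
    expires = access_time + timedelta(seconds=lifetime_seconds)
    return {
        # Absolute form, as an HTTP date in GMT.
        "Expires": format_datetime(expires, usegmt=True),
        # Relative form, in seconds.
        "Cache-Control": "max-age=%d" % lifetime_seconds,
    }


accessed = datetime(2004, 9, 13, 14, 0, 0, tzinfo=timezone.utc)
for name, value in expiry_headers(accessed, 2 * 3600).items():
    print("%s: %s" % (name, value))
# Expires: Mon, 13 Sep 2004 16:00:00 GMT
# Cache-Control: max-age=7200
```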

Headers Module

Apache provides another mechanism to send cache control instructions to the client. The Headers module, mod_headers, lets administrators customize HTTP response headers. If we can write our own headers, then why not write Cache-Control or Expires headers? The module contains two directives that allow us to customize HTTP responses. Header allows us to set headers on 1xx and 2xx responses, and ErrorHeader allows us to set headers on 3xx, 4xx, and 5xx responses. The syntax looks like this:

Header set|append|add|unset <header> <value>
ErrorHeader set|append|add|unset <header> <value>

For our purposes, set and unset are the most important arguments. The former sets the response header and replaces any existing ones with the same name. The latter removes the header; if several headers with the same name exist, then it unsets them all. Consider the following example:

<FilesMatch "\.gif$">
  <IfModule mod_headers.c>
    Header set Cache-control max-age=9200
  </IfModule>
</FilesMatch>

We can send multiple Cache-Control directives to the client in a single header. If we want the client to revalidate the document and not store it in cache, we can send this combination:

<FilesMatch "\.gif$">
  <IfModule mod_headers.c>
    Header set Cache-control \
      "no-cache, no-store"
  </IfModule>
</FilesMatch>

In the example above, no-cache tells the client to revalidate the document before reusing it, and no-store instructs it not to place the document in cache. Here are some other Cache-Control directives to consider: max-age=num sets the freshness period in seconds, and must-revalidate requires the cache to revalidate a document with the server once it has become stale rather than serve the stale copy. For a complete list of Cache-Control directives and their meanings, see RFC 2616 at:

ftp://ftp.rfc-editor.org/in-notes/rfc2616.txt
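
Because several directives travel in one comma-separated header value, a client has to split them apart before acting on them. The following Python sketch shows one simple way to do that; it is an assumption-laden illustration (the function name is made up, and quoted values containing commas are not handled), not a complete parser for the RFC 2616 grammar.

```python
def parse_cache_control(value):
    """Split a Cache-Control header value into a {directive: argument} dict.

    Directives without an argument (e.g. no-store) map to None.
    Simplification: quoted arguments that contain commas are not supported.
    """
    directives = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, arg = part.partition("=")
        arg = arg.strip().strip('"')
        directives[name.strip().lower()] = arg if arg else None
    return directives


print(parse_cache_control("no-cache, no-store"))
# {'no-cache': None, 'no-store': None}
print(parse_cache_control("max-age=9200, must-revalidate"))
# {'max-age': '9200', 'must-revalidate': None}
```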

From the examples above, it's obvious that we can set cache directives by manipulating HTTP response headers. The modules make it easy to set cache controls for large portions of a Web site, but anything that can write response headers can send the same instructions. Consider this CGI script:

#! /bin/sh
echo Content-type: text/plain
echo Cache-control: must-revalidate
echo
echo Hello, world.
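
The same idea carries over to any language that can write to standard output. Here is a comparable sketch in Python; the helper name and the two-hour max-age value are assumptions chosen for illustration, not part of the original script.

```python
import sys


def cgi_response(body, max_age):
    """Return a CGI response: headers, a blank line, then the body."""
    headers = [
        "Content-Type: text/plain",
        "Cache-Control: max-age=%d" % max_age,   # freshness period in seconds
    ]
    return "\r\n".join(headers) + "\r\n\r\n" + body


if __name__ == "__main__":
    # The Web server relays these headers to the client unchanged.
    sys.stdout.write(cgi_response("Hello, world.\n", 7200))
```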

We can also embed these directives in HTML 2.0 and higher:

<html>
<head>
  <title>Hello, World</title>
  <meta http-equiv="Cache-control" content="must-revalidate">
</head>
<body><b>Hello, World</b></body>
</html>

For more information, see RFC 1866 at:

ftp://ftp.rfc-editor.org/in-notes/rfc1866.txt

The techniques I've discussed provide a Web systems administrator with a means to reduce latency, save bandwidth, and decrease server load. Best of all, they require no out-of-pocket expenses. A log analysis tool can demonstrate improvement from one month to the next. The perfect time to present your boss with such reports is one month before raises are determined. Enjoy.

Jeff Fulmer is a Web systems administrator in a Fortune 500 corporation. He's administered Web servers professionally for more than eight years. His software was featured as UnixReview.com's Tool of the Month in July 2002 (http://www.unixreview.com/documents/s=7458/uni1026336671481/). Jeff is the author of siege, an open source HTTP benchmarking and regression testing tool.