Save
Bandwidth and Increase Performance with Cache-Control Response Headers
Jeffrey Fulmer
Most Web sites contain static elements that are shared by several
pages. Each page on a template-driven site will likely contain common
elements such as style sheets, Java scripts, and images. As a browser
parses HTML, it looks for items required to construct the page.
If each request requires the browser to repeatedly download the
same elements, a lot of unnecessary bandwidth will be consumed in
just a few short clicks.
The problem is compounded during periods of peak activity. Flash
crowds keep Web systems administrators awake at night. If our hypothetical
browser is one of thousands pulling down information on breaking
news or is following a Slashdot link, the administrator will want
to conserve all possible bandwidth. For this reason, HTTP protocol
contains cache directives, and contemporary browsers cache documents
to reduce bandwidth. These mechanisms provide a means for document
reuse.
Overview
When a browser downloads resources, it may store them in memory
or on disk. It may "cache" the documents. If a stored element is
required a second time, then the browser may simply pull it from
its cache. This saves time and bandwidth. It may download documents
once and use them repetitively. Our tireless administrator may get
some sleep after all. Of course, nothing is free. There are problems
associated with this solution. For one thing, most humans can't
afford the resources necessary to support an infinite cache. For
another, documents change. The browser overcomes the first limitation
by expiring elements from its personal cache. It overcomes the second
by asking the Web server if its personal copy is up to date.
HTTP protocol provides several directives to facilitate document
caching. When a user clicks a link, the browser or its proxy server
may store elements for reuse. During this transaction, the Web server
may affix an expiration date to each document. This designates its
"freshness" period. Just as you may safely use refrigerated sour
cream before its expiration date, a browser may use cached elements
without contacting the Web server as long as those elements remain
"fresh." If a cached document is available after its expiration,
or after it has become "stale," then the browser must ask the Web
server whether it may still be used. This is known as "cache revalidation."
If-Modified-Since Header
The most frequently used mechanism to revalidate a cached document
is the If-Modified-Since header. If the browser has a stale document
in its cache, it may request that document on the condition that
its version is no longer valid. In this situation, the Web server
has two options -- it may tell the browser to use the copy from
its cache, or it may deliver a new document. To illustrate this
interaction between a browser and server, let's examine an HTTP
header exchange.
In the first scenario, the Web server sends HTTP code 304 to indicate
that the cached copy is still valid. The browser may use the one
it has. Here's the transaction:
GET /images/limey.gif HTTP/1.1
Host: www.joedog.org:80
Cookie:
Accept: */*
Accept-Encoding: *
User-Agent: Mozilla/5.0 (X11; U; AIX 4.2; en-US; rv:1.4) Gecko/20031128
If-Modified-Since: Tue, 01 Jun 2004 05:10:39 GMT
Connection: close
HTTP/1.1 304 Not Modified
Date: Thu, 09 Sep 2004 13:11:52 GMT
Server: Apache/1.3.29 (Unix) PHP/4.3.6 mod_ssl/2.8.16 OpenSSL/0.9.7d
Connection: close
ETag: "f13b-51bd-40b21ef5"
In a second scenario, the server sends HTTP code 200 along with a
new copy of the document:
GET /images/limey.gif HTTP/1.1
Host: www.joedog.org:80
Cookie:
Accept: */*
Accept-Encoding: *
User-Agent: Mozilla/5.0 (X11; U; AIX 4.2; en-US; rv:1.4) Gecko/20031128
If-Modified-Since: Mon, 24 May 2004 16:12:36 GMT
Connection: close
HTTP/1.1 200 OK
Date: Thu, 09 Sep 2004 13:15:08 GMT
Server: Apache/1.3.29 (Unix) PHP/4.3.6 mod_ssl/2.8.16 OpenSSL/0.9.7d
Last-Modified: Mon, 24 May 2004 16:12:37 GMT
ETag: "f13b-51bd-40b21ef5"
Accept-Ranges: bytes
Content-Length: 20925
Connection: close
Content-Type: image/gif
In both situations, our browser opened a connection to the Web server.
But in the first request, the only thing exchanged was HTTP header
information. In the second case, the browser pulled down a 21K file.
It doesn't take much imagination to see how HTTP cache control can
save bandwidth, especially on a Web site with a lot of common templates.
As human beings, we understand our Web sites better than a browser
or a Web server. We know which elements will be modified and which
ones will not. Some elements aren't going to change between now
and the end of time (e.g., spacer.gif). It would be nice if we could
provide some input to influence cache control.
Cache-Control Header
Fortunately, HTTP protocol provides mechanisms that allow us to
specify document expirations. The Cache-Control header can be used
to set the maximum age in seconds of a cached document. This is
the elapsed time from when it was generated until when it can no
longer be served. For example, this directive tells the cache to
expire the document two hours from now:
Cache-Control: max-age=640000
The Expires header allows us to set an absolute expiration date. Once
that moment has passed, the document is considered stale. Here is
another example:
Expires: Mon, 13 Sep 2004, 16:00:00 GMT
Of the two, Cache-Control is preferable. The Expires directive depends
on clock synchronization. The proliferation of appliance clocks that
blink "12:00" should suggest a problem with that dependency.
Expires Module
The Apache Web server provides several mechanisms that allow us
to explicitly set cache directives. The Expires module, a.k.a. mod_expires,
is bundled with Apache 1.2 and higher (see the sidebar "Adding Modules"
for more information). It allows the administrator to set the Cache-Control
and Expires HTTP headers. Expirations may be set according to either
a file's modification time or the last time it was accessed by the
client. We can configure documents to expire immediately or well
into the distant future.
The module contains two directives for setting expirations. ExpiresDefault
sets the expiration time for an entire server configuration, a virtual
host or a directory. ExpiresByType allows you to set expirations
by MIME type (e.g., expire the cache for every jpeg in /images/maps
one hour from now). The syntax for both directives looks like this:
ExpiresDefault "<base> [plus] {<num> <type>}*"
ExpiresByType type/encoding "<base> [plus] {<num> <type>}*"
Now let's consider the key components in the directives above: base,
plus, num, and type. <base> is a reference time. It may
be set to either "now" or "modification." "Now" refers to the access
time, and "modification" refers to the file's modification time. The
second component is an optional keyword. [plus] makes the configuration
easier to understand; "now plus time" makes more sense than "now time."
The final two components are coupled together. <num>
is an integer and <type> describes it. For example,
"now plus 1 day" expires 24 hours after access, whereas "now plus
0 seconds" expires immediately. The module supports the following
types: years, months, weeks, days, hours, minutes, and seconds.
Let's consider some sample configurations:
<Directory "/data/www/public_html/images/">
<IfModule mod_expires.c>
ExpiresAction On
ExpiresDefault "now plus 2 weeks"
</IfModule>
</Directory>
In the preceding example, we introduced the ExpiresActive directive.
It takes a single argument that is either "on" or "off." This enables
or disables the module. The configuration is applied at directory
level. We used the ExpiresDefault directive to expire every thing
in the directory two weeks after it's accessed.
We could also expire items at different times and by different
types with the ExpiresByType directive:
<Directory "/data/www/public_html/common/">
<IfModule mod_expires.c>
ExpiresActive On
ExpiresByType image/png "now plus 24 hours"
ExpiresByType image/gif "now plus 0 minutes"
ExpiresByType text/css "now plus 2 hours"
</IfModule>
</Directory>
Headers Module
Apache provides another mechanism to send cache control instructions
to the client. The Headers module, mod_headers, lets administrators
customize HTTP response headers. If we can write our own headers,
then why not write Cache-control or Expires headers? The module
contains two directives that allow us to customize HTTP response.
Header allows us to write 1xx and 2xx headers, and ErrorHeader allows
us to write 3xx, 4xx, and 5xx headers. The syntax looks like this:
Header set|append|add|unset <header> <value>
ErrorHeader set|append|add|unset <header> <value>
For our purposes, set and unset are the most important
arguments. The former sets the response header and replaces any existing
ones with the same name. The latter removes the header; if several
headers with the same name exist, then it unsets them all. Consider
the following example:
<FilesMatch "*.gif">
<IfModule mod_headers.c>
Header set Cache-control max-age=9200
</IfModule>
</FilesMatch>
We can send multiple Cache-control headers to the client. If we want
the client to revalidate the document and not store it in cache, we
can send this combination:
<FilesMatch "*.gif">
<IfModule mod_headers.c>
Header set Cache-control \
"no-cache, no-store"
</IfModule>
</FilesMatch>
In the example above, no-cache tells the client to revalidate
the document, and no-store instructs it not to place the document
in cache. Here are some other Cache-control headers to consider: max-age=num
sets the freshness period in seconds and must-revalidate requires
the client to always revalidate. For a complete list of Cache-control
headers and their meanings, see RFC 2616 at:
ftp://ftp.rfc-editor.org/in-notes/rfc2616.txt
From the examples above, it's obvious that we can set cache directives
by manipulating HTTP response headers. While those modules make it
easy to set cache controls for large portions of a Web site, as long
as we can write response headers, we can send those instructions.
Consider this CGI script:
#! /bin/sh
echo Content-type: text/plain
echo Cache-control: must-revalidate
echo
echo Hello, world.
We can also embed these directives in HTML 2.0 and higher:
<html>
<head>
<title>Hello, World</title>
<meta http-equiv="Cache-control" content="must-revalidate">
</head>
<body><b>Hello, World</b></body>
</html>
See RFC1866 for more information at:
ftp://ftp.rfc-editor.org/in-notes/rfc1866.txt
The techniques I've discussed provide a Web systems administrator
with a means to reduce latency, save bandwidth, and decrease server
load. Best of all, they require no out-of-pocket expenses. A log analysis
tool can demonstrate improvement from one month to the next. The perfect
time to present your boss with such reports is one month before raises
are determined. Enjoy.
Jeff Fulmer is a Web systems administrator in a Fortune 500
corporation. He's administered Web servers professionally for more
than eight years. His software was featured as UnixReview.com's
Tool of the Month in July 2002 (http://www.unixreview.com/documents/s=7458/uni1026336671481/).
Jeff is the author of siege, an open source http benchmarking and
regression tester. |