Wednesday, August 27th, 2008

Proxy issues with querystrings in path names

Category: Performance

You have seen this before: /path/to/something.js?v=2, or maybe a date or a version control id or some such. The idea is to put the version into the URL so you can cache aggressively and yet push new versions quickly.
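To make the two approaches concrete, here is a minimal sketch of the URL-building logic; the paths, version strings, and function names are illustrative, not from any particular framework:

```python
# Sketch: two common ways to version a static asset URL.
# All names and paths here are illustrative assumptions.

def querystring_version(path: str, version: str) -> str:
    """Version via querystring, e.g. /js/app.js?v=2 (proxy-unfriendly)."""
    return f"{path}?v={version}"

def filename_version(path: str, version: str) -> str:
    """Version embedded in the filename, e.g. /js/app.2.js (proxy-friendly)."""
    base, dot, ext = path.rpartition(".")
    # If the path has no extension, just append the version.
    return f"{base}.{version}.{ext}" if dot else f"{path}.{version}"

print(querystring_version("/js/app.js", "2"))  # /js/app.js?v=2
print(filename_version("/js/app.js", "2"))     # /js/app.2.js
```

Both produce a URL that changes whenever the version changes; the difference, as the rest of this post shows, is in how proxies treat them.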

There have long been issues with using the querystring as the version. At some point I seem to remember Safari not doing a good job of caching in that scenario, treating each querystring variant as a different resource.

Steve “Neo” Souders has posted about this issue especially as it relates to proxy servers and default configurations:

There’s a section in my book called Revving Filenames. It contains an example of adding a version number to the filename. That’s prompted several emails where people have asked me about tradeoffs around using a querystring versus embedding something in the filename. I wasn’t aware of any performance difference, but in a meeting this week a co-worker, Jacob Hoffman-Andrews, mentioned that Squid, a popular proxy, doesn’t cache resources with a querystring. This hurts performance when multiple users behind a proxy cache request the same file – rather than using the cached version everybody would have to send a request to the origin server.

I tested this by creating two resources, mylogo.1.2.gif and mylogo.gif?v=1.2. Both have a far future Expires date. I configured my browser to go through a Squid proxy. I made one request to mylogo.1.2.gif, cleared my cache (to simulate another user making the request), and fetched mylogo.1.2.gif again. This produces the following HTTP headers:

>> GET http://stevesouders.com/mylogo.1.2.gif HTTP/1.1
<< HTTP/1.0 200 OK
<< Date: Sat, 23 Aug 2008 00:17:22 GMT
<< Expires: Tue, 21 Aug 2018 00:17:22 GMT
<< X-Cache: MISS from someserver.com
<< X-Cache-Lookup: MISS from someserver.com

>> GET http://stevesouders.com/mylogo.1.2.gif HTTP/1.1
<< HTTP/1.0 200 OK
<< Date: Sat, 23 Aug 2008 00:17:22 GMT
<< Expires: Tue, 21 Aug 2018 00:17:22 GMT
<< X-Cache: HIT from someserver.com
<< X-Cache-Lookup: HIT from someserver.com

Notice that the second response shows a HIT in the X-Cache and X-Cache-Lookup headers. This shows it was served by the Squid proxy. More evidence of this is the fact that the Date and Expires response headers have the same values, even though I made these requests 10 seconds apart. For conclusive evidence, only one hit shows up in the stevesouders.com access log.

Loading mylogo.gif?v=1.2 twice (clearing the cache in between) results in these headers:

>> GET http://stevesouders.com/mylogo.gif?v=1.2 HTTP/1.1
<< HTTP/1.0 200 OK
<< Date: Sat, 23 Aug 2008 00:19:34 GMT
<< Expires: Tue, 21 Aug 2018 00:19:34 GMT
<< X-Cache: MISS from someserver.com
<< X-Cache-Lookup: MISS from someserver.com

>> GET http://stevesouders.com/mylogo.gif?v=1.2 HTTP/1.1
<< HTTP/1.0 200 OK
<< Date: Sat, 23 Aug 2008 00:19:47 GMT
<< Expires: Tue, 21 Aug 2018 00:19:47 GMT
<< X-Cache: MISS from someserver.com
<< X-Cache-Lookup: MISS from someserver.com

Here it’s clear the second response was not served by the proxy: the caching response headers say MISS, the Date and Expires values change, and tailing the stevesouders.com access log shows two hits.

Proxy administrators can change the configuration to support caching resources with a querystring, when the caching headers indicate that is appropriate. But the default configuration is what web developers should expect to encounter most frequently. Another interesting note about these tests: notice how the proxy downgrades the responses to HTTP/1.0. This is going to alter browser behavior in terms of the number of connections that are opened. When I’m doing performance analysis I make sure to avoid being connected through a proxy.
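For context, the default behavior described here came from Squid's stock configuration of that era, which refused to cache any URL that looked dynamic. A sketch of the relevant squid.conf lines, assuming the classic pre-2.7 defaults (exact directives vary by Squid version):

```conf
# Classic Squid default (pre-2.7): treat any URL containing
# "cgi-bin" or a "?" as dynamic and never cache it.
acl QUERY urlpath_regex cgi-bin \?
cache deny QUERY

# Commenting out the two lines above lets Squid honor the
# Expires/Cache-Control headers on querystring URLs instead.
```

An administrator willing to trust origin caching headers can relax this, but as the post notes, developers should plan for the default.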


Posted by Dion Almaer at 6:06 am

4 Comments »


This has been known for a long time, and it is especially important to reconfigure your proxy server when it runs in a reverse proxy configuration.

One possible way around this is to put the version into the path and use mod_rewrite or a similar technique to remove it.

Comment by Malde — August 27, 2008
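The path-based technique the commenter describes could look like the following Apache mod_rewrite sketch; the URL layout (a leading version segment such as /v1.2/) is an assumption for illustration:

```conf
# Sketch: treat a leading version segment as purely cosmetic,
# so /v1.2/mylogo.gif serves the real file /mylogo.gif.
# The /vX.Y/ path convention is an illustrative assumption.
RewriteEngine On
RewriteRule ^v[0-9.]+/(.*)$ /$1 [L]
```

The version segment then exists only to bust caches; no files need to be renamed on disk when a new version ships.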

A quick look across the Alexa top ten U.S. sites shows that six of them suffer from this problem of using a querystring. I only looked at resources that were intended to be cached (had a future Expires or max-age time).

0 – http://www.aol.com/
0 – http://www.ebay.com/
13 – http://www.facebook.com/
0 – http://www.google.com/search?q=flowers
4 – http://search.live.com/results.aspx?q=flowers
1 – http://www.msn.com/
1 – http://www.myspace.com/
12 – http://en.wikipedia.org/wiki/Flowers
0 – http://www.yahoo.com/
1 – http://www.youtube.com/

Facebook has 118 resources (!), so only having 13 of them suffer from a querystring isn’t bad, but URLs like “facebook_logo.gif?0:67387″ could be fixed. For Wikipedia, 12 out of 20 cacheable resources contain a querystring (e.g., “ajax.js?169″).

Many of these sites have clearly worked hard to avoid querystrings, with URLs like “js_3.011.js” and “/4_7_0_227490/main.js”. That’s great, but there’s still more cleanup to do. A comment from my blog post pointed to a technique for versioning filenames posted by Kevin Hale that gives the specifics about how to tackle this in Apache with mod_rewrite.

Comment by souders — August 27, 2008
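The filename-revving variant referenced above (embedding the version in the file name itself, as in mylogo.1.2.gif) is typically implemented with a rewrite rule along these lines; the pattern and extensions are an illustrative sketch, not Kevin Hale's exact rule:

```conf
# Sketch: strip an embedded version number so that requests for
# mylogo.1.2.gif are served from the real file mylogo.gif.
# Patterns and extension list are illustrative assumptions.
RewriteEngine On
RewriteRule ^([^.]+)\.[0-9.]+\.(js|css|gif|png|jpg)$ $1.$2 [L]
```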

Running through a proxy will always skew your performance analysis, of course.

The issue with Squid is that it is almost HTTP/1.1 compliant, but not completely. Off the top of my head, the biggest gap is chunked encoding. Hopefully it will become fully 1.1 one day, but the code seems to be very hard to change.

Comment by berend — August 27, 2008

Over on Steve’s post is this nugget:

“Squid actually changed their default policy for caching dynamic URLs with their 2.7 release:
http://wiki.squid-cache.org/ConfigExamples/DynamicContent”

Comment by PaulIrish — September 24, 2009
