Wednesday, October 1st, 2008
Category: Performance, Utility

Steve Souders is launching Hammerhead today at The Ajax Experience.
What is Hammerhead? I kinda think of it as continuous integration for performance. It is a Firebug plugin that you can set up to monitor the performance of your application. Imagine you add a new feature that you think will speed things up: this tool will let you know how performance was really affected.
There are also cool features when you just want to whip it up on your own Firebug:
Even if you’re not hammering a site, other features make Hammerhead a useful add-on. The Cache & Time panel, shown in Figure 3, shows the current URL’s load time. It also contains buttons to clear the disk and memory cache, or just the memory cache. It has another feature that I haven’t seen anywhere else. You can choose to have Hammerhead clear these caches after every page view. This is a nice feature for me when I’m loading the same page again and again to see its performance in an empty or a primed cache state. If you forget to switch this back, it gets reset automatically next time you restart Firefox.
Finally, Steve Lamm posted on the Google Code blog about testing slower connections as well as the high speed one that you are probably on, and the techniques for doing that with Hammerhead.
Steve continues to come up with small useful tools for Web developers. Thanks Steve!
Tuesday, September 30th, 2008
Category: Performance, The Ajax Experience
We’ve heard a lot about optimizing CSS, HTML and JavaScript but one thing that is less talked about is how much extra information image editors put into image files. You might think you’ve done a great job optimizing your GIFs, PNGs and JPGs while still keeping them visually pleasing but when you use a text editor you’ll realize that there is quite a big amount of data you can save by removing information about the image editor used, the date the file was edited last and lots of other bits that really are redundant.
There are a lot of free tools that strip this information from the files for you and squeeze some extra optimization out of the file without affecting the look. The problem is that all of them are command-line based and you need to know how to use them. Stoyan Stefanov and Nicole Sullivan of the Yahoo exceptional performance team took all of these tools and their experience in using them and built one application that does all the optimizations for you in one go:

You can upload images, give it a URL or use smushit as a Firefox extension or bookmarklet. Smushit will show you how many bytes you can save by removing cruft from the images and gives you all the images as a zip file to replace them on your site.
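To make the idea of "cruft" concrete, here is a minimal sketch for one format, PNG. It is not how Smushit works (which, as described above, builds on existing command-line tools); the helper name `stripPngMetadata` and the choice of which chunk types to drop are assumptions made for illustration. A PNG file is an 8-byte signature followed by chunks, and the text and timestamp chunks can be dropped without touching the pixel data.

```javascript
// Hypothetical helper (not part of Smushit): drop text/time metadata chunks
// from a PNG held in a Uint8Array.
// PNG layout: 8-byte signature, then chunks of
// [4-byte big-endian length][4-byte type][data][4-byte CRC].
var METADATA_CHUNKS = { tEXt: true, zTXt: true, iTXt: true, tIME: true };

function stripPngMetadata(bytes) {
  var kept = [bytes.subarray(0, 8)]; // always keep the signature
  var offset = 8;
  while (offset + 12 <= bytes.length) {
    var length = ((bytes[offset] << 24) | (bytes[offset + 1] << 16) |
                  (bytes[offset + 2] << 8) | bytes[offset + 3]) >>> 0;
    var type = String.fromCharCode(bytes[offset + 4], bytes[offset + 5],
                                   bytes[offset + 6], bytes[offset + 7]);
    var end = offset + 12 + length; // 8 header bytes + data + 4 CRC bytes
    if (end > bytes.length) break;  // truncated chunk: stop copying here
    if (!METADATA_CHUNKS[type]) {
      kept.push(bytes.subarray(offset, end)); // chunk is copied unchanged
    }
    offset = end;
  }
  // Concatenate the kept pieces into one new array.
  var total = 0;
  for (var i = 0; i < kept.length; i++) total += kept[i].length;
  var out = new Uint8Array(total);
  var pos = 0;
  for (var j = 0; j < kept.length; j++) {
    out.set(kept[j], pos);
    pos += kept[j].length;
  }
  return out;
}
```

Because whole chunks are either copied or skipped, the CRCs of the remaining chunks stay valid. Real optimizers go further by recompressing the image data itself.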
Here’s a video of Stoyan and Nicole presenting Smushit.com at The Ajax Experience in Boston (sorry about the audio):
Saturday, September 20th, 2008
Category: JavaScript, Performance, Safari

While Ben and I were talking about JavaScript performance (and other things) at Web 2.0 Expo NYC, Maciej Stachowiak announced SquirrelFish Extreme, the very new and improved version that appears to do very well at SunSpider:
SquirrelFish Extreme: 943.3 ms
V8: 1280.6 ms
TraceMonkey: 1464.6 ms
What makes it so fast?
SquirrelFish Extreme uses four different technologies to deliver much better performance than the original SquirrelFish: bytecode optimizations, polymorphic inline caching, a lightweight “context threaded” JIT compiler, and a new regular expression engine that uses our JIT infrastructure.
1. Bytecode Optimizations
When we first announced SquirrelFish, we mentioned that we thought that the basic design had lots of room for improvement from optimizations at the bytecode level. Thanks to hard work by Oliver Hunt, Geoff Garen, Cameron Zwarich, myself and others, we implemented lots of effective optimizations at the bytecode level.
One of the things we did was to optimize within opcodes. Many JavaScript operations are highly polymorphic - they have different behavior in lots of different cases. Just by checking for the most common and fastest cases first, you can speed up JavaScript programs quite a bit.
In addition, we’ve improved the bytecode instruction set, and built optimizations that take advantage of these improvements. We’ve added combo instructions, peephole optimizations, faster handling of constants and some specialized opcodes for common cases of general operations.
2. Polymorphic Inline Cache
One of our most exciting new optimizations in SquirrelFish Extreme is a polymorphic inline cache. This is an old technique originally developed for the Self language, which other JavaScript engines have used to good effect.
Here is the basic idea: JavaScript is an incredibly dynamic language by design. But in most programs, many objects are actually used in a way that resembles more structured object-oriented classes. For example, many JavaScript libraries are designed to use objects with “x” and “y” properties, and only those properties, to represent points. We can use this knowledge to optimize the case where many objects have the same underlying structure - as people in the dynamic language community say, “you can cheat as long as you don’t get caught”.
So how exactly do we cheat? We detect when objects actually have the same underlying structure — the same properties in the same order — and associate them with a structure identifier, or StructureID. Whenever a property access is performed, we do the usual hash lookup (using our highly optimized hashtables) the first time, and record the StructureID and the offset where the property was found. Subsequent times, we check for a match on the StructureID - usually the same piece of code will be working on objects of the same structure. If we get a hit, we can use the cached offset to perform the lookup in only a few machine instructions, which is much faster than hashing.
Here is the classic Self paper that describes the original technique. You can look at Geoff’s implementation of the StructureID class in Subversion to see more details of how we did it.
We’ve only taken the first steps on polymorphic inline caching. We have lots of ideas on how to improve the technique to get even more speed. But already, you’ll see a huge difference on performance tests where the bottleneck is object property access.
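The caching scheme described in this section can be modelled in a few lines of JavaScript. This is a toy sketch, not WebKit's implementation: the names `structureFor`, `makeObject` and `makeAccessSite` are invented for illustration, objects are modelled as a structure plus an array of slots, and for brevity each access site remembers only one StructureID (a polymorphic cache would keep several StructureID/offset pairs per site).

```javascript
// Toy model: a "structure" maps property names to slot offsets, and objects
// with the same properties in the same order share one structure.
var nextStructureId = 1;
var structuresByShape = Object.create(null);

function structureFor(propertyNames) {
  var key = propertyNames.join(',');
  if (!structuresByShape[key]) {
    var offsets = Object.create(null);
    for (var i = 0; i < propertyNames.length; i++) {
      offsets[propertyNames[i]] = i;
    }
    structuresByShape[key] = { id: nextStructureId++, offsets: offsets };
  }
  return structuresByShape[key];
}

function makeObject(propertyNames, values) {
  return { structure: structureFor(propertyNames), slots: values.slice() };
}

// One cache per access site, i.e. per place in the code that reads a property.
function makeAccessSite(propertyName) {
  var cachedId = null;
  var cachedOffset = -1;
  var site = function (obj) {
    if (obj.structure.id === cachedId) {   // hit: skip the hash lookup
      site.hits++;
      return obj.slots[cachedOffset];
    }
    site.misses++;
    var offset = obj.structure.offsets[propertyName]; // slow path: hash lookup
    if (offset === undefined) return undefined;
    cachedId = obj.structure.id;           // remember for the next access
    cachedOffset = offset;
    return obj.slots[offset];
  };
  site.hits = 0;
  site.misses = 0;
  return site;
}

// Usage: two "point" objects share a structure, so the second read is a hit.
var getX = makeAccessSite('x');
var p1 = makeObject(['x', 'y'], [1, 2]);
var p2 = makeObject(['x', 'y'], [3, 4]);
getX(p1); // miss: does the lookup and fills the cache
getX(p2); // hit: same structure id, cached offset is reused
```

In a real engine the hit path is a handful of machine instructions patched into the generated code; here it is only an integer comparison and an array read, but the control flow is the same.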
3. Context Threaded JIT
Another major change we’ve made with SFX is to introduce native code generation. Our starting point is a technique called a “context threaded interpreter”, which is a bit of a misnomer, because this is actually a simple but effective form of JIT compiler. In the original SquirrelFish announcement, we described our use of direct threading, which is about the fastest form of bytecode interpretation short of generating native code. Context threading takes the next step and introduces some native code generation.
The basic idea of context threading is to convert bytecode to native code, one opcode at a time. Complex opcodes are converted to function calls into the language runtime. Simple opcodes, or in some cases the common fast paths of otherwise complex opcodes, are inlined directly into the native code stream. This has two major advantages. First, the control flow between opcodes is directly exposed to the CPU as straight line code, so much dispatch overhead is removed. Second, many branches that were formerly between opcodes are now inline, and made highly predictable to the CPU’s branch predictor.
Here is a paper describing the basic idea of context threading. Our initial prototype of context threading was created by Gavin Barraclough. Several of us helped him polish it and tune the performance over the past few weeks.
One of the great things about our lightweight JIT is that there’s only about 4,000 lines of code involved in native code generation. All the other code remains cross platform. It’s also surprisingly hackable. If you thought compiling to native code is rocket science, think again. Besides Gavin, most of us have little prior experience with native codegen, but we were able to jump right in.
Currently the code is limited to x86 32-bit, but we plan to refactor and add support for more CPU architectures. CPUs that are not yet supported by the JIT can still use the interpreter. We also think we can get a lot more speedups out of the JIT through techniques such as type specialization, better register allocation and liveness analysis. The SquirrelFish bytecode is a good representation for making many of these kinds of transforms.
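The opcode-at-a-time translation described in this section can be imitated in JavaScript by generating source text instead of machine code. This is only an analogy, assuming a made-up four-opcode register bytecode; the `compile` function, the `runtime` object and the instruction format are all hypothetical and have nothing to do with the real SquirrelFish bytecode. It does show the three ingredients: no dispatch loop, simple opcodes inlined, and a complex opcode with an inlined fast path that falls back to a runtime call.

```javascript
// Stand-in for the language runtime that complex opcodes call into.
var runtime = {
  add: function (a, b) { return a + b; } // generic add: strings, objects, ...
};

// Translate a tiny register bytecode into one straight-line function.
// Each instruction is [opcode, dst, a, b]; registers live in the array r.
function compile(bytecode) {
  var lines = [];
  for (var i = 0; i < bytecode.length; i++) {
    var ins = bytecode[i], dst = ins[1], a = ins[2], b = ins[3];
    switch (ins[0]) {
      case 'load': // simple opcode: inlined directly
        lines.push('r[' + dst + '] = ' + JSON.stringify(a) + ';');
        break;
      case 'mul':  // simple opcode: inlined directly
        lines.push('r[' + dst + '] = r[' + a + '] * r[' + b + '];');
        break;
      case 'add':  // fast path inlined, everything else calls the runtime
        lines.push('r[' + dst + '] = (typeof r[' + a + '] === "number" && ' +
                   'typeof r[' + b + '] === "number") ' +
                   '? r[' + a + '] + r[' + b + '] ' +
                   ': runtime.add(r[' + a + '], r[' + b + ']);');
        break;
      case 'ret':
        lines.push('return r[' + dst + '];');
        break;
      default:
        throw new Error('unknown opcode: ' + ins[0]);
    }
  }
  // The result has no interpreter loop: one statement per opcode, in order.
  return new Function('runtime', 'r', lines.join('\n'));
}

// Usage: (6 * 7) + 'px' takes the runtime path for the add.
var program = compile([
  ['load', 0, 6],
  ['load', 1, 7],
  ['mul', 2, 0, 1],
  ['load', 3, 'px'],
  ['add', 4, 2, 3],
  ['ret', 4]
]);
var result = program(runtime, []); // "42px"
```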
4. Regular Expression JIT
As we built the basic JIT infrastructure for the main JavaScript language, we found that we could easily apply it to regular expressions as well, and get up to a 5x speedup on regular expression matching. So we went ahead and did that. Not all code spends a bunch of time in regexps, but with the speed of our new regular expression engine, WREC (the WebKit Regular Expression Compiler), you can write the kind of text processing code you’d want to do in Perl or Python or Ruby, and do it in JavaScript instead. In fact we believe that in many cases our regular expression engine will beat the highly tuned regexp processing in those other languages.
Since the SunSpider JavaScript benchmark has a fair amount of regexp content, some may feel that developing a regexp JIT is an “unfair” advantage. A year ago, regexp processing was a fairly small part of the test, but JS engines have improved in other areas a lot more than on regexps. For example, most of the individual tests on SunSpider have gotten 5-10x faster in JavaScriptCore — in some cases over 70x faster than the Safari 3.0 version of WebKit. But until recently, regexp performance hadn’t improved much at all.
We thought that making regular expressions fast was a better thing to do than changing the benchmark. A lot of real tasks on the web involve a lot of regexp processing. After all, fundamental tasks on the web, like JSON validation and parsing, depend on regular expressions. And emerging technologies — like John Resig’s processing.js library — extend that dependency ever further.
Major kudos to the entire SFX team for pulling this off. Now, to grab a new nightly…
Friday, September 12th, 2008
Category: Performance
Steve has found a new tidbit that has him excited. The feature at hand comes from Opera:
Primarily for low bandwidth devices, not well-tested on desktop. Ignore script tags until entire document is parsed and rendered, then execute all scripts in order and re-render.
Steve explains how he is a fan of splitting up JavaScript into a small core, and then loading other functionality asynchronously later. This defer gives you some of that benefit, and also groks document.write, which no other technique works with:
One limitation of these techniques is that you can’t use document.write, because when a script is loaded asynchronously the browser has already written the document. Hardcore JavaScript programmers avoid document.write, but it’s still used in the real world most notably, and infamously, by ads. A feature of Opera’s “Delayed Script Execution” option is that, even though scripts are deferred, document.write still works correctly. Opera remembers the script’s location in the page and inserts the document.write output appropriately.
So, this is why Steve is interested to dive deeper and see if this has the performance benefits that make sense theoretically:
One immediate benefit of this Opera preference is that web developers can see the impact of delay-loading their JavaScript. A practice I’m advocating a lot lately is splitting a large JavaScript payload into two pieces, one of which can be loaded using an asynchronous script loading technique. This is often a complex task as the JavaScript payload grows in size and complexity. With this “Delayed Script Execution” feature in Opera, developers can get an idea of how their page would feel before undertaking the heavy lifting.
I’m even more excited about how this shows us what is possible for the future. To be able to have asynchronous script loading and preserve document.write output is like having your cake and eating it too. It’s difficult for users to find this feature in Opera. And it’s beyond the reach of web developers. But if Opera’s “Delayed Script Execution” behavior was the basis for implementing SCRIPT DEFER in all browsers, it would open the door for significant performance improvements by simply adding six characters (”DEFER ”).
This is most significant for the serving of ads. Often ads are served by including a script that contains document.write to load other resources: images, flash, or even another script. Ads are typically placed high in the page, which means today’s pages suffer from slow loading ads because all their content gets blocked. And really, it’s not the pages that suffer, it’s the users. Our experience suffers. Everyone’s experience suffers. If browsers supported an implementation of SCRIPT DEFER that behaved similar to Opera’s “Delayed Script Execution” feature, we’d all be better off.
Food for thought for Safari, Firefox, and IE.
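For reference, one of the asynchronous loading techniques mentioned above, inserting a script element from code, can be sketched like this. The helper name `loadScriptAsync` and the file name `extras.js` are placeholders, and the sketch assumes a browser that fires `onload` on script elements (older versions of IE need a different event). A script loaded this way is exactly the kind that cannot use document.write as intended.

```javascript
// Hypothetical helper: load a script without blocking HTML parsing by
// inserting a <script> element from code. `doc` is the page's document.
function loadScriptAsync(doc, url, onLoad) {
  var script = doc.createElement('script');
  script.src = url;
  if (onLoad) {
    script.onload = onLoad; // assumption: the browser supports onload here
  }
  var head = doc.getElementsByTagName('head')[0];
  head.appendChild(script); // the download starts, parsing continues
  return script;
}

// In a page: ship a small core inline, then pull in the rest later.
// loadScriptAsync(document, 'extras.js', function () {
//   /* initialize the deferred features here */
// });
```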
Category: Performance
Sameer Chabungbam of Microsoft posted about the new JScript profiler, which includes the following functionality:
- Provides performance data for JScript functions in two views:
  - Functions View – a flat listing of all the functions
  - Call Tree view – a hierarchical listing of the functions based on the call flow
- Supports exporting the data to a file
- Provides an inferred name for anonymous functions
- Profiles built-in JScript functions
- Supports multiple profile reports
- Supports profiling across page navigation and refreshes

Eric Pascarello has also been looking at new tools, and wrote up his experience with the Google Chrome Debugger. He details the breakpoint walking functionality as well as the many commands available.
Wednesday, September 3rd, 2008
Category: JavaScript, Performance
Brendan Eich jumped right in and benchmarked the tip of tree for TraceMonkey, with the V8 version that came with Google Chrome:
We win on the bit-banging, string, and regular expression benchmarks. We are around 4x faster at the SunSpider micro-benchmarks than V8.
This graph does show V8 cleaning our clock on a couple of recursion-heavy tests. We have a plan, to trace recursion (not just tail recursion). We simply haven’t had enough hours in the day to get to it, but it’s “next”.
Brendan shows SunSpider running there, and V8 has that and other benchmarks to run too. Isn’t it great when a performance arms race is on? Thank god for competition here. We all win.
Ray Cromwell ran tests himself, on his own app Chronoscope (note, probably NOT using tip of tree TraceMonkey):
Chronoscope is written in GWT, and to some extent, the GWT compiler may negate some of Chrome’s V8 technology in the sense that GWT “de-classes” many OO polymorphic dispatches into a more functional style of programming, removing as much dynamic dispatch as possible, and eliminating prototype lookups and function call overhead through inlining. I don’t know if GWT hurts “hidden classes” or not, but it might be possible that if GWT didn’t provide such optimizations, the performance differential might be larger.
Despite this, the results are still good. The test consisted of calling the chart’s redraw() function 100 times per trial, with 10 trials. The slowest and fastest trial are thrown out, and the mean and standard deviation are calculated on the remaining data.
I tested on a Mac Pro 2.66GHz with 6GB of memory, OS X 10.5. The tests were conducted within a Parallels VM running XP Service Pack 2, given 2 CPUs and 2GB of memory. For each browser, I rebooted the VM from a clean start, and ran only the test browser.
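The trial procedure Ray describes (time a batch of calls per trial, throw out the slowest and fastest trial, then take the mean and standard deviation of the rest) can be sketched as follows. The function names are made up, and since the description does not say which standard deviation was used, the population form is assumed here.

```javascript
// Drop the fastest and slowest trial, then compute mean and standard
// deviation of what remains. Assumption: population standard deviation.
function summarizeTrials(times) {
  if (times.length < 3) {
    throw new Error('need at least 3 trials to drop the extremes');
  }
  var sorted = times.slice().sort(function (a, b) { return a - b; });
  var kept = sorted.slice(1, sorted.length - 1);
  var sum = 0;
  for (var i = 0; i < kept.length; i++) sum += kept[i];
  var mean = sum / kept.length;
  var squares = 0;
  for (var j = 0; j < kept.length; j++) {
    squares += (kept[j] - mean) * (kept[j] - mean);
  }
  return { mean: mean, stdDev: Math.sqrt(squares / kept.length), kept: kept };
}

// Time `callsPerTrial` calls of `fn` per trial, for `trials` trials (in ms).
function runTrials(fn, callsPerTrial, trials) {
  var times = [];
  for (var t = 0; t < trials; t++) {
    var start = Date.now();
    for (var c = 0; c < callsPerTrial; c++) fn();
    times.push(Date.now() - start);
  }
  return summarizeTrials(times);
}

// Matching the setup above would look like:
// runTrials(function () { chart.redraw(); }, 100, 10);
```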
And for a bit of fun, Marc-Andre Cournoyer tied together HotRuby (remember that? the beast that runs YARV code in the browser!) and V8 to create fast Ruby in the browser.
Good times.
Wednesday, August 27th, 2008
Category: Performance
You have seen this before: /path/to/something.js?v=2, or maybe it used a date or a version control id or some such. The idea is to put the version into the URL so you can cache aggressively and yet quickly push new versions.
There have long been issues with using the querystring as the version. At some point I seem to remember Safari not doing a good job caching that scenario and thinking that it was different.
Steve “Neo” Souders has posted about this issue especially as it relates to proxy servers and default configurations:
There’s a section in my book called Revving Filenames. It contains an example of adding a version number to the filename. That’s prompted several emails where people have asked me about tradeoffs around using a querystring versus embedding something in the filename. I wasn’t aware of any performance difference, but in a meeting this week a co-worker, Jacob Hoffman-Andrews, mentioned that Squid, a popular proxy, doesn’t cache resources with a querystring. This hurts performance when multiple users behind a proxy cache request the same file - rather than using the cached version everybody would have to send a request to the origin server.
I tested this by creating two resources, mylogo.1.2.gif and mylogo.gif?v=1.2. Both have a far future Expires date. I configured my browser to go through a Squid proxy. I made one request to mylogo.1.2.gif, cleared my cache (to simulate another user making the request), and fetched mylogo.1.2.gif again. This produces the following HTTP headers:
>> GET http://stevesouders.com/mylogo.1.2.gif HTTP/1.1
<< HTTP/1.0 200 OK
<< Date: Sat, 23 Aug 2008 00:17:22 GMT
<< Expires: Tue, 21 Aug 2018 00:17:22 GMT
<< X-Cache: MISS from someserver.com
<< X-Cache-Lookup: MISS from someserver.com
>> GET http://stevesouders.com/mylogo.1.2.gif HTTP/1.1
<< HTTP/1.0 200 OK
<< Date: Sat, 23 Aug 2008 00:17:22 GMT
<< Expires: Tue, 21 Aug 2018 00:17:22 GMT
<< X-Cache: HIT from someserver.com
<< X-Cache-Lookup: HIT from someserver.com
Notice that the second response shows a HIT in the X-Cache and X-Cache-Lookup headers. This shows it was served by the Squid proxy. More evidence of this is the fact that the Date and Expires response headers have the same values, even though I made these requests 10 seconds apart. For conclusive evidence, only one hit shows up in the stevesouders.com access log.
Loading mylogo.gif?v=1.2 twice (clearing the cache in between) results in these headers:
>> GET http://stevesouders.com/mylogo.gif?v=1.2 HTTP/1.1
<< HTTP/1.0 200 OK
<< Date: Sat, 23 Aug 2008 00:19:34 GMT
<< Expires: Tue, 21 Aug 2018 00:19:34 GMT
<< X-Cache: MISS from someserver.com
<< X-Cache-Lookup: MISS from someserver.com
>> GET http://stevesouders.com/mylogo.gif?v=1.2 HTTP/1.1
<< HTTP/1.0 200 OK
<< Date: Sat, 23 Aug 2008 00:19:47 GMT
<< Expires: Tue, 21 Aug 2018 00:19:47 GMT
<< X-Cache: MISS from someserver.com
<< X-Cache-Lookup: MISS from someserver.com
Here it’s clear the second response was not served by the proxy: the caching response headers say MISS, the Date and Expires values change, and tailing the stevesouders.com access log shows two hits.
Proxy administrators can change the configuration to support caching resources with a querystring, when the caching headers indicate that is appropriate. But the default configuration is what web developers should expect to encounter most frequently. Another interesting note about these tests: notice how the proxy downgrades the responses to HTTP/1.0. This is going to alter browser behavior in terms of the number of connections that are opened. When I’m doing performance analysis I make sure to avoid being connected through a proxy.
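If you want to move from the querystring style to the filename style used in the test above, the renaming itself is simple. Here is a minimal sketch; `revFilename` is a made-up helper, and it only builds the URL. Making the server actually respond to the versioned name (by renaming files at build time or rewriting requests) is a separate step that is not shown.

```javascript
// Hypothetical helper: embed the version in the filename rather than the
// querystring, e.g. "mylogo.gif" + "1.2" -> "mylogo.1.2.gif".
function revFilename(path, version) {
  var slash = path.lastIndexOf('/');
  var dot = path.lastIndexOf('.');
  if (dot <= slash + 1) {
    // No extension in the last path segment: just append the version.
    return path + '.' + version;
  }
  return path.slice(0, dot) + '.' + version + path.slice(dot);
}

revFilename('mylogo.gif', '1.2');      // "mylogo.1.2.gif"
revFilename('/img/mylogo.gif', '1.2'); // "/img/mylogo.1.2.gif"
```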
Monday, August 25th, 2008
Category: Performance
Razor Profiler is a web-based Ajax profiling tool to help web developers understand and analyze the runtime behavior of their JavaScript code in a cross-browser environment. Razor Profiler can be accessed online as a service or downloaded to run locally. It was created by Coach Wei, who has done a lot of work for Nexaweb and Apache.
Razor Profiler Features
Razor Profiler automates JavaScript profiling:
- Automation: no application code change required. Razor Profiler automatically collects all the necessary data and presents them to web developers for analysis.
- Runs on any browser: web developers can profile any JavaScript application on any browser. There is nothing to install on the client side.
- Rich lexical analysis: Razor Profiler presents rich lexical information about the application, such as file information (number, response status, size, mimetype, percentage, etc.), tokens (size, file, percent, count), and functions (size, file, name…).
- Profile scenario recording: Razor Profiler enables web developers to selectively record the scenarios that they are interested in. Only recorded scenarios will be used in analysis.
- Call stack analysis: for each recorded scenario, Razor Profiler presents all the call stacks in the order of their occurrence. For each call stack, web developers can drill into it to find out the duration of the stack, all the function calls of this stack and the duration of each call.
- Function analysis: For each JavaScript function in the application, Razor Profiler presents the number of times it has been invoked, the duration of each invocation, and the call stacks that invoked this function.
- Data visualization with graphing and charting: Razor Profiler presents top call stacks, top function calls of each stack, top recorded scenarios, etc. using visual charts and graphs to help web developers better understand the runtime behavior of their application. For example, each call stack is visualized as an intuitive Gantt chart.
How Does Razor Profiler Work?
Razor Profiler consists of a server component that runs inside a standard Java EE Servlet engine, and a JavaScript-based client component that runs inside any browser. Once you have the Razor server started, you can profile your JavaScript application by entering the start URL of your application into Razor Profiler and running through your test scenarios. Razor Profiler will automatically record data and visualize it for your analysis. There is no client side installation, browser configuration change or application code change required. In order to achieve this, Razor Profiler goes through five different phases:
- Application retrieval: Once a web developer enters the application start URL into Razor Profiler, the Razor Profiler client component (“the client”) will send this URL to the Razor Profiler server component (“the server”). The server performs the actual retrieval of this URL. After additional server processing (such as lexical analysis and code injection, see below), the retrieved content is sent to the client side to be displayed in a new browser window. From the developer’s point of view, the application is launched and running in this new browser window.
In this process, Razor Profiler Server is acting like a “proxy server”. But it is not really a “proxy server” and there is no need for developers to re-configure their browser proxy settings.
- Lexical analysis: Once the server retrieves the application URL, it performs lexical analysis of the returned content by identifying and analyzing JavaScript files, functions, tokens, etc. The result is sent to the client for display.
- Code injection: Upon lexical analysis of JavaScript code, the server injects “probe” code into the application’s JavaScript sources before returning them to the client. These injected “probes” enable automatic collection of application runtime data, and save developers from doing so manually.
- Runtime data capture: Once the application’s JavaScript code is running on the client side and as developers run through desired profile scenarios, the injected “probes” automatically collect all the necessary data and pass it to the Razor Profiler client.
- Data analysis: When the developer finishes recording scenarios and starts data analysis, Razor Profiler client performs analysis of all the collected data and presents the results.
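To give a feel for what a “probe” does, here is a minimal sketch that wraps functions at runtime so that each call records its name and duration. Note that this is a different mechanism from the one described above: Razor Profiler rewrites the JavaScript source on the server, whereas this sketch wraps already-loaded functions in the page. The names `instrument` and `instrumentAll` are invented for illustration.

```javascript
// Hypothetical probe: wrap a function so that every call appends
// { name, duration } (milliseconds) to `log`.
function instrument(name, fn, log) {
  return function () {
    var start = Date.now();
    try {
      return fn.apply(this, arguments); // preserve `this` and arguments
    } finally {
      // Runs even if fn throws, so failed calls are recorded too.
      log.push({ name: name, duration: Date.now() - start });
    }
  };
}

// Wrap every function-valued own property of an object, for example an
// application namespace.
function instrumentAll(obj, log) {
  for (var key in obj) {
    if (Object.prototype.hasOwnProperty.call(obj, key) &&
        typeof obj[key] === 'function') {
      obj[key] = instrument(key, obj[key], log);
    }
  }
}
```

Source rewriting can reach anonymous and inner functions that a runtime wrapper like this never sees, which is one reason a tool might prefer the injection approach.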

Thursday, August 21st, 2008
Category: Performance

Patrick Meenan has set up an IE7 instance in Virginia that we can poke to do an AOL Page Test.
You give it a URL and some options such as the number of runs, whether to see the first and repeat views, and off it runs.
When finished you get to see the results which give you high level data on load times, waterfall graphs, an optimization check list, and a screenshot of what the browser saw.
If the waterfall is hard to read, send it to Steve Souders. He reads them like Neo reads the Matrix :)
Tuesday, August 12th, 2008
Category: JavaScript, Mobile, Performance, iPhone
HTML:
<script>
function recurse(n) {
  if (n > 0) {
    return recurse(n - 1);
  }
  return 0;
}

try {
  // recurse(43687); // Highest that works for me in WebKit
  //                 // nightly builds as of 24 Jul 2008.
  // recurse(2999);  // Highest that works for me in Firefox 3.0.1
  // recurse(499);   // Highest that works for me in Safari 3.1.2
  recurse(3000);
  document.write("Could be SquirrelFish.");
} catch(e) {
  document.write("Not SquirrelFish.");
}
</script>
This is the hack that John Gruber used to test whether iPhone 2.x had snuck in SquirrelFish. He was curious due to the performance improvements that he witnessed:

What about iPhone limits though? David Golightly tests the limits on the iPhone with a script that keeps downloading tiles until it can no longer do so:
After downloading about 210 images, the iPhone simply stops downloading new ones. This is probably due to hitting the hard 30MB same-page resource limit.
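Going back to the recursion test at the top of this post: its thresholds were found by hand for each browser. A sketch that measures the limit directly, assuming the engine reports stack exhaustion as a catchable exception (the helper name is made up):

```javascript
// Hypothetical helper: recurse until the engine gives up and report how
// deep we got. The number depends on the engine and on the frame size.
function maxRecursionDepth() {
  var depth = 0;
  function recurse() {
    depth++;
    recurse();
  }
  try {
    recurse();
  } catch (e) {
    // Stack exhausted; the count reached so far is the answer.
  }
  return depth;
}
```

The absolute number is only comparable between engines for the same function, since bigger stack frames mean fewer levels.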
Wednesday, July 30th, 2008
Category: Performance, jQuery
Stuart Colville has found an issue where he needed to output some JavaScript in the middle of a page, before a library that depended on it was available:
The 6th Rule in Yahoo’s Performance Rules recommends placing script before the closing body tag to prevent it from holding up the rendering of the page’s content. This works well but there are times where script needs to be output higher up in the page than its dependencies.
In this example I’m using jQuery but feel free to substitute your favorite framework for jQuery.
The requirement is that there’s a need to run some code that would ideally use jQuery somewhere in the middle of the page. I could avoid the dependency and re-write everything without jQuery and for simple scripts this can be a good way to go. But, if I want to use some of the more complex jQuery features, then I really don’t want to have to re-invent the wheel or resort to including jQuery in the head of the document.
This led him to the following example:
HTML:
<script>
var muffin = muffin || {};
muffin.inline = muffin.inline || [];
muffin.inline.add = function(f){
  muffin.inline[muffin.inline.length] = f;
};
</script>

<script>
muffin.inline.add(function(){
  $('#green')[0].style.backgroundColor = 'green';
});
muffin.inline.add(function(){
  $('#red')[0].style.backgroundColor = 'red';
});
</script>

<div id="red"><p>This should be Red</p></div>
<div id="green"><p>This should be Green</p></div>

<script src="http://ajax.googleapis.com/ajax/libs/jquery/1/jquery.js"></script>
<script>
$(function(){
  if (muffin && muffin.inline){
    for (var i=0, j=muffin.inline.length; i<j ; i++){
      muffin.inline[i]();
    }
  }
});
</script>
This seems a little niche. You may run into this when you have server side components outputting things, but ideally you can fix that in your architecture and ship the JavaScript in the correct location.
Monday, July 28th, 2008
Category: Browsers, JavaScript, Performance
Gregory Reimer, frontend engineer for sun.com, has written a barrage of tests to answer the question What's the Fastest Way to Code a Loop in JavaScript? specifically for large data sets:
I built a loop benchmarking test suite for different ways of coding loops in JavaScript. There are a few of these out there already, but I didn't find any that acknowledged the difference between native arrays and HTML collections. Since the underlying implementations are different (HTML collections for example lack the pop() and slice() methods, etc), benchmarks that don't test against both are probably missing important information.
My suspicions were confirmed. Accessing the length property is more expensive on HTML collections than on arrays, depending on the browser. In those cases, caching it made a huge difference. However, HTML collections are live, so a cached value may fail if the underlying DOM is modified during looping. On the other hand, HTML collections will never be sparse, so the best way to loop an HTML collection might just be to ignore the length property altogether and combine the test with the item lookup, since you have to do that anyway:
JAVASCRIPT:
// looping a dom html collection
for (var i=0, node; node = hColl[i++];) {
  // do something with node
}
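The combined test-and-lookup idiom above works because every item in an HTML collection is a node, and so truthy. It is worth seeing what happens if you reuse it on an ordinary array that contains a falsy value or a hole. The two counting functions below are illustrative only:

```javascript
// Counts items using the combined test-and-lookup loop.
// Stops at the first falsy item, not necessarily at the end.
function countWithCombinedTest(list) {
  var count = 0;
  for (var i = 0, item; item = list[i++];) {
    count++;
  }
  return count;
}

// Counts items using a cached length; visits every index.
function countWithCachedLength(list) {
  var count = 0;
  for (var i = 0, len = list.length; i < len; i++) {
    count++;
  }
  return count;
}

countWithCombinedTest(['a', 'b', 'c']); // 3
countWithCombinedTest(['a', 0, 'c']);   // 1, the loop stops at the 0
countWithCachedLength(['a', 0, 'c']);   // 3
```

So the idiom is a good fit for node collections and a trap for general arrays.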
If you take a look at the results you will see that in general, reverse while loops are the fastest way to iterate a basic collection, e.g.:
JAVASCRIPT:
var i = arr.length; while (i--) {}
Take a peek at the test suite.
Friday, July 25th, 2008
Category: Debugging, Performance

Steve Souders gave a talk at OSCON yesterday where he demonstrated the new Firebug Lite 1.2.
Today Firebug Lite 1.2 was released. This new version was built by Azer Koçulu, creator of pi.debugger. Azer joined the Firebug Working Group, morphed the GUI to look like Firebug, and added it to the Firebug code base.
Firebug Lite is a subset of Firebug that can be used in IE, Opera, and Safari. The previous version provided console.log functionality. In Firebug Lite 1.2, Azer added the ability to inspect DOM elements, track XHRs, and navigate HTML, CSS, and JavaScript. You can embed it in your pages and enable debugging. I prefer creating a Firebug Lite bookmarklet that I can launch on any web page. Instructions and more information are available on the main Firebug Lite page.
If you like a little Firebug love when you debug non-Firefox browsers, check out the very much improved version!