Tuesday, August 7th, 2007

HTML 5 Parser Optimizations

Category: HTML

Ian Hickson likes to get practical. He was able to run reports on roughly ten billion documents in the Google index, and used the data to give real advice to HTML parser implementors.

As always, it is interesting to see what real-world data throws at you.

The three sets of data that I posted are all derived from parsing several billion documents from Google’s Web search index using a parser I wrote in Sawzall.

The first set of data gives the relative aggregate distribution of invocations of the “in head”, “in body”, and “in table” insertion modes, for each of the insertion modes. This allows implementors to determine, for instance, that invoking the “in body” code while in a cell must be very efficient, while invoking the “in body” code from the “after frameset” code need not be as efficient, in case the implementor has a strategy that optimises one at the cost of another. See: documentation, data.

The second set of data gives the relative aggregate distribution of tokens for each phase/insertion mode pair. This can help implementors who are using a cascade of if statements decide on the right order for their checks. For instance, the most common token type seen in the “in body” insertion mode is character data, and the second most common token is the start tag token for an a element, but the isindex start tag was almost never seen. This tells implementors that they should check for characters and a start tags long before checking for isindex tags. See: documentation, data.

The last set of data examines the number of attributes per element. It allows implementors to decide on the optimum memory allocation strategy for attributes. For example, since most elements have 9 or fewer attributes, the data structure that stores attributes can be optimised for simply having 9 attributes, using little memory, and if an element has more than this number of attributes, the implementation can switch to a separate implementation that is more memory-heavy but is optimised for large numbers of attributes. See: data.
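A small-buffer strategy along those lines might look like the sketch below. All names here (AttributeList, kInlineCapacity, and so on) are illustrative assumptions, not from any real parser: nine attribute slots live inline in the element, so the common case never touches the heap, and only elements beyond that spill into a growable overflow vector.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

struct Attribute {
    std::string name;
    std::string value;
};

// Small-buffer attribute storage: the first kInlineCapacity attributes
// are stored inline; anything beyond that goes to a heap-backed vector.
class AttributeList {
    static const std::size_t kInlineCapacity = 9;
    Attribute inline_[kInlineCapacity];  // common case, no extra allocation
    std::size_t count_ = 0;
    std::vector<Attribute> overflow_;    // rare case: 10+ attributes

public:
    void add(std::string name, std::string value) {
        if (count_ < kInlineCapacity)
            inline_[count_] = {std::move(name), std::move(value)};
        else
            overflow_.push_back({std::move(name), std::move(value)});
        ++count_;
    }

    std::size_t size() const { return count_; }

    const Attribute& at(std::size_t i) const {
        return i < kInlineCapacity ? inline_[i]
                                   : overflow_[i - kInlineCapacity];
    }
};
```

The trade-off is the classic small-size optimisation: a little wasted inline space on sparse elements in exchange for skipping an allocation on nearly every element parsed.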

Posted by Dion Almaer at 6:14 pm




Who did what in the where now?

Comment by Hehrold — August 7, 2007

Ian Hickson analyzed a large sample of real data to propose optimization strategies to HTML 5 parser implementors. Basically, he ran three analyses and calculated statistics that might make modern parsers faster.

Let’s take the second set of data as an example: if you know that the “a” element is the most frequent start tag in the “in body” insertion mode, you can make your parser more efficient by checking for that element right at the start instead of in some other order, such as lexicographically ascending. By carefully re-arranging the order of the if checks, you might save several CPU cycles in the parsing of HTML documents.

Finally, the third set is a memory optimization strategy: if you assume that most elements have at most 9 attributes, you can change your parser’s data structure accordingly to run faster for the most common case and slower for the less common one. You might allocate nine fixed pointers and hope for the best, instead of initializing a list of pointers, which is slower but works in the general case.

Comment by Laurent Haan — August 8, 2007

Whoa. Strangely, that makes sense.

Comment by ch__ — August 8, 2007

Great summary Laurent.

Comment by Rob Di Marco — August 8, 2007


