Tuesday, August 7th, 2007
HTML 5 Parser Optimizations
<>p>Ian Hickson likes to get practical. He was able to run some reports on ~ten billion documents in the Google index, and used the data to be able to give real advice to HTML parser implementors.
As always, it is always interesting to see what real world data throws out at you.
The three sets of data that I posted are all derived from parsing several billion documents from Google’s Web search index using a parser I wrote in Sawzall.
The first set of data gives the relative aggregate distribution of invocations of the “in head”, “in body”, and “in table” insertion modes, for each of the insertion modes. This allows implementors to determine, for instance, that invoking the “in body” code while in a cell must be very efficient, while invoking the “in body” code from the “after frameset” code need not be as efficient, in case the implementor has a strategy that optimises one at the cost of another. See: documentation, data.
The second set of data gives the relative aggregate distribution of tokens for each phase/insertion mode pair. This can help implementors that are using a cascade of
ifstatements decide on the right order for their statements. For instance, the most common token type seen in the “in body” insertion mode is character data, and the second most token is the start tag token for anaelement, but theisindexstart tag was almost never seen. This tells implementors that they should check for characters andastart tags long before checking forisindextags. See: documentation, data.The last set of data examines the number of attributes per element. It allows implementors to decide on the optimum memory allocation strategy for attributes. For example, since most elements have 9 or fewer attributes, the data structure that stores attributes can be optimised for simply having 9 attributes, using little memory, and if an element has more than this number of attributes, the implementation can switch to a separate implementation that is more memory-heaving but is optimised for large numbers of attributes. See: data.
Related Content:











Who did what in the where now?
Ian Hickson analyzed a large sample of real data to propose optimization strategies to HTML 5 parser implementations. Basically, he runs three tests and calculates statistics that might make modern parsers faster.
Let’s take the second set of data as an example: If you know that the “a” element is the most frequent element in the “in body” insertion mode, you could make your parser more efficient by checking for the existence of that element right at the start instead of another order, such as lexicographically ascending. By carefully re-arranging the order of the if validations, you might gain several CPU cycles in the parsing of HTML documents.
Finally, the third set is a memory optimization strategy: if you assume that most elements have at most 9 attributes, you can change your data structure in the parser language accordingly to run faster for the most common case and less faster for the not so common case. You might allocate 9 fix pointers and hope for the best instead of initializing a list of pointers which is slower but works in the general case.
Whoa. Strangely, that makes sense.
Great summary Laurent.
Post advertisement at http://fivq.com/ . There you can offer your services, find gigs (short term jobs) or long term jobs. It is good place who would like to work remotely. Also there you can promote your website, your services. Between its absolutely free to post and no registration needed.
An article worth reading, and we hope that permeated the interest of all.
Thank you, based on this site