Tuesday, August 7th, 2007
Ian Hickson likes to get practical. He ran some reports on roughly ten billion documents in the Google index, and used the data to give real advice to HTML parser implementors.
As always, it is interesting to see what real-world data throws at you.
The first set of data gives the relative aggregate distribution of invocations of the “in head”, “in body”, and “in table” insertion modes, for each of the insertion modes. This allows implementors to determine, for instance, that invoking the “in body” code while in a cell must be very efficient, while invoking the “in body” code from the “after frameset” code need not be as efficient, in case the implementor has a strategy that optimises one at the cost of another. See: documentation, data.
The second set of data gives the relative aggregate distribution of tokens for each phase/insertion mode pair. This can help implementors that are using a cascade of if statements decide on the right order for their statements. For instance, the most common token type seen in the “in body” insertion mode is character data, and the second most common token is the start tag token for an a element, but the isindex start tag was almost never seen. This tells implementors that they should check for characters and a start tags long before checking for isindex tags. See: documentation, data.
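To make the ordering advice concrete, here is a minimal sketch of an if cascade for the “in body” insertion mode, with the checks arranged by how often the data says each token appears. The Token type, the handle_in_body function, and the handler names it returns are all hypothetical, made up for illustration; they are not taken from any real parser or from the HTML5 spec text.

```python
from typing import NamedTuple


class Token(NamedTuple):
    """Hypothetical token shape, assumed for this sketch only."""
    kind: str        # e.g. "characters", "start_tag", "end_tag"
    name: str = ""   # tag name for tags, empty for character data


def handle_in_body(token: Token) -> str:
    """Return the name of the (made-up) handler chosen for a token."""
    # Most common case in the data: character data, so test it first.
    if token.kind == "characters":
        return "insert_characters"
    # Second most common: the start tag of an a element.
    if token.kind == "start_tag" and token.name == "a":
        return "start_a"
    # ... checks for other moderately common tokens would go here ...
    # Almost never seen, so it sits at the bottom of the cascade.
    if token.kind == "start_tag" and token.name == "isindex":
        return "start_isindex"
    return "generic"


print(handle_in_body(Token("characters")))
print(handle_in_body(Token("start_tag", "a")))
print(handle_in_body(Token("start_tag", "isindex")))
```

With this ordering, the common tokens exit after one or two comparisons, and only the rare isindex case pays for walking the whole cascade.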
The last set of data examines the number of attributes per element. It allows implementors to decide on the optimum memory allocation strategy for attributes. For example, since most elements have 9 or fewer attributes, the data structure that stores attributes can be optimised for holding up to 9 attributes while using little memory. If an element has more than this number of attributes, the implementation can switch to a separate structure that is more memory-heavy but is optimised for large numbers of attributes. See: data.
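The two-tier strategy described above could look something like the following sketch: a compact list of pairs with linear search for the common case, and a switch to a hash map once the threshold is passed. AttributeStore and SMALL_LIMIT are hypothetical names, and the threshold simply reuses the figure of 9 quoted above; a real parser would pick its own layout and cutoff.

```python
SMALL_LIMIT = 9  # assumed threshold, taken from the figure quoted above


class AttributeStore:
    """Illustrative attribute container that starts small and upgrades."""

    def __init__(self):
        self._small = []    # list of (name, value) pairs, searched linearly
        self._large = None  # dict, created only when the limit is exceeded

    def set(self, name, value):
        if self._large is not None:
            self._large[name] = value
            return
        # Overwrite an existing attribute in the small list if present.
        for i, (existing, _) in enumerate(self._small):
            if existing == name:
                self._small[i] = (name, value)
                return
        if len(self._small) < SMALL_LIMIT:
            self._small.append((name, value))
        else:
            # Too many attributes: move everything into a dict.
            self._large = dict(self._small)
            self._large[name] = value
            self._small = []

    def get(self, name, default=None):
        if self._large is not None:
            return self._large.get(name, default)
        for existing, value in self._small:
            if existing == name:
                return value
        return default

    def __len__(self):
        if self._large is not None:
            return len(self._large)
        return len(self._small)

    @property
    def uses_large(self):
        return self._large is not None
```

For instance, setting nine attributes leaves uses_large as False, and adding a tenth distinct attribute flips it to True while keeping every earlier value retrievable through get.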
Posted by Dion Almaer at 6:14 pm