Wednesday, July 8th, 2009

HTML 5 Parser Lands in Gecko

Category: HTML

John Resig has blogged about HTML 5 parsing and the news that Henri Sivonen (the chap who did the HTML 5 validator) has landed a massive commit to the trunk of Firefox that includes an HTML 5 parser.

The method is quite interesting:

What’s interesting about this particular implementation is that it’s actually an automated conversion of Henri’s Java HTML 5 parser to C++. This conversion happens automatically and changes will be pushed upstream to the Mozilla codebase.

Normally I would balk at the mention of a wholesale, programmatic, conversion of a Java codebase over to C++ but the results have been very surprising: A 3% boost in pageload performance.

And this is on top of the litany of bug fixes and compliance checks that this code base will be providing. You can examine some of the progress that went into constructing the patch in the Mozilla bug.

If you’re interested in giving the new parser a try (it’s doubtful that you’ll see many obvious changes – but any help in hunting down bugs would be appreciated) you can download a nightly of Firefox, open about:config, and set html5.enable to true.

For extra fun, throw in some inline SVG and see it just work! Bye bye namespaces!

Pithy HTML5/XHTML comments

Dan Morrill (of Android and formerly GWT fame, and an all-round good guy) had some funny remarks on the XHTML/HTML5 kerfuffle:

An exercise: I can easily summarize HTML5 in a single Tweet. I can’t think of a way to do that for XHTML. “HTML5 codifies existing behaviors and is a practitioner’s roadmap for the future of browser capabilities.”

This “death of XHTML” meme is awesome, it’s soooo easy to bust out with pithy zingers.

Here’s one: “The web *itself* is content soup, why should we expect HTML to be more than tag soup?”

Another: “XHTML was the Edsel of the web: painstakingly designed, proudly touted, and utterly missing the point.”

“They finally closed the tag on XHTML, and now the web is validated.”

Posted by Dion Almaer at 6:27 am




Another: “XHTML was the Edsel of the web: painstakingly designed, proudly touted, and utterly missing the point.”
That’s horrible. I’ve waited years for XHTML to reach a point in browser implementation where I can have an XML file with CSS/JavaScript attached, and now nothing?!? If HTML5 is the way to go, then the specs for the next 10 years should be a recommendation already, if we are to learn anything from history.

I think whoever wrote the quote is missing the point!

Comment by dotnetCarpenter — July 8, 2009

Hurray! We’re good for another 15 years of broken document structure and a lack of basic features such as namespaces and extensibility.
Thank you, WHAT-WG, for hearing our needs and making sure that nothing gets fixed.

Comment by ywg — July 8, 2009

So does this mean everyone has realised that the only time a website would ever need to be “semantic” is if someone is stupid enough to parse it and use a full DOM rather than a regex?

Comment by Darkimmortal — July 8, 2009

Hi Darkimmortal,

Actually there are several reasons why semantic markup is useful, and you should use it for your sites:

1. People with different abilities cannot always surf the web with a browser and a mouse. They use adaptive technologies such as screen readers which provide a significantly better experience to that user when your markup is semantic.

2. Semantic markup is more easily re-styled for different media/formats. You can use the same semantic markup for a desktop browser version of your page, a mobile version of your page, and a printed version of your page with just a little CSS work.

3. Search engines reward you for semantic markup. Semantic markup improves your SEO.

4. The future of the web will involve a lot of systems where computers talk to other computers and where web apps automagically read web pages. Computers are shockingly stupid at understanding language and meaning. Defining your markup in a semantic way — especially if we all agree on a way to do it — will allow for programmers to create web app mash-ups and rich search apps and all sorts of cool things with much less effort because computers will have an easier time finding the right data. Regex is a nice way to do quick and dirty screen scraping… for a particular site… when you know the structure of their markup. It is not so hot when you want to find a particular type of information for *any* site (e.g. good luck writing a regex to find the hours a business is open for any arbitrary business’s website.)
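To make the regex point in item 4 concrete, here is a minimal Python sketch. The two page snippets and the helper names (`HeadingCollector`, `headings_via_parser`, `headings_via_regex`) are made up for illustration, and the standard library's `html.parser` stands in for whatever parser a real scraper would use. It shows a quick-and-dirty pattern that works on one site's markup and silently finds nothing on another, while reading the heading element through a parser handles both.

```python
import re
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect the text of every <h1> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names before calling this hook.
        if tag == "h1":
            self.in_h1 = True
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False

    def handle_data(self, data):
        if self.in_h1:
            self.headings[-1] += data


def headings_via_parser(markup):
    collector = HeadingCollector()
    collector.feed(markup)
    collector.close()
    return [text.strip() for text in collector.headings]


def headings_via_regex(markup):
    # Quick-and-dirty pattern: it only matches one exact shape of markup.
    return re.findall(r"<h1>(.*?)</h1>", markup)


# Two invented pages that mark up the same heading slightly differently.
SITE_A = "<h1>Opening Hours</h1><p>9 to 5</p>"
SITE_B = '<H1 class="title">\n  Opening Hours\n</H1><div>9 to 5</div>'

for name, page in (("A", SITE_A), ("B", SITE_B)):
    print(name, headings_via_regex(page), headings_via_parser(page))
```

The regex could of course be patched for the second page, but every new site tends to need another patch, whereas the element name carries the same meaning everywhere.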

Also, Darkimmortal, I think you may be a little confused since HTML5 actually deliberately adds many semantic features.

Oh, and since you seem a little confused, let’s make sure we’re talking about the same thing: when people talk about a “semantic” website, what they are actually referring to is the markup underlying that website’s pages.

And, without getting into it too much, “semantic markup” means that the markup imparts information about the content it wraps. So for example, My Heading is semantic because the tag tells me that “My Heading” is a level 1 heading. Whereas, My Heading is not semantic because it tells me nothing about “My Heading.”

I hope this helps.

Comment by zachstronaut — July 8, 2009

My tags were stripped… the examples were [h1]My Heading[/h1] and [div]My Heading[/div].

Comment by zachstronaut — July 8, 2009


Comment by eyelidlessness — July 8, 2009

WHAT?7 A hand-written parser in the year 2009 AD?

These people are mad.

Comment by chiaroscuro — July 8, 2009

@ywg: it’s a pity you try to write witty comments without even knowing the subject.
Let’s see what HTML5 fixes:

It has a far more complete parsing algorithm. That means browsers implementing HTML5 will agree on how to treat a given document and will produce a consistent DOM even if the markup is broken. All the broken HTML that was produced in your 15 years is not going anywhere, and HTML5 prescribes how to deal with this content in a predictable manner.

It actually lets you define document structure better, thanks to the new elements and the fixed heading scheme.

HTML5 did not abolish XML syntax – it has XML serialization if you need one.

It allows you to include SVG or MathML even in the HTML serialization.

It has the support of Apple, Mozilla, and Opera.
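The difference between strict XML handling and tolerant HTML handling can be seen in a few lines of Python. This is only a sketch under stated assumptions: the `BROKEN` snippet and the helper names are invented, and the standard library's `html.parser` is merely a lenient tokenizer. It does not build a DOM and does not implement the HTML5 tree-construction algorithm, so it illustrates the "keep going on broken markup" behaviour rather than the spec itself.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Invented tag soup: unclosed <p> elements and misnested <b>/<i>.
BROKEN = "<p>First paragraph<p>Second <b>bold <i>both</b> italic?</i>"


class TagLogger(HTMLParser):
    """Record start and end tags in the order they are seen (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def parse_as_xml(markup):
    # An XML parser must stop at the first well-formedness error.
    try:
        ET.fromstring(markup)
        return "ok"
    except ET.ParseError:
        return "rejected"


def parse_as_tag_soup(markup):
    # A lenient HTML tokenizer reports what it sees and carries on.
    logger = TagLogger()
    logger.feed(markup)
    logger.close()
    return logger.events


print(parse_as_xml(BROKEN))
print(parse_as_tag_soup(BROKEN))
```

What HTML5 adds on top of this leniency is a written rule for how each such error is repaired, so that every conforming browser ends up with the same tree instead of its own guess.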

Of course, you still have the right to think that XHTML2, a pipe-dream spec with zero interest among browser vendors and incompatible with the existing web, would actually have fixed anything.

Comment by Rimantas — July 9, 2009

I like how tagsoup is now a feature, and as such was totally in scope of HTML 5 to specify how to handle it, so it goes to extraordinary lengths to specify how to parse it consistently into the DOM.

Contrast that to XHTML, where serving it without the XML MIME type, necessary because IE won’t implement standards, has a total of half a dozen issues, the most important being that browsers treat it as tagsoup, but that is considered a show-stopper by the HTML 5 spec lead.

But let’s all rag on XHTML now that through massive effort the faults of HTML might be fixed 10 years from now when HTML5 gets universally adopted.

Comment by JonathanLeech — July 9, 2009

Of course tag soup is a feature. Most of the content of the web is made by people who lack the knowledge to write validating XHTML. What’s more realistic? Modifying the technology a little bit so these people can keep doing what they do without sabotaging the rest of us, or re-educating half the internet? The incredible cost of educating all those people is very wasteful, and can be much better spent on things that actually improve the internet (like putting up more content).

Comment by Joeri — July 10, 2009
