Monday, May 5th, 2008
HTML Parser in JavaScript
John must have had some downtime on Sunday afternoon, as he implemented an HTML parser in JavaScript. The library, which you can play with via this demo, lets you attack HTML in a few ways:
A SAX-style API
Handles tags, text, and comments with callbacks. For example, say you wanted to implement a simple HTML-to-XML serialization scheme. You could do so using the following:
JAVASCRIPT:
var results = "";
HTMLParser("<p id=test>hello <i>world", {
  start: function( tag, attrs, unary ) {
    results += "<" + tag;
    for ( var i = 0; i < attrs.length; i++ )
      results += " " + attrs[i].name + '="' + attrs[i].escaped + '"';
    results += (unary ? "/" : "") + ">";
  },
  end: function( tag ) {
    // Emit the closing tag, including any the parser closes implicitly
    results += "</" + tag + ">";
  },
  chars: function( text ) {
    results += text;
  },
  comment: function( text ) {
    results += "<!--" + text + "-->";
  }
});
results == '<p id="test">hello <i>world</i></p>'
XML Serializer
Now, there's no need to worry about implementing the above, since it's included directly in the library, as well. Just feed in HTML and it spits back an XML string.
JAVASCRIPT:
var results = HTMLtoXML("<p>Data: <input disabled/>")
results == '<p>Data: <input disabled="disabled"/></p>'
DOM Builder
If you're using the HTML parser to inject into an existing DOM document (or within an existing DOM element) then htmlparser.js provides a simple method for handling that:
JAVASCRIPT:
// The following is appended into the document body
HTMLtoDOM("<p>Hello <b>World", document)
// The following is appended into the specified element
HTMLtoDOM("<p>Hello <b>World", document.getElementById("test"))
DOM Document Creator
This is a more-advanced version of the DOM builder - it includes logic for handling the overall structure of a web page, returning a new DOM document.
A couple of points are enforced by this method:
- There will always be an html, head, body, and title element.
- There will only be one html, head, body, and title element (if the user specifies more, they will be moved to the appropriate locations and merged).
- link and base elements are forced into the head.
You would use the method like so:
JAVASCRIPT:
var dom = HTMLtoDOM("<p>Data: <input disabled/>");
dom.getElementsByTagName("body").length == 1
dom.getElementsByTagName("p").length == 1
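The structural rules listed above can also be checked mechanically. What follows is only a sketch: `checkStructure` is a hypothetical helper, not part of htmlparser.js, and it assumes nothing beyond a document-like object whose nodes support `getElementsByTagName`.

```javascript
// Hypothetical helper (not part of htmlparser.js): checks the structural
// rules described above on any DOM-like document object.
function checkStructure(doc) {
  var required = ["html", "head", "body", "title"];
  var problems = [];

  // Each structural element must appear exactly once.
  for (var i = 0; i < required.length; i++) {
    var count = doc.getElementsByTagName(required[i]).length;
    if (count !== 1) {
      problems.push(required[i] + ": expected 1, found " + count);
    }
  }

  // Every link and base element must live inside the head.
  var heads = doc.getElementsByTagName("head");
  if (heads.length === 1) {
    var forced = ["link", "base"];
    for (var j = 0; j < forced.length; j++) {
      var total = doc.getElementsByTagName(forced[j]).length;
      var inHead = heads[0].getElementsByTagName(forced[j]).length;
      if (total !== inHead) {
        problems.push(forced[j] + ": " + (total - inHead) + " outside head");
      }
    }
  }

  return problems; // an empty array means every rule holds
}
```

If the document creator behaves as described, passing its result to `checkStructure` should yield an empty array; any entries name the rule that was broken.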
One place where you could use this API is on the server side, for example with Aptana Jaxer. You could also interface directly with Java, or just use the Mozilla utilities directly.
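For processing that does not need a DOM at all, the SAX-style callbacks may be enough on their own. Here is a rough sketch, assuming the handler signatures and the `name`/`escaped` attribute fields shown in the serializer example; `makeCollector` is a hypothetical helper, not something the library provides.

```javascript
// Hypothetical helper (not part of htmlparser.js): builds a handler object
// in the shape the SAX-style API accepts, collecting text and link targets.
function makeCollector() {
  var state = { text: "", links: [] };

  state.handlers = {
    start: function (tag, attrs, unary) {
      if (tag.toLowerCase() !== "a") return;
      // Attributes are assumed to arrive as objects with `name` and
      // `escaped` fields, as in the serializer example.
      for (var i = 0; i < attrs.length; i++) {
        if (attrs[i].name === "href") state.links.push(attrs[i].escaped);
      }
    },
    end: function (tag) {},            // nothing to do on close
    chars: function (text) { state.text += text; },
    comment: function (text) {}        // comments are ignored
  };

  return state;
}

var collector = makeCollector();

// Only call the parser when htmlparser.js has actually been loaded.
if (typeof HTMLParser === "function") {
  HTMLParser("<p>See <a href='docs.html'>the docs", collector.handlers);
}
```

Keeping the accumulated state next to the handlers means each parse gets a fresh collector, and the same handler shape works wherever the parser itself runs.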

Sounds great!
WOULD YOU PLEASE PLEASE PLEASE MODIFY THE AJAXIAN CSS SO THAT THE CONTENT AREA DOESN’T OVERFLOW FOR CODE AND VIDEOS? PLEASE?
PLEASE?
I *assume* you’re describing an (X)HTML parser… ;)
Since HTML parsing would basically be a nightmare…!!
@polterguy: Nope, this is an HTML parser (can handle some pretty bad HTML!). Check out the blog post for a full list of the edge cases that it handles (not everything, but it makes a good dent).
We began work a while ago on a C# routine similar in principle to this:
http://serifcms.blogspot.com/2008/02/html-fixer.html
It was designed to fix poorly marked-up XML and any HTML pasted into an RTE from code generators, and it works by using events fired by opening and closing markup elements. We needed this code so that we could save anything into our XML database and manipulate it.
Thanks. Good article.