Monday, May 5th, 2008

HTML Parser in JavaScript

Category: HTML, JavaScript

John Resig must have had some downtime on Sunday afternoon, as he implemented an HTML parser in JavaScript. The library, which you can play with via this demo, lets you attack HTML in a few ways:

A SAX-style API

Handles tags, text, and comments with callbacks. For example, say you wanted to implement a simple HTML-to-XML serialization scheme. You could do so using the following:

javascript

  var results = "";

  HTMLParser("<p id=test>hello <i>world", {
    // Called for each opening tag; unary marks a self-closing tag
    start: function( tag, attrs, unary ) {
      results += "<" + tag;

      for ( var i = 0; i < attrs.length; i++ )
        results += " " + attrs[i].name + '="' + attrs[i].escaped + '"';

      results += (unary ? "/" : "") + ">";
    },
    // Called for each closing tag, including the ones the input left out
    end: function( tag ) {
      results += "</" + tag + ">";
    },
    // Called for each run of text
    chars: function( text ) {
      results += text;
    },
    // Called for each comment
    comment: function( text ) {
      results += "<!--" + text + "-->";
    }
  });

  results == '<p id="test">hello <i>world</i></p>'
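
The same four callbacks can serve other jobs besides serialization. The sketch below collects plain text and counts opening tags. The makeTextCollector name and its text() and counts() accessors are hypothetical, and the handler is driven by hand with the event order the parser would presumably produce for the input above, so the snippet runs without the library loaded.

```javascript
// Hypothetical helper (not part of htmlparser.js): a handler object with
// the same callback names as above that gathers text and counts tags.
function makeTextCollector() {
  var text = "";
  var tagCounts = {};

  return {
    start: function (tag, attrs, unary) {
      // Count every opening tag by name
      tagCounts[tag] = (tagCounts[tag] || 0) + 1;
    },
    end: function (tag) {
      // Nothing to do for closing tags in this sketch
    },
    chars: function (chunk) {
      // Text arrives in pieces; join them as they come
      text += chunk;
    },
    comment: function (chunk) {
      // Comments are ignored on purpose
    },
    text: function () { return text; },
    counts: function () { return tagCounts; }
  };
}

// Drive the handler by hand. The order below is an assumption about what
// the parser emits for "<p id=test>hello <i>world".
var collector = makeTextCollector();
collector.start("p", [{ name: "id", escaped: "test" }], false);
collector.chars("hello ");
collector.start("i", [], false);
collector.chars("world");
collector.end("i");
collector.end("p");

var collectedText = collector.text();
var collectedCounts = collector.counts();
```

With the library loaded, you would presumably pass the collector as the second argument to HTMLParser instead of calling its methods yourself; whether the parser tolerates the extra accessor properties is an assumption.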

XML Serializer

Now, there’s no need to worry about implementing the above, since it’s included directly in the library, as well. Just feed in HTML and it spits back an XML string.

javascript

  var results = HTMLtoXML("<p>Data: <input disabled/>");
  results == '<p>Data: <input disabled="disabled"/></p>'
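
The example above shows a valueless attribute (disabled) being expanded to disabled="disabled" so the output is well-formed XML. A minimal sketch of that idea follows. The serializeAttr and escapeAttr names and the short attribute list are assumptions made for illustration, not the library's own code.

```javascript
// Illustrative subset of HTML attributes that may appear without a value.
// The real library may use a different or longer list.
var BOOLEAN_ATTRS = { checked: true, disabled: true, readonly: true, selected: true };

// Escape the characters that cannot appear raw in a double-quoted XML attribute.
function escapeAttr(value) {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/"/g, "&quot;");
}

// Hypothetical helper: turn one attribute into name="value" form,
// filling in the name as the value for valueless boolean attributes.
function serializeAttr(name, value) {
  var isBoolean = Object.prototype.hasOwnProperty.call(BOOLEAN_ATTRS, name);
  if ((value === undefined || value === "") && isBoolean) {
    value = name;
  }
  return name + '="' + escapeAttr(value || "") + '"';
}

var filled = serializeAttr("disabled");
var escaped = serializeAttr("title", 'a "b" & c');
```

Here filled holds disabled="disabled", matching the serializer output shown above, and escaped shows quotes and ampersands being replaced by entities.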

DOM Builder

If you’re using the HTML parser to inject into an existing DOM document (or within an existing DOM element) then htmlparser.js provides a simple method for handling that:

javascript

  // The following is appended into the document body
  HTMLtoDOM("<p>Hello <b>World", document);

  // The following is appended into the specified element
  HTMLtoDOM("<p>Hello <b>World", document.getElementById("test"));

DOM Document Creator

This is a more advanced version of the DOM builder: it includes logic for handling the overall structure of a web page, and it returns a new DOM document.

A couple of points are enforced by this method:

  • There will always be an html, head, body, and title element.
  • There will only be one html, head, body, and title element (if the user specifies more, they will be moved to the appropriate locations and merged).
  • link and base elements are forced into the head.

You would use the method like so:

javascript

  var dom = HTMLtoDOM("<p>Data: <input disabled/>");
  dom.getElementsByTagName("body").length == 1
  dom.getElementsByTagName("p").length == 1
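
The three rules listed above can be turned into a small checker for any document-like object. This is a sketch only: checkDocumentStructure is a hypothetical name, and it relies on nothing beyond the standard getElementsByTagName, parentNode, and nodeName members.

```javascript
// Hypothetical checker: returns true when a document satisfies the rules
// listed above (exactly one html, head, body, and title element, and every
// link and base element sitting directly inside the head).
function checkDocumentStructure(doc) {
  var singletons = ["html", "head", "body", "title"];
  for (var i = 0; i < singletons.length; i++) {
    if (doc.getElementsByTagName(singletons[i]).length !== 1) {
      return false;
    }
  }

  var headOnly = ["link", "base"];
  for (var j = 0; j < headOnly.length; j++) {
    var nodes = doc.getElementsByTagName(headOnly[j]);
    for (var k = 0; k < nodes.length; k++) {
      var parent = nodes[k].parentNode;
      // nodeName is upper case in HTML documents, so compare case-insensitively
      if (!parent || parent.nodeName.toLowerCase() !== "head") {
        return false;
      }
    }
  }

  return true;
}
```

In a browser with the library loaded, checkDocumentStructure(dom) on the document created above would be expected to return true if the method behaves as described.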

One place you could use this API is on the server side, for example with Aptana Jaxer. You could also interface directly with Java, or just use the Mozilla utilities directly.

Posted by Dion Almaer at 10:51 am

4 Comments


Sounds great!

Comment by ViniciusCamara — May 5, 2008

WOULD YOU PLEASE PLEASE PLEASE MODIFY THE AJAXIAN CSS SO THAT THE CONTENT AREA DOESN’T OVERFLOW FOR CODE AND VIDEOS? PLEASE?

PLEASE?

Comment by Trevor — May 5, 2008

I *assume* you’re describing an (X)HTML parser… ;)
Since HTML parsing would basically be a nightmare…!!

Comment by polterguy — May 5, 2008

@polterguy: Nope, this is an HTML parser (can handle some pretty bad HTML!). Check out the blog post for a full list of the edge cases that it handles (not everything, but it makes a good dent).

Comment by JohnResig — May 5, 2008
