
Monday, May 5th, 2008

HTML Parser in JavaScript

Category: HTML, JavaScript

John must have had some downtime on Sunday afternoon, as he implemented an HTML parser in JavaScript. The library, which you can play with via this demo, lets you attack HTML in a few ways:

A SAX-style API

Handles tags, text, and comments with callbacks. For example, say you wanted to implement a simple HTML-to-XML serialization scheme. You could do so using the following:

JAVASCRIPT:
  var results = "";

  HTMLParser("<p id=test>hello <i>world", {
    // Called for each opening tag; unary is true for self-closing tags
    start: function( tag, attrs, unary ) {
      results += "<" + tag;

      for ( var i = 0; i < attrs.length; i++ )
        results += " " + attrs[i].name + '="' + attrs[i].escaped + '"';

      results += (unary ? "/" : "") + ">";
    },
    // Called for each closing tag, including ones the parser closes for you
    end: function( tag ) {
      results += "</" + tag + ">";
    },
    // Called for each run of text
    chars: function( text ) {
      results += text;
    },
    // Called for each comment
    comment: function( text ) {
      results += "<!--" + text + "-->";
    }
  });

  results == '<p id="test">hello <i>world</i></p>'
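The same four callbacks can collect something other than a string. Here is a minimal sketch that records a nested outline of tags and text. The helper name makeOutlineHandler is hypothetical and not part of the library. To stay runnable without htmlparser.js, the callbacks are invoked by hand in the order the example input above would be expected to produce them. The sketch also assumes that unary tags get no matching end event.

```javascript
// Hypothetical helper (not part of htmlparser.js): returns a handler object
// with the start/end/chars/comment callbacks and records a nested outline.
function makeOutlineHandler() {
  const root = { tag: "#root", children: [] };
  const stack = [root];

  return {
    root,
    start(tag, attrs, unary) {
      const node = { tag, children: [] };
      stack[stack.length - 1].children.push(node);
      // Assumption: unary tags get no end event, so they are not pushed.
      if (!unary) stack.push(node);
    },
    end(tag) {
      // Step back to the parent, never popping the synthetic root.
      if (stack.length > 1) stack.pop();
    },
    chars(text) {
      stack[stack.length - 1].children.push(text);
    },
    comment(text) {
      // Comments are ignored in this outline.
    }
  };
}

// Simulated events for "<p id=test>hello <i>world" (assumed event order).
const handler = makeOutlineHandler();
handler.start("p", [{ name: "id", escaped: "test" }], false);
handler.chars("hello ");
handler.start("i", [], false);
handler.chars("world");
handler.end("i");
handler.end("p");
const outline = handler.root.children;
```

With the library loaded, the same handler object could presumably be passed as the second argument to HTMLParser instead of being driven by hand.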

XML Serializer

Now, there's no need to worry about implementing the above, since it's included directly in the library, as well. Just feed in HTML and it spits back an XML string.

JAVASCRIPT:
  var results = HTMLtoXML("<p>Data: <input disabled/>");
  results == '<p>Data: <input disabled="disabled"/></p>'
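Note how the bare disabled attribute comes back as disabled="disabled" in the XML output. A rough sketch of that one step is shown here. The function attrToXML is a hypothetical illustration rather than the library's own code, and the escaping rules are an assumption.

```javascript
// Hypothetical helper (not part of htmlparser.js): serialize one attribute
// as name="value", filling in a value for bare attributes.
function attrToXML(name, value) {
  // A bare attribute such as `disabled` has no value, so reuse its name.
  const raw = value === undefined || value === null ? name : String(value);
  // Assumed escaping rules; text that already contains entities would be
  // escaped a second time by the first replace.
  const escaped = raw
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/"/g, "&quot;");
  return name + '="' + escaped + '"';
}

const bare = attrToXML("disabled");             // 'disabled="disabled"'
const valued = attrToXML("title", 'say "hi"');  // 'title="say &quot;hi&quot;"'
```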

DOM Builder

If you're using the HTML parser to inject into an existing DOM document (or within an existing DOM element) then htmlparser.js provides a simple method for handling that:

JAVASCRIPT:
  // The following is appended into the document body
  HTMLtoDOM("<p>Hello <b>World", document)

  // The following is appended into the specified element
  HTMLtoDOM("<p>Hello <b>World", document.getElementById("test"))

DOM Document Creator

This is a more-advanced version of the DOM builder - it includes logic for handling the overall structure of a web page, returning a new DOM document.

A couple of points are enforced by this method:

  • There will always be an html, head, body, and title element.
  • There will only be one html, head, body, and title element (if the user specifies more, they will be moved to the appropriate locations and merged).
  • link and base elements are forced into the head.

You would use the method like so:

JAVASCRIPT:
  var dom = HTMLtoDOM("<p>Data: <input disabled/>");
  dom.getElementsByTagName("body").length == 1
  dom.getElementsByTagName("p").length == 1
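If you wanted to verify the "exactly one html, head, body, and title" guarantee yourself, a small check could look like the sketch below. The function checkPageShape is hypothetical. It only needs an object with getElementsByTagName, so a stub stands in for a real document here, and it does not cover the rule about link and base elements.

```javascript
// Hypothetical helper: reports every structural element that does not
// appear exactly once in the given document-like object.
function checkPageShape(doc) {
  const issues = [];
  for (const tag of ["html", "head", "body", "title"]) {
    const count = doc.getElementsByTagName(tag).length;
    if (count !== 1) issues.push(tag + ": expected 1, found " + count);
  }
  return issues;
}

// Stub standing in for a real DOM document, for illustration only.
const stubDoc = {
  counts: { html: 1, head: 1, body: 1, title: 1, p: 1 },
  getElementsByTagName(tag) {
    return { length: this.counts[tag] || 0 };
  }
};

const problems = checkPageShape(stubDoc); // [] when the shape is as described
```

In a browser with the library loaded, one would presumably pass the result of HTMLtoDOM to the same function.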

One place where you could use this API is on the server side, for example with Aptana Jaxer. You could also interface directly with Java, or just use the Mozilla utilities directly.

Posted by Dion Almaer at 10:51 am


6 Comments

Sounds great!

Comment by ViniciusCamara — May 5, 2008

WOULD YOU PLEASE PLEASE PLEASE MODIFY THE AJAXIAN CSS SO THAT THE CONTENT AREA DOESN’T OVERFLOW FOR CODE AND VIDEOS? PLEASE?

PLEASE?

Comment by Trevor — May 5, 2008

I *assume* you’re describing an (X)HTML parser… ;)
Since HTML parsing would basically be a nightmare…!!

Comment by polterguy — May 5, 2008

@polterguy: Nope, this is an HTML parser (can handle some pretty bad HTML!). Check out the blog post for a full list of the edge cases that it handles (not everything, but it makes a good dent).

Comment by JohnResig — May 5, 2008

We began work a while ago on a C# routine similar in principle to this:

http://serifcms.blogspot.com/2008/02/html-fixer.html

It was designed to fix poorly marked xml and any html pasted into an RTE from code generators and works by using events fired by opening and closing markup elements. We needed this code so that we could save anything into our xml database and manipulate it.

Comment by 800px — May 6, 2008

Thanks. Good article.

Comment by videoizle — May 8, 2008
