Friday, April 16th, 2010

Forgiving HTML Parser for Node and Browsers

Category: HTML, JavaScript

Chris Winberry needed an HTML parser for a project he was working on and started to use John’s parser, but found it a touch too strict for some of the HTML he was using (sloppy HTML? never). It was also too heavy to run on a server that would see considerable traffic, and so, being lazy, he wrote a new one from the ground up that is both lightweight (extremely simple DOM) and very forgiving.

Which brings us to node-htmlparser, which works in both Node:

    var sys = require("sys"); // needed for sys.puts and sys.inspect below
    var htmlparser = require("node-htmlparser");
    var rawHtml = "Xyz <script language= javascript>var foo = '< <bar>>';< /  script><!--<!-- Waah! -- -->";
    // The callback runs once parsing has finished (or failed)
    var handler = new htmlparser.DefaultHandler(function (error) {
        if (error) {
            // ...do something for errors...
        } else {
            // ...parsing done, do something...
        }
    });
    var parser = new htmlparser.Parser(handler);
    parser.ParseComplete(rawHtml);
    // Print the resulting DOM at full depth
    sys.puts(sys.inspect(handler.dom, false, null));

and on a modern browser:

    // The callback runs once parsing has finished (or failed)
    var handler = new Tautologistics.NodeHtmlParser.DefaultHandler(function (error) {
        if (error) {
            // ...do something for errors...
        } else {
            // ...parsing done, do something...
        }
    });
    var parser = new Tautologistics.NodeHtmlParser.Parser(handler);
    parser.ParseComplete(document.body.innerHTML);
    // Show the resulting DOM as indented JSON
    alert(JSON.stringify(handler.dom, null, 2));
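Both examples above end by dumping handler.dom. As a rough sketch of what consuming that structure could look like, the snippet below walks a hand-built tree and collects tag names. The node shape used here (type, name, data, children) is an assumption made for illustration and is not spelled out in this post, and sampleDom and collectTagNames are hypothetical names, so check the library's real output before relying on it.

```javascript
// ASSUMPTION: nodes look like { type, name, data, children }.
// This tree is hand-built for illustration, not actual parser output.
var sampleDom = [
  { type: "text", data: "Xyz " },
  { type: "tag", name: "div", children: [
    { type: "text", data: "Hello" },
    { type: "tag", name: "b", children: [{ type: "text", data: "world" }] }
  ] }
];

// Hypothetical helper: collect the names of all tag nodes, depth-first.
function collectTagNames(nodes, found) {
  found = found || [];
  for (var i = 0; i < nodes.length; i++) {
    var node = nodes[i];
    if (node.type === "tag") {
      found.push(node.name);
    }
    if (node.children) {
      collectTagNames(node.children, found);
    }
  }
  return found;
}

var tagNames = collectTagNames(sampleDom);
```

Because the helper only reads plain objects and arrays, the same function would run unchanged in Node and in the browser.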

Posted by Dion Almaer at 9:49 am
5 Comments


Right on time, just what I needed! Thanks!

Comment by jeanph01 — April 16, 2010

God forbid a consistent naming convention……

Comment by TNO — April 16, 2010

@TNO – Thanks for the constructive criticism but would you care to elaborate a little more?

If it’s the “htmlparser” vs. “Tautologistics.NodeHtmlParser”, it is trivial to maintain server/browser parity by doing this for the require() in Node:

var Tautologistics = { NodeHtmlParser: require("node-htmlparser") }

…then you’ve got the exact same thing for both server and browser. In Node, a module (included script) is loaded into its own scope and, therefore, cannot access the global scope (in this case, it cannot define “Tautologistics.NodeHtmlParser”).

Comment by Tautologistics — April 16, 2010
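The aliasing trick in the comment above can be sketched as a small helper that returns the same API object in either environment. This is only an illustration: resolveHtmlParser is a hypothetical name and is not part of node-htmlparser, and the global object and require function are passed in as arguments so the logic can be exercised without the library installed.

```javascript
// Hypothetical helper (not part of node-htmlparser): return the parser API
// whether it was exposed as a browser global or must be loaded via require().
function resolveHtmlParser(globalObj, requireFn) {
  // Browser: the script defines the Tautologistics.NodeHtmlParser namespace.
  if (globalObj.Tautologistics && globalObj.Tautologistics.NodeHtmlParser) {
    return globalObj.Tautologistics.NodeHtmlParser;
  }
  // Node: a module cannot define globals, so it has to be required explicitly.
  if (typeof requireFn === "function") {
    return requireFn("node-htmlparser");
  }
  throw new Error("node-htmlparser is not available in this environment");
}
```

In Node this might be called as resolveHtmlParser({}, require), and in a browser as resolveHtmlParser(window, null); either way the calling code then uses the returned object's DefaultHandler and Parser without caring where it came from.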

@Tautologistics
A call like “parser.ParseComplete(rawHtml)” looks very odd.

The Java naming convention (uppercase class names, lowercase methods) is very important for JavaScript.

Without reading the entire code, it is difficult to distinguish constructors from ordinary methods in your code.

Comment by Szsz — April 16, 2010

Gotcha and thanks. I’ve updated the code accordingly.

Comment by Tautologistics — April 16, 2010
