Friday, April 16th, 2010
Forgiving HTML Parser for Node and Browsers
<p>Chris Winberry needed an HTML parser for a project he was working on and started to use John's parser, but found it a touch too strict for some of the HTML he was handling (sloppy HTML? never). It was also too heavy to run on a server that would see considerable traffic, and so, being lazy, he wrote a new one from the ground up that is both lightweight (extremely simple DOM) and very forgiving. Which brings us to node-htmlparser, which works in both Node:

var sys = require("sys"); // needed for sys.puts and sys.inspect below
var htmlparser = require("node-htmlparser");

var rawHtml = "Xyz <script language= javascript>var foo = '<<bar>>';</ script><!--<!-- Waah! -- -->";
var handler = new htmlparser.DefaultHandler(function (error) {
    if (error) {
        // [...do something for errors...]
    } else {
        // [...parsing done, do something...]
    }
});
var parser = new htmlparser.Parser(handler);
parser.ParseComplete(rawHtml);
sys.puts(sys.inspect(handler.dom, false, null));

and on a modern browser:

var handler = new Tautologistics.NodeHtmlParser.DefaultHandler(function (error) {
    if (error) {
        // [...do something for errors...]
    } else {
        // [...parsing done, do something...]
    }
});
var parser = new Tautologistics.NodeHtmlParser.Parser(handler);
parser.ParseComplete(document.body.innerHTML);
alert(JSON.stringify(handler.dom, null, 2));
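
In both environments the parsed result ends up in handler.dom. As a rough sketch of what you might do with it, the function below collects the tag names found in a tree. It assumes each node is a plain object with a `type`, an optional `name`, and an optional `children` array. That shape, the `collectTagNames` helper, and the hand-written `sampleDom` are illustrative assumptions rather than anything stated in the post, so check them against the output of your own run.

```javascript
// Hypothetical helper: walk a parsed tree and gather element names in document order.
// Assumed node shape (not confirmed by the post): { type, name?, children? }.
function collectTagNames(nodes, found) {
    found = found || [];
    for (var i = 0; i < nodes.length; i++) {
        var node = nodes[i];
        if (node.name) {
            found.push(node.name); // text and comment nodes are assumed to have no name
        }
        if (node.children) {
            collectTagNames(node.children, found); // recurse into nested nodes
        }
    }
    return found;
}

// Hand-written stand-in for handler.dom, used only to exercise the walker.
var sampleDom = [
    { type: "text", data: "Xyz " },
    { type: "tag", name: "div", children: [
        { type: "tag", name: "span", children: [{ type: "text", data: "hi" }] }
    ] },
    { type: "comment", data: " Waah! " }
];

var tagNames = collectTagNames(sampleDom);
```

With the real library you would pass handler.dom instead of `sampleDom`, inside the "parsing done" branch of the handler callback.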

Right on time, just what I needed! Thanks!
God forbid a consistent naming convention……
@TNO – Thanks for the constructive criticism but would you care to elaborate a little more?
If it’s the “htmlparser” vs. “Tautologistics.NodeHtmlParser” difference, it is trivial to maintain server/browser parity by doing this for the require() in Node:
var Tautologistics = { NodeHtmlParser: require("node-htmlparser") };
Then you’ve got the exact same thing for both server and browser. In Node, a module (included script) is loaded into its own scope and therefore cannot access the global scope (in this case, it cannot define “Tautologistics.NodeHtmlParser”).
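
One way to wrap that idea so the same calling code runs in both places is a small lookup function. This is only a sketch: `resolveParserNamespace`, its two parameters, and `fakeModule` are hypothetical names introduced here, and the stand-in objects exist only so the lookup can be exercised without the real library.

```javascript
// Hypothetical helper: return the parser namespace in either environment.
// globalObj and requireFn are passed in explicitly so the lookup can be tested in isolation.
function resolveParserNamespace(globalObj, requireFn) {
    // Browser case: the script is assumed to have defined Tautologistics.NodeHtmlParser globally.
    if (globalObj && globalObj.Tautologistics && globalObj.Tautologistics.NodeHtmlParser) {
        return globalObj.Tautologistics.NodeHtmlParser;
    }
    // Node case: fall back to loading the module by name.
    if (typeof requireFn === "function") {
        return requireFn("node-htmlparser");
    }
    throw new Error("node-htmlparser is not available in this environment");
}

// Stand-ins for demonstration only; they are not the real library.
var fakeModule = { Parser: function () {}, DefaultHandler: function () {} };

var fromBrowser = resolveParserNamespace(
    { Tautologistics: { NodeHtmlParser: fakeModule } },
    null
);
var fromNode = resolveParserNamespace({}, function (name) {
    return name === "node-htmlparser" ? fakeModule : null;
});
```

In real code the arguments would be something like `typeof window !== "undefined" ? window : {}` and `typeof require === "function" ? require : null`, after which `new ns.Parser(handler)` reads the same on server and browser.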
@Tautologistics
A call like “parser.ParseComplete(rawHtml)” looks very odd.
The Java naming convention (uppercase class names, lowercase method names) matters a lot in JavaScript.
Without reading the entire code, it is difficult to distinguish constructors from ordinary methods.
Gotcha and thanks. I’ve updated the code accordingly.