Wednesday, March 10th, 2010

HTML Minification

Category: HTML, Performance

Good old Kangax has been playing with HTML minification and has shared his new tool at an early stage.

What does it do?

Kangax has forked John Resig’s HTML parser, which parses the HTML and feeds the result into the minifier. The minifier applies rules such as whitespace optimization, comment removal, and collapsing boolean attributes (e.g. disabled="true" -> disabled).
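To make those three rules concrete, here is a rough sketch of what each one does. It is not Kangax's code: his tool works on the output of a real HTML parser, while this version uses regular expressions for brevity, and the function names and the attribute list are made up for illustration. It also assumes the input has no pre, textarea, script or style content whose whitespace must be preserved.

```javascript
// Hypothetical, regex-based sketch of the three rules; the real tool is parser-based.
// Illustrative subset of boolean attributes, not the tool's actual list.
const BOOLEAN_ATTRIBUTES = ["checked", "disabled", "selected", "readonly", "multiple"];

// Remove ordinary comments, but keep IE conditional comments (<!--[if IE]>...<![endif]-->).
function removeComments(html) {
  return html.replace(/<!--(?!\[if\s)[\s\S]*?-->/g, "");
}

// Collapse every run of whitespace to a single space and trim both ends.
// Assumption: nothing in the input depends on exact whitespace.
function collapseWhitespace(html) {
  return html.replace(/\s+/g, " ").trim();
}

// Turn disabled="true" or disabled="disabled" into a bare "disabled".
// Assumption: the pattern only ever occurs inside tags, never in text content.
function collapseBooleanAttributes(html) {
  const pattern = new RegExp(
    "\\s(" + BOOLEAN_ATTRIBUTES.join("|") + ")\\s*=\\s*(\"[^\"]*\"|'[^']*'|[^\\s>]+)",
    "gi"
  );
  return html.replace(pattern, " $1");
}

function minify(html) {
  return collapseBooleanAttributes(collapseWhitespace(removeComments(html)));
}

console.log(minify('<!-- note -->\n<input   type="text"   disabled="true">'));
// → <input type="text" disabled>
```

A regex pass like this is easy to fool (for example by a ">" inside an attribute value), which is a good argument for doing the work on parser output instead.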

He also has a linter going:

While working on minifier, I realized that oftentimes the most wasteful part of the markup is not white space, comments or boolean attributes, but inline styles, scripts, presentational or deprecated elements and attributes. None of these can be simply stripped, as that could affect state of the document and is just too obtrusive. What can be done, however, is reporting of these occurrences to the user. HTMLLint is an even smaller script, whose job is exactly that: to log any deprecated or presentational elements/attributes encountered during parsing. Additionally, it detects event attributes (e.g. onclick, onmouseover, etc.). The rationale for this is that moving contents of event attributes to an external script allows one to take advantage of resource caching.
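The linting idea can be sketched in a few lines as well. Again, this is an illustration and not HTMLLint itself: the function name, the message format and the short element/attribute lists are assumptions, and tags are found with a regular expression where the real script reports from the parser.

```javascript
// Hypothetical lint sketch; the lists are small illustrative subsets, not HTMLLint's own.
const DEPRECATED_ELEMENTS = ["font", "center", "big", "strike", "tt"];
const PRESENTATIONAL_ATTRIBUTES = ["align", "bgcolor", "border", "color", "face"];

// Returns a list of warning strings; it reports only and never modifies the markup.
// Assumption: attribute values contain no ">" characters.
function lint(html) {
  const warnings = [];
  const tagPattern = /<([a-z][a-z0-9]*)([^>]*)>/gi;
  let tag;
  while ((tag = tagPattern.exec(html)) !== null) {
    const element = tag[1].toLowerCase();
    if (DEPRECATED_ELEMENTS.includes(element)) {
      warnings.push("deprecated element: <" + element + ">");
    }
    // Consume each attribute together with its value so value text is never misread as a name.
    const attrPattern = /\s+([a-z][a-z0-9-]*)(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+))?/gi;
    let attr;
    while ((attr = attrPattern.exec(tag[2])) !== null) {
      const name = attr[1].toLowerCase();
      if (PRESENTATIONAL_ATTRIBUTES.includes(name)) {
        warnings.push("presentational attribute: " + name + " on <" + element + ">");
      } else if (/^on[a-z]+$/.test(name)) {
        // Heuristic: any attribute named on* is treated as an event attribute.
        warnings.push("event attribute: " + name + " on <" + element + ">");
      }
    }
  }
  return warnings;
}

console.log(lint('<font color="red" onclick="go()">hi</font>'));
```

Returning the warnings instead of logging them directly keeps the check reusable, for instance in a build step that fails when the list is not empty.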


Posted by Dion Almaer at 6:14 am
17 Comments


A tool for converting valid HTML to invalid HTML?

Comment by halan — March 10, 2010

…and what serious developer is going to use both deprecated HTML content AND an HTML parser/checker?

Comment by sixtyseconds — March 10, 2010

…or for that matter, why would they use event attributes?

Comment by sixtyseconds — March 10, 2010

I gotta say, call me old fashioned but I like my HTML readable :)

Comment by iliad — March 10, 2010

@kangax, Magnolia CMS already uses an HTML minifier (as do some other CMS systems). I suggest you track those down and ‘borrow/port’ their algorithms.

@rest, HTML minifiers are generally used to optimize pages before server-side caching. This is commonly abstracted away from developers and adopted by platforms (CMS/CRM/portals). My guess would be that Kangax is building this for server-side JS. It has zero value for development unless you deploy sites as raw HTML files.

As usual, the scope is grander than most Ajaxian readers comprehend or care about.

Comment by BenGerrissen — March 10, 2010

@kangax, though established HTML minifiers are generally built by clueless backend developers who think HTML equals XML and tend to put a SAX parser in there somewhere… grrr, though the Magnolia team *fixed* that since Magnolia 3.0, I believe.

Comment by BenGerrissen — March 10, 2010

Here is an XSL version:

Comment by RobKoberg — March 10, 2010

And here it is escaped:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
  <xsl:strip-space elements="*"/>
  <xsl:output indent="no" method="xhtml"/>
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>

Comment by RobKoberg — March 10, 2010

Wow, congrats on getting the single line version of IE’s conditional comments working. I saw the comment saying that conditional comments worked but very few people even know about/check the single line form.

PLEASE make a setting to disable HTML validation/attribute removal. I use HTML Tidy to process our content server side, but the number of hoops I’ve had to jump through to support ARIA, WebForms 2, new HTML5 elements, HTML5 custom data attributes, etc. has made me start to rethink the whole server-side processing thing. I’ve not come across a parser that will support these things. It’s almost maddening :(

Comment by blepore — March 10, 2010

Here’s another HTML minifier: http://willpeavy.com/minifier/

Comment by WillPeavy — March 10, 2010

It’s amusing to watch javascript developers bicker over 10 bytes of JS when HTML has so much room for optimization, not to mention it’s required on each page view, whereas JS stays cached.
This minifier is a great new tool; I’m looking forward to seeing people hooking it up to some automated build processes or output hooks.

Comment by PaulIrish — March 10, 2010

HTML minifiers have been around for a long time. And yes, even ones written in JavaScript. I wrote one about a year ago for Aptana Jaxer, so I really don’t see what is new or different about this, or why this is even news. It’s not groundbreaking…

Comment by V1 — March 10, 2010

What benefits would this approach have over GZipping the output?

Comment by waltr — March 10, 2010

@waltr, if you are gzipping there’s no reason not to continue. This basically removes white space and redundant data (such as comments), so it would be something you do in addition (JS is generally minified and served gzipped).

As for reading the HTML, IMO minification should be part of a compilation process, be that at deployment or at first run (if you have a more sophisticated runtime, which most sites do these days). This way you continue to enjoy the source files at dev time, but produce a ‘compiled’ result for machine communication.

If you do find yourself in a situation where you need to debug the live site’s generated data, you should consider having a toggle in the app (such as Debug/Release).

Comment by meandmycode — March 10, 2010

@BenGerrissen
Thanks. I’ll take a look at Magnolia CMS. And yes, I wish more people would understand the difference between HTML and XHTML (and ultimately, the pointlessness of serving XHTML-like markup in such a way that browsers still end up parsing it as HTML). Even the first 2 comments in this thread demonstrate some of the misconceptions floating around.

@PaulIrish
Very good point about mobile browsers :)

@V1
I’d love to see your minifier. Before writing mine, I looked around but didn’t find that many. And even those I found were either not as efficient as they could have been, or overly aggressive (e.g. Google’s PageSpeed).

Comment by kangax — March 11, 2010

Cool work kangax. While I don’t think a client-side or server-side-JS HTML minifier has tooooo many practical applications, it’s definitely not enough to just gzip HTML, and call it a day @waltr.

Semantically stripping out stuff like comments, invalid attributes, moving inline CSS that should/could go elsewhere etc., and doing other cleanup before caching (as Ben mentioned above) is pretty great stuff.

@meandmycode, great idea re: client-side toggle

Comment by dariusz — March 12, 2010

This is a useful tool for the SEOer, but it will not be welcomed by the web designer.

Comment by sbmglobal — March 31, 2010
