Thursday, September 25th, 2008

HTML Whitelist: Sanitize your markup

Category: HTML, Security

HTML Whitelist is the latest “cool little Python Web service thrown up on App Engine” from my good colleague DeWitt Clinton.

It does one thing, and it does it well. You can pass the service HTML and it will return a sanitized version.

For example:

  // original
  The <strong>quick</strong> brown fox <script src="http://evil.com"> jumps <kbd>over</kbd> the <em>lazy</em> dog.

  // converted to
  The <strong>quick</strong> brown fox &lt;script src=&quot;http://evil.com&quot;&gt; jumps <kbd>over</kbd> the <em>lazy</em> dog.
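
To make the conversion above concrete, here is a minimal Python sketch of whitelist-based escaping. It is not the service's code (per the comments, the service wraps html5lib); the ALLOWED_TAGS set, the BARE_TAG pattern and the sanitize function are all assumptions made up for illustration. It passes through only attribute-free tags whose names are on the whitelist and escapes everything else, which happens to reproduce the example output. A regular expression is not a real HTML parser, and this sketch would double-escape text that already contains entities.

```python
import html
import re

# Hypothetical whitelist; the real service's list of allowed tags is not given in the post.
ALLOWED_TAGS = {"strong", "em", "kbd", "p"}

# Matches only bare opening or closing tags with no attributes, e.g. <em> or </em>.
BARE_TAG = re.compile(r"</?([a-zA-Z][a-zA-Z0-9]*)\s*>")


def sanitize(markup: str) -> str:
    """Keep whitelisted bare tags verbatim and HTML-escape everything else."""
    out = []
    pos = 0
    for match in BARE_TAG.finditer(markup):
        # Text between recognised tags (including tags that carry attributes) is escaped.
        out.append(html.escape(markup[pos:match.start()]))
        tag = match.group(0)
        if match.group(1).lower() in ALLOWED_TAGS:
            out.append(tag)
        else:
            out.append(html.escape(tag))
        pos = match.end()
    out.append(html.escape(markup[pos:]))
    return "".join(out)
```

Escaping anything that is not explicitly allowed is the point of a whitelist: a tag the sketch does not recognise ends up as visible text instead of live markup.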

There are a bunch of options. You can pass in the HTML directly or pass a URL to the content, use JSON or JSONP, and choose between different encoding options.
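
For readers unfamiliar with the difference between JSON and JSONP output, the following Python sketch shows the general idea. It is not the service's API: the "html" field name, the render_response function and the callback handling are assumptions for illustration only. JSONP simply wraps the JSON body in a call to a function named by the caller, so the callback name should be validated before it is echoed back.

```python
import json
import re

# Hypothetical rule: allow only dotted JavaScript identifiers as callback names.
_CALLBACK_NAME = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*(\.[A-Za-z_$][A-Za-z0-9_$]*)*")


def render_response(sanitized_html, callback=None):
    """Return a JSON body, or a JSONP body when a callback name is supplied."""
    body = json.dumps({"html": sanitized_html})  # "html" is an assumed field name
    if callback is None:
        return body
    if not _CALLBACK_NAME.fullmatch(callback):
        raise ValueError("invalid JSONP callback name")
    return f"{callback}({body})"
```

Without a callback the caller gets plain JSON; with one, the same payload arrives wrapped so that a script tag on another domain can load it.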

Posted by Dion Almaer at 2:54 am
7 Comments


Since I often have to deal with third-party content, I wrote a whitelist-based HTML sanitizer in JS. You can see it in action, and check the source code, in the feeds preview of Opera 9.6.

In a nutshell, it has a whitelist of nodeNames, each with its own whitelist of attributes, and each attribute has a handler function that validates its value and can touch other attributes if needed.

Comment by p01 — September 25, 2008
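
The structure p01 describes (a whitelist of node names, each with a whitelist of attributes and a validator per attribute) can be sketched roughly as follows. This is a Python illustration of the data structure only, not p01's JavaScript; WHITELIST, is_safe_url, any_text and filter_attributes are hypothetical names, and the part where a handler adjusts other attributes is left out.

```python
from urllib.parse import urlparse


def is_safe_url(value):
    """Accept only absolute http(s) URLs; relative URLs are rejected in this sketch."""
    return urlparse(value.strip()).scheme.lower() in ("http", "https")


def any_text(value):
    """Accept any attribute value."""
    return True


# node name -> {attribute name -> validator function}; the contents are assumptions.
WHITELIST = {
    "a": {"href": is_safe_url, "title": any_text},
    "img": {"src": is_safe_url, "alt": any_text},
    "strong": {},
    "em": {},
}


def filter_attributes(node_name, attributes):
    """Return the attributes worth keeping, or None if the node itself is not allowed."""
    allowed = WHITELIST.get(node_name.lower())
    if allowed is None:
        return None
    kept = {}
    for name, value in attributes.items():
        validator = allowed.get(name.lower())
        # Unknown attributes (e.g. onclick) have no validator and are dropped.
        if validator is not None and validator(value):
            kept[name.lower()] = value
    return kept
```

With this layout, event handlers such as onclick are dropped simply because they never appear in any attribute whitelist.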

Hi – this is a hot topic with people I’m talking to regarding user-generated content. One thing I wonder is whether some elements such as -script- should be removed rather than left in the page, as it were. Like strip_tags in PHP, allowing a paragraph tag but stripping event handlers.

Comment by theboydan — September 25, 2008

Great! This is exactly what I need for custom-markup shapes in my current project http://blok.appspot.com/ because iframes are not an option as a security model (because I need to receive JS events from the shapes, and because App Engine does not yet support wildcard subdomains that would enable ‘building’ cross-domain boundaries).

Comment by Malde — September 25, 2008

Great idea, though I hope everyone realizes they *must* do this on the server after the HTML has been submitted, since otherwise bypassing it would be a 5-second job ;)

.t

Comment by ThomasHansen — September 25, 2008

p01: as Thomas explained, JavaScript validation is worthless from a security standpoint.

I wonder why they decided to implement this as a web service in the first place instead of just releasing the code like HTMLPurifier. Seems like a needless potential failure point.

Comment by bander — September 25, 2008

Ah, it’s just a wrapper for html5lib. Sorry, should have RTFA.

Comment by bander — September 25, 2008

Thomas, bander: In the case of the feeds preview for Opera 9.6, I don’t have a proxy server to sanitize the content. Would you mind spending a few seconds looking at my code?

Comment by p01 — September 26, 2008
