Thursday, September 25th, 2008
HTML Whitelist: Sanitize your markup
HTML Whitelist is the latest "cool little Python Web service thrown up on App Engine" by my good colleague DeWitt Clinton. It does one thing, and it does it well: you pass the service HTML and it returns a sanitized version.
For example:
There are a bunch of options. You can pass in HTML directly or pass a URL to the content, use JSON or JSONP, and choose among different encoding options.
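To make the JSON versus JSONP distinction concrete, here is a minimal Python sketch of how a sanitizing service might wrap its output. It is not the service's actual response format: the function name `render_response`, the `"html"` field, and the callback rules are all assumptions made for illustration.

```python
import json
import re

# Conservative pattern for JSONP callback names: dotted JavaScript identifiers.
# (Assumption: a real service may accept a different set of names.)
_CALLBACK_RE = re.compile(r"[A-Za-z_$][\w$]*(?:\.[A-Za-z_$][\w$]*)*", re.ASCII)


def render_response(sanitized_html, callback=None):
    """Return (content_type, body) for a JSON or JSONP response.

    The {"html": ...} payload shape is hypothetical.
    """
    payload = json.dumps({"html": sanitized_html})
    if callback is None:
        # Plain JSON: the body is just the serialized object.
        return "application/json", payload
    if not _CALLBACK_RE.fullmatch(callback):
        # Reject anything that is not a plain identifier, so the callback
        # parameter cannot be used to inject arbitrary script.
        raise ValueError("invalid JSONP callback name")
    # JSONP: the same object, wrapped in a call to the caller's function.
    return "application/javascript", "%s(%s);" % (callback, payload)
```

JSONP exists so that a page on another domain can load the result with a script tag, which is why the callback name has to be checked as carefully as the markup itself.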

Since I often have to deal with third-party content, I wrote a whitelist-based HTML sanitizer in JS. You can see it in action, and check the source code, in the feeds preview of Opera 9.6.
In a nutshell, it has a whitelist of nodeNames, each with a whitelist of attributes and a handler function that validates the attribute's value and touches other attributes if needed.
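The structure described above (a whitelist of element names, each with its own whitelist of attributes and a validator per attribute) can be sketched roughly as follows. This is a Python illustration built on the standard library's `html.parser`, not the Opera code. The `WHITELIST` contents, `is_safe_url`, and `sanitize` are assumed names. The validators here only accept or reject a value and never modify other attributes, and the sketch is not hardened enough for production use.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urlparse


def is_safe_url(value):
    """Accept only http(s), mailto, and scheme-less (relative) URLs."""
    # Drop whitespace and control characters, which browsers ignore in URLs.
    cleaned = "".join(ch for ch in value if ch > " ")
    try:
        scheme = urlparse(cleaned).scheme.lower()
    except ValueError:
        return False
    return scheme in ("", "http", "https", "mailto")


def any_text(value):
    return True


# Hypothetical whitelist: element name -> {attribute name: validator}.
WHITELIST = {
    "p": {},
    "b": {},
    "i": {},
    "a": {"href": is_safe_url, "title": any_text},
    "img": {"src": is_safe_url, "alt": any_text},
}
VOID_ELEMENTS = {"img"}  # elements that take no closing tag


class WhitelistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        allowed = WHITELIST.get(tag)
        if allowed is None:
            return  # unknown element: drop the tag, keep its text
        kept = []
        for name, value in attrs:
            validator = allowed.get(name)
            if validator and value is not None and validator(value):
                kept.append(' %s="%s"' % (name, escape(value, quote=True)))
        self.out.append("<%s%s>" % (tag, "".join(kept)))

    def handle_endtag(self, tag):
        if tag in WHITELIST and tag not in VOID_ELEMENTS:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        # Text is always re-escaped, so it can never become markup.
        self.out.append(escape(data, quote=False))


def sanitize(html):
    parser = WhitelistSanitizer()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

With this whitelist, `sanitize('<p onclick="evil()">Hi</p>')` returns `<p>Hi</p>`: the element is allowed, but `onclick` is not among its permitted attributes. Tags outside the whitelist are removed while their text content is kept as escaped plain text.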
Hi, this is a hot topic with the people I’m talking to about user-generated content. One thing I wonder is whether some elements, such as -script-, should be removed entirely rather than left in the page, as it were. Something like strip_tags in PHP, but allowing a paragraph tag while stripping event handlers.
Great! This is exactly what I need for custom-markup shapes in my current project, http://blok.appspot.com/, because iframes are not an option as a security model (I need to receive JS events from the shapes, and App Engine does not yet support wildcard subdomains, which would enable ‘building’ cross-domain boundaries).
Great idea, though I hope everyone realizes they *must* do this on the server after the HTML has been submitted, since otherwise bypassing it would be a five-second job ;)
.t
p01: as Thomas explained, JavaScript validation is worthless from a security standpoint.
I wonder why they decided to implement this as a web service in the first place instead of just releasing the code like HTMLPurifier. Seems like a needless potential failure point.
Ah, it’s just a wrapper for html5lib. Sorry, should have RTFA.
Thomas, bander: In the case of the feeds preview for Opera 9.6, I don’t have a proxy server to sanitize the content. Would you mind spending a few seconds looking at my code?