Tuesday, September 2nd, 2008

toStaticHTML: Sanitize your HTML in IE 8

Category: IE, JavaScript

The IE 8 beta has a new method, toStaticHTML, that sanitizes HTML strings by removing dynamic HTML elements and attributes (such as script blocks and inline event handlers) from an HTML fragment.

The example they give is:

  <script type="text/javascript">
  function sanitize()
  {
      var szInput = myDiv.innerHTML;
      var szStaticHTML = toStaticHTML(szInput);
      ResultComment = "\ntoStaticHTML sanitized the HTML fragment as follows:\n"
          + "Original Content:\n" + szInput + "\n"
          + "Static Content:\n" + szStaticHTML + "\n";
      myDiv.innerText = ResultComment;
  }
  </script>


  <body onload="sanitize()">
      <div id="myDiv">
      <script>function test() { alert("Testing, Testing, 123..."); }</script>
      <span onclick="test()">Click Me</span>
      </div>
  </body>

Once sanitized this becomes just:

  <span>Click Me</span>
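
Because toStaticHTML is new in IE 8 and other browsers do not define it, calling code would presumably need to feature-detect it first. The following is only a minimal sketch under that assumption: the helper names escapeHtml and sanitizeFragment are made up for illustration, and the fallback (escaping everything so the markup is displayed as text) is one conservative choice, not what IE itself does.

```javascript
// Hypothetical helper: escape the characters that are significant in HTML.
function escapeHtml(text) {
    return String(text)
        .replace(/&/g, "&amp;")
        .replace(/</g, "&lt;")
        .replace(/>/g, "&gt;")
        .replace(/"/g, "&quot;")
        .replace(/'/g, "&#39;");
}

// Hypothetical wrapper: use the native toStaticHTML where the browser
// provides it, otherwise escape the whole string so no markup survives.
function sanitizeFragment(html) {
    if (typeof toStaticHTML === "function") {
        return toStaticHTML(html);
    }
    return escapeHtml(html);
}
```

Where the native method is missing, the fallback loses all formatting, but it cannot let a script or event handler through.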

Posted by Dion Almaer at 7:51 am
19 Comments


Can we get this function for text inputs?

Comment by genericallyloud — September 2, 2008

Anyone have a solid JavaScript version of this?

Comment by Nosredna — September 2, 2008

This regular expression should do something similar:
  <script[\s\S]+?|(?<=]+)\son\w+=([‘”])[\s\S]+?\1
I just knocked it up and ran a quick test on the example above though, so possibly not production quality :)

Comment by Jerome — September 2, 2008

Oops. To be clear: use the regex to replace() with empty string.

Comment by Jerome — September 2, 2008

Double oops. Should have HTML encoded before posting:
  <script[\s\S]+?</script>|(?<=<[^>]+)\son\w+=(['"])[\s\S]+?\1

Comment by Jerome — September 2, 2008

Triple oops. That’s only going to work in .NET or some other language that supports look-behinds. I’ll get my coat.

Comment by Jerome — September 2, 2008

OK, so I’m really sorry for spamming this article. Last try – I hope this helps someone. This should work:
  if (typeof toStaticHTML == "undefined") {
      toStaticHTML = function(inputHtml) {
          return inputHtml.replace(/<script[\s\S]+?<\/script>|(<[^>]+)\son\w+=(['"])[\s\S]+?\2/gi, "$1");
      };
  }

Comment by Jerome — September 2, 2008

I wonder if this will also clean up some of the terrible mark up generated by Microsoft Office?

I recently had a bash at writing an HTML sanitizer/clean-up in JavaScript, using regexes to check against a white list of allowed tags and attributes.

The result works OK, but it was a bit slow and could probably do with some optimization love. I must get around to sharing the code online soon, as it totally pwnd Office’s dodgy HTML and I’m sure others would find it useful too.

Comment by Rumble — September 2, 2008

Clearly MS has not learnt any lessons from past endeavours, and is still creating needless, proprietary extensions to its browser.

All at a time when its focus should be on catching up with the rest of the market’s support for existing web standards.

Practical as the single function may be, it just helps to further delay the demise of IE6.

Comment by MorganRoderick — September 2, 2008

@Morgan – seriously? Not that it’s a perfect solution, but HTML sanitization is a real problem. As Jerome has just clearly demonstrated, HTML sanitization efforts are often done poorly (sorry Jerome, but that’s a very simplistic solution). There are so many crazy ways for people to get script code into HTML and cause XSS attacks. This is because of how loose the browser can be in allowing scripts in. Who better to sanitize than a browser?!

Comment by genericallyloud — September 2, 2008

@genericallyloud – I’d actually say that the browser is often the worst candidate for sanitization: if you’re sending data back to the server and expecting it to have been sanitized by the browser, you’re in for a world of hurt.

Fair comment about the simplistic regex though :) I’d be interested to hear about ways to work around it.

e.g.
– the regex currently expects the onXXX value to be surrounded by [double] quotes
– need to match and remove: href="javascript:xxx"
– need to match and remove: behavior: xxx in a <style> tag
– other things…?

Comment by jeromew — September 2, 2008

This makes for interesting reading: http://refactormycode.com/codes/333-sanitize-html

Comment by jeromew — September 2, 2008

Why on earth is this a global method? Why not make it an instance method (or even a static method) on String? It’d be a proprietary augmentation of String, yes, but is that any worse than encroaching on window?

Comment by savetheclocktower — September 2, 2008

It is important to know that it is extremely difficult (if not nearly impossible) to sanitize all scripts out of a string using regexes (especially with those as simple as posted here).

The only way to properly do this would be to parse and tokenize the HTML. You can’t generalize that “on___” means javascript; you need to know the context in which the “on___” is found. Take the following example regex provided by Jerome above:

/<script[\s\S]+?<\/script>|(<[^>]+)\son\w+=(['"])[\s\S]+?\2/gi

Apply that regex to the following 2 sample snippets:

<input type="text" value=" only='text' " />
<a href="javascript:alert('ok');">test</a>

The first should not be affected by sanitization, but will be, because of the contents of its “value” attribute. The second makes it through sanitization when it obviously shouldn’t (though I don’t know whether this IE8 function removes javascript:-prefixed URIs).
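
Both failures can be reproduced by running Jerome's fallback regex from earlier in the thread over the two snippets. This is a small sketch for illustration only; the function name regexSanitize and the variable names are invented here.

```javascript
// Jerome's fallback regex from earlier in the thread (straight quotes).
var pattern = /<script[\s\S]+?<\/script>|(<[^>]+)\son\w+=(['"])[\s\S]+?\2/gi;

// Hypothetical wrapper around the replace() call from that comment.
function regexSanitize(html) {
    return html.replace(pattern, "$1");
}

var harmless = '<input type="text" value=" only=\'text\' " />';
var dangerous = '<a href="javascript:alert(\'ok\');">test</a>';

var harmlessResult = regexSanitize(harmless);
var dangerousResult = regexSanitize(dangerous);

// False positive: " only='text'" inside the value looks like an event handler.
console.log(harmlessResult !== harmless);   // true, the value was altered

// False negative: nothing in the pattern looks at javascript: URIs.
console.log(dangerousResult === dangerous); // true, passed through unchanged
```

The regex cannot tell that only='text' sits inside a quoted attribute value, which is exactly the missing context described above.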

If you attempt to hack HTML sanitization with regexes, you are going to forget to plug at least one hole and someone is bound to find and exploit it. At one time or another every developer seems to get it in their head that they “know what they’re doing” with regards to sanitizing HTML of malicious tags. So far I haven’t seen a single, simple regex solution that hasn’t been an utterly insecure hack.

Comment by nate — September 3, 2008

Sigh, my first post and already hit by escaping issues. s/</&lt;/g, nate. The second snippet (which turned into a link labeled “test”) was supposed to read:

<a href="javascript:alert('ok');">test</a>

Comment by nate — September 3, 2008

Yeah, I’m convinced now that proper parsing and whitelisting of tags and their attributes is the only way you stand a chance.
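
A rough sketch of what whitelisting over parsed markup could look like follows. It assumes the HTML has already been turned into a tree by a real parser; the node shape ({ text } for text, { tag, attrs, children } for elements), the ALLOWED table and the function names are all hypothetical, chosen only to illustrate the idea rather than to serve as a complete sanitizer.

```javascript
// Hypothetical whitelist: allowed tags mapped to their allowed attributes.
var ALLOWED = {
    a: ["href", "title"],
    b: [],
    i: [],
    p: [],
    span: ["title"]
};

// Accept relative URLs and a few known schemes; reject everything else.
function isSafeUrl(url) {
    // Drop whitespace and control characters that could disguise a scheme.
    var compact = String(url).replace(/[\s\u0000-\u001f]/g, "").toLowerCase();
    var scheme = /^([a-z][a-z0-9+.-]*):/.exec(compact);
    return !scheme || scheme[1] === "http" || scheme[1] === "https" ||
        scheme[1] === "mailto";
}

// Returns an array of filtered nodes that replace the given node.
function filterNode(node) {
    if (typeof node.text === "string") {
        return [{ text: node.text }];
    }
    var tag = String(node.tag).toLowerCase();
    // Script and style elements are dropped together with their contents.
    if (tag === "script" || tag === "style") {
        return [];
    }
    var children = [];
    (node.children || []).forEach(function (child) {
        children = children.concat(filterNode(child));
    });
    // Unknown tag: discard the element itself but keep its filtered children.
    if (!Object.prototype.hasOwnProperty.call(ALLOWED, tag)) {
        return children;
    }
    var attrs = {};
    Object.keys(node.attrs || {}).forEach(function (name) {
        var lower = name.toLowerCase();
        if (ALLOWED[tag].indexOf(lower) === -1) { return; }
        if (lower === "href" && !isSafeUrl(node.attrs[name])) { return; }
        attrs[lower] = node.attrs[name];
    });
    return [{ tag: tag, attrs: attrs, children: children }];
}

// The article's example, written out as a tree by hand.
var exampleTree = {
    tag: "div", attrs: { id: "myDiv" }, children: [
        { tag: "script", attrs: {}, children: [{ text: "function test() {}" }] },
        { tag: "span", attrs: { onclick: "test()" }, children: [{ text: "Click Me" }] }
    ]
};
var filtered = filterNode(exampleTree);
```

Because anything not on the list is dropped, a handler such as onclick never has to be recognized explicitly, and a new or obscure attribute is rejected by default.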

Comment by Jerome — September 3, 2008

Why would I need this?

Comment by wwwmarty — September 3, 2008

Because maybe you have a site like Ajaxian where people can type in comments like I’m doing. If you don’t sanitize input you might end up serving rogue HTML created by some attacker which does a lot of nasty things to the browsers of your visitors and their PCs.

Comment by pmontrasio — September 3, 2008

@pmontrasio:
you are missing the point; having that function in JS is completely useless. User input should be sanitized on the server side upon submission, so everything that hits the DB is already sanitized and cleared for safe output.

Comment by gonchuki — September 22, 2008
