Tuesday, September 2nd, 2008
toStaticHTML: Sanitize your HTML in IE 8
The IE 8 beta has a new method, toStaticHTML, that sanitizes HTML strings by removing dynamic HTML elements and attributes from an HTML fragment. The example they give is:
<script type="text/javascript">
function sanitize()
{
    var szInput = myDiv.innerHTML;
    var szStaticHTML = toStaticHTML(szInput);
    ResultComment = "\ntoStaticHTML sanitized the HTML fragment as follows:\n"
        + "Original Content:\n" + szInput + "\n"
        + "Static Content:\n" + szStaticHTML + "\n";
    myDiv.innerText = ResultComment;
}
</script>

<body onload="sanitize()">
    <div id="myDiv">
        <script>function test() { alert("Testing, Testing, 123..."); }</script>
        <span onclick="test()">Click Me</span>
    </div>
</body>
Once sanitized, this becomes just:
<span>Click Me</span>
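Since toStaticHTML only exists in the IE 8 beta, a page that wants to use it would probably guard the call. The following is a minimal sketch, not part of the original example: sanitizeIfSupported is a hypothetical helper name, and returning null when the method is missing is just one possible choice.

```javascript
// Hypothetical helper (not from the original example): call toStaticHTML
// only when the host environment provides it, otherwise return null so the
// caller can decide what to do with the unsanitized markup.
function sanitizeIfSupported(html) {
  // typeof on an undeclared identifier is safe and does not throw.
  // Comparing against "undefined" (rather than "function") avoids relying on
  // how a particular browser reports typeof for its built-in methods.
  if (typeof toStaticHTML !== "undefined") {
    return toStaticHTML(html);
  }
  return null;
}
```

A caller could then fall back to showing the content as plain text whenever the helper returns null.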
Can we get this function for text inputs?
Anyone have a solid JavaScript version of this?
This regular expression should do something similar:
<script[\s\S]+?|(?<=]+)\son\w+=(['"])[\s\S]+?\1
I just knocked it up and ran a quick test on the example above though, so possibly not production quality :)
Oops. To be clear: use the regex to replace() with empty string.
Double oops. Should have HTML encoded before posting:
<script[\s\S]+?</script>|(?<=<[^>]+)\son\w+=(['"])[\s\S]+?\1
Triple oops. That’s only going to work in .NET or some other language that supports look-behinds. I’ll get my coat.
OK, so I’m really sorry for spamming this article. Last try – I hope this helps someone. This should work:
if (typeof toStaticHTML == "undefined") {
    toStaticHTML = function(inputHtml) {
        return inputHtml.replace(/<script[\s\S]+?<\/script>|(<[^>]+)\son\w+=(['"])[\s\S]+?\2/gi, "$1");
    };
}
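To see what this fallback does with the markup from the article, here is a small sketch that runs the same regex outside a browser. The regexSanitize wrapper name and the input string (the example's div contents with whitespace removed) are assumptions made for illustration.

```javascript
// The regex from the comment above, wrapped in a hypothetical helper so it
// can be tried without touching the global toStaticHTML name.
function regexSanitize(inputHtml) {
  return inputHtml.replace(
    /<script[\s\S]+?<\/script>|(<[^>]+)\son\w+=(['"])[\s\S]+?\2/gi,
    "$1"
  );
}

// The contents of myDiv from the article, joined without whitespace.
var input =
  '<script>function test() { alert("Testing, Testing, 123..."); }</script>' +
  '<span onclick="test()">Click Me</span>';

// The script element matches the first alternative and is replaced by an
// empty $1; the onclick attribute matches the second and leaves "<span".
console.log(regexSanitize(input)); // <span>Click Me</span>
```

On this one input the result matches what the article shows for toStaticHTML, which says nothing about how it behaves on other markup.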
I wonder if this will also clean up some of the terrible markup generated by Microsoft Office?
I recently had a bash at writing an HTML sanitizer/cleanup in JavaScript, using regexes to check against a whitelist of allowed tags and attributes.
The result works OK, but it was a bit slow and could probably do with some optimization love. I must get around to sharing the code online soon, as it totally pwnd Office’s dodgy HTML and I’m sure others would find it useful too.
Clearly MS has not learnt any lessons from past endeavours, and is still creating needless, proprietary extensions to its browser.
All at a time when all their focus should be on catching up with the rest of the market’s support for existing web standards.
Practical as the single function may be, it just helps to further delay the demise of IE6.
@Morgan – seriously? Not that it’s a perfect solution, but HTML sanitization is a real problem. As Jerome has just clearly demonstrated, HTML sanitization efforts are often done poorly (sorry Jerome, but that’s a very simplistic solution). There are so many crazy ways for people to get script code into HTML and cause XSS attacks. This is because of how loose the browser can be in allowing scripts in. Who better to sanitize than a browser?!
@genericallyloud – I’d actually say that the browser is often the worst candidate for sanitization. i.e. if you’re sending back data to the server and expecting it to have been sanitized by the browser you’re in for a world of hurt.
Fair comment about the simplistic regex though :) I’d be interested to hear about ways to work around it.
e.g.
- the regex currently expects the onXXX value to be surrounded by [double] quotes
- need to match and remove: href="javascript:xxx"
- need to match and remove: behavior: xxx in a <style> tag
- other things…?
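The gaps listed above can be checked directly. This is a rough sketch using the regex posted earlier in the thread; the regexSanitize wrapper name and the three sample strings (including the evil.htc file name) are made up for illustration, one per bullet.

```javascript
// The regex from earlier in the thread, wrapped in a hypothetical helper.
function regexSanitize(inputHtml) {
  return inputHtml.replace(
    /<script[\s\S]+?<\/script>|(<[^>]+)\son\w+=(['"])[\s\S]+?\2/gi,
    "$1"
  );
}

// Illustrative inputs, one per weakness in the list.
var bypasses = [
  '<span onclick=test()>Click Me</span>',           // unquoted handler value
  '<a href="javascript:test()">Click Me</a>',       // javascript: URL
  '<style>div { behavior: url(evil.htc); }</style>' // behavior in a style block
];

// Collect the inputs that the regex leaves completely untouched.
var untouched = bypasses.filter(function (html) {
  return regexSanitize(html) === html;
});

console.log(untouched.length + " of " + bypasses.length + " inputs passed through unchanged");
```

All three strings come back exactly as they went in, so none of the active content is removed.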
This makes for interesting reading: http://refactormycode.com/codes/333-sanitize-html
Why on earth is this a global method? Why not make it an instance method (or even a static method) on String? It’d be a proprietary augmentation of String, yes, but is that any worse than encroaching on window?
It is important to know that it is extremely difficult (if not nearly impossible) to sanitize all scripts out of a string using regexes (especially with those as simple as posted here).
The only way to properly do this would be to parse and tokenize the HTML. You can’t generalize that “on___” means javascript; you need to know the context in which the “on___” is found. Take the following example regex provided by Jerome above:
/<script[\s\S]+?<\/script>|(<[^>]+)\son\w+=(['"])[\s\S]+?\2/gi
Apply that regex to the following 2 sample snippets:
<input type="text" value=" only='text' " />
<a href="javascript:alert('ok');">test</a>
The first should not be affected by sanitization, but will be due to the value of the “value” attribute. The second one makes it through sanitization when it obviously shouldn’t (though I am unaware if this IE8 function will remove javascript:-prefixed URIs?).
If you attempt to hack HTML sanitization with regexes, you are going to forget to plug at least one hole and someone is bound to find and exploit it. At one time or another every developer seems to get it in their head that they “know what they’re doing” with regards to sanitizing HTML of malicious tags. So far I haven’t seen a single, simple regex solution that hasn’t been an utterly insecure hack.
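The two snippets can be run through the regex to confirm both failure modes. This is a small sketch; the regexSanitize wrapper and the variable names are hypothetical, while the regex and the snippets are the ones quoted in this comment.

```javascript
// Jerome's regex, wrapped in a hypothetical helper.
function regexSanitize(inputHtml) {
  return inputHtml.replace(
    /<script[\s\S]+?<\/script>|(<[^>]+)\son\w+=(['"])[\s\S]+?\2/gi,
    "$1"
  );
}

var harmless = '<input type="text" value=" only=\'text\' " />';
var harmful = '<a href="javascript:alert(\'ok\');">test</a>';

// False positive: " only='text'" inside the attribute value looks like an
// on* handler to the regex, so part of the value is deleted.
console.log(regexSanitize(harmless)); // <input type="text" value=" " />

// False negative: nothing in the regex looks at javascript: URLs.
console.log(regexSanitize(harmful) === harmful); // true
```

The harmless input is altered and the harmful one is not, which is the opposite of what a sanitizer should do.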
Sigh, my first post and already hit by escaping issues.
s/</&lt;/g, nate. The second snippet (which turned into a link labeled “test”) was supposed to read:
<a href="javascript:alert('ok');">test</a>
Yeah, I’m convinced now that proper parsing and whitelisting of tags and their attributes is the only way you stand a chance.
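As a rough illustration of the whitelisting half of that idea, the sketch below filters a tree that some HTML parser is assumed to have produced already. The parser itself is not shown, and the node shape, the ALLOWED table and the filterNode name are all hypothetical; a real whitelist would need far more thought than this.

```javascript
// Assumed whitelist: tag name -> attribute names that may be kept.
var ALLOWED = {
  span: [],
  b: [],
  a: ["href"]
};

// Hypothetical node shape, as produced by some HTML parser (not shown),
// with tag and attribute names already lowercased:
//   text node:    { text: "..." }
//   element node: { tag: "a", attrs: { href: "..." }, children: [...] }
function filterNode(node) {
  if (typeof node.text === "string") {
    return { text: node.text };
  }
  var tag = String(node.tag).toLowerCase();
  if (!Object.prototype.hasOwnProperty.call(ALLOWED, tag)) {
    return null; // tags not on the list are dropped together with their subtree
  }
  var attrs = {};
  ALLOWED[tag].forEach(function (name) {
    var value = node.attrs && node.attrs[name];
    if (typeof value !== "string") return;
    // Keep href only for http(s) URLs; javascript: and anything else is dropped.
    if (name === "href" && !/^https?:\/\//i.test(value)) return;
    attrs[name] = value;
  });
  var children = (node.children || [])
    .map(function (child) { return filterNode(child); })
    .filter(function (child) { return child !== null; });
  return { tag: tag, attrs: attrs, children: children };
}
```

Because the decision is made per parsed tag and attribute, text such as only='text' inside an attribute value is never mistaken for an event handler, and anything not explicitly listed is removed by default.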
Why would I need this?
Because maybe you have a site like Ajaxian where people can type in comments like I’m doing. If you don’t sanitize input you might end up serving rogue HTML created by some attacker which does a lot of nasty things to the browsers of your visitors and their PCs.
@pmontrasio:
You are missing the point: having that function in JS is completely useless. User input should be sanitized on submission, on the server side, so everything that hits the DB is already sanitized and cleared for safe output.
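For a site that allows no markup in comments at all, one simple server-side option is to encode the text rather than try to clean it. The sketch below assumes the server code is written in JavaScript; escapeHtml is a hypothetical helper name, and it does not help when some HTML is meant to be kept.

```javascript
// Hypothetical helper: encode the characters that are significant in HTML so
// that submitted text is displayed literally instead of being interpreted.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;") // must run first so later entities are not re-encoded
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

console.log(escapeHtml('<span onclick="test()">Click Me</span>'));
// &lt;span onclick=&quot;test()&quot;&gt;Click Me&lt;/span&gt;
```

With this approach the browser shows the tags as text, so nothing the commenter typed is executed.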