Friday, May 30th, 2008
Remove Nested Patterns with One Line of JavaScript
Steven Levithan has been flagrant by creating a simple way to remove nested patterns with a while loop and a replace:
var str = "abc<1<2<>3>4>def";

while (str != (str = str.replace(/<[^<>]*>/g, "")));

// str -> "abcdef"
Notice that the regex in this one-liner doesn't try to deal with nested patterns at all. The while loop's condition replaces instances of <…> (where angle brackets are not allowed in the inner pattern) with an empty string. Each pass strips the innermost pairs, so the removal works from the inside out until the regex no longer matches. At that point, the result of the replacement is the same as the subject string, and the loop ends.
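To make the inside-out behaviour visible, here is a minimal sketch that records the string after every pass. The function name stripNestedWithTrace is made up for this illustration; the regex and the sample string are the ones from the one-liner above.

```javascript
// Hypothetical helper (not from the original post): same loop as the
// one-liner, but it keeps a copy of the string after each pass.
function stripNestedWithTrace(str) {
  var steps = [str];
  var prev;
  do {
    prev = str;
    // Removes only the pairs that contain no other angle brackets.
    str = str.replace(/<[^<>]*>/g, "");
    if (str !== prev) steps.push(str);
  } while (str !== prev);
  return steps;
}

var steps = stripNestedWithTrace("abc<1<2<>3>4>def");
// steps -> ["abc<1<2<>3>4>def", "abc<1<23>4>def", "abc<14>def", "abcdef"]
```

Each pass removes one level of nesting, so the number of passes grows with the nesting depth rather than with the number of pairs.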
You can use a similar approach to grab nested patterns rather than delete them, as shown below.
var str = "abc(d(e())f)(gh)ijk()",
    re = /\([^()]*\)/,
    output = [],
    match, parts, last;

while (match = re.exec(str)) {
    parts = match[0].split("\uFFFF");
    if (parts.length < 2)
        last = output.push(match[0]) - 1;
    else
        output[last] = parts[0] + output[last] + parts[1];
    str = str.replace(re, "\uFFFF");
}

// output -> ["(d(e())f)", "(gh)", "()"]

This is actually quite useful. It could be modified slightly to make a trim() function, or ltrim() or rtrim(). Cool.
Here's a function based on this trick that lets you specify the delimiters.
Problem: I’m not a regex hero, so while it works with braces and angle brackets, it fails on parentheses. So how would it be fixed to handle things like that which collide with the regex interpretation? Is there an easy way to escape the delimiters?
//Function: killTags
// Removes nested tags
//
// Parameters:
// ld - left delimiter (for example: )
// str - string to be stripped
//
// Returns:
// str - stripped string
function killTags(ld, rd, str) {
    // Strip innermost ld...rd pairs repeatedly until nothing changes.
    while (str != (str = str.replace(new RegExp(ld + "[^" + ld + rd + "]*" + rd, "g"), "")));
    return str;
}
Heh. I don’t think the “code” tag works quite right here at ajaxian.
// Parameters:
// ld – left delimiter
// rd – right delimiter
// str – string to be stripped
@Nosredna, I’m not sure what the killTags function would buy you vs. a simple `str = str.replace(/<[^>]*>/g, "")`. Although the output would be different in edge cases, it wouldn’t really be any better or worse. For the most part, both would just remove all tags, and leave content within them alone.
However, this trick could indeed be used for some nested HTML handling. Let’s say you wanted to remove all div tags and their contents, accounting for nested and self-closed divs. (And for some reason you didn’t want to use the DOM to help.)
while (str != (str = str.replace(/<div\b[^>]*?\/>|<div\b[^>]*>(?:(?!<div\b[^>]*>|<\/div>)[\S\s])*<\/div>/gi, "")));
Done.
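As a quick check of the div-stripping loop, here is a small sketch that wraps it in a function and runs it on a made-up fragment. The name removeDivs and the sample markup are assumptions for illustration only; the regex is the one from the comment above.

```javascript
// Hypothetical wrapper around the loop above.
function removeDivs(html) {
  // First alternative: self-closed <div ... />.
  // Second alternative: a <div ...>...</div> pair with no div tags inside.
  var re = /<div\b[^>]*?\/>|<div\b[^>]*>(?:(?!<div\b[^>]*>|<\/div>)[\S\s])*<\/div>/gi;
  while (html != (html = html.replace(re, "")));
  return html;
}

var result = removeDivs('a<div id="x">b<div>c</div>d<div/>e</div>f');
// Pass 1 removes the inner pair and the self-closed div:
//   'a<div id="x">bde</div>f'
// Pass 2 removes the now-innermost outer pair.
// result -> "af"
```

This is string manipulation only, so it shares the usual limits of regex-based HTML handling, such as div-like text inside attribute values or comments.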
> how would it be fixed to handle things like that which collide with the
> regex interpretation? Is there an easy way to escape the delimiters?
Here you go:
function escapeRegExp(str) {
    return str.replace(/[-[\]{}()*+?.\\^$|,]/g, "\\$&");
}
My XRegExp library provides something like this as XRegExp.escape, except that it also escapes “#” and whitespace because of its support for free-spacing/extended mode.
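Putting the two comments together, a delimiter-aware version might look like the sketch below. The name killTagsEscaped is made up here, and the sketch assumes each delimiter is a single character, because the delimiters are also placed inside a character class.

```javascript
// Escapes regex metacharacters (same function as in the comment above).
function escapeRegExp(str) {
  return str.replace(/[-[\]{}()*+?.\\^$|,]/g, "\\$&");
}

// Hypothetical variant of killTags that escapes its delimiters first.
// Assumption: ld and rd are single characters.
function killTagsEscaped(ld, rd, str) {
  var l = escapeRegExp(ld);
  var r = escapeRegExp(rd);
  // For "(" and ")" this builds /\([^\(\)]*\)/g.
  var re = new RegExp(l + "[^" + l + r + "]*" + r, "g");
  while (str != (str = str.replace(re, "")));
  return str;
}

var a = killTagsEscaped("(", ")", "abc(d(e())f)(gh)ijk()");
// a -> "abcijk"
var b = killTagsEscaped("<", ">", "abc<1<2<>3>4>def");
// b -> "abcdef"
```

Building the regex once outside the loop also avoids recompiling it on every pass.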
maybe I’m missing something, but isn’t this common knowledge?
Hope you’ll excuse a newbish question, but could someone shed a little light on the why and where of this.
I’ve not heard of them before and a google for ‘nested patterns’ tends to bring me back here.
jentulman, if you try to sanitize some third-party HTML code and remove all <script> tags, you may find that someone feeds your system <scri<script>pt>, where you would get rid of one of them but not both. I suppose this fix addresses this.
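The difference between a single pass and the loop can be seen in a small sketch. The input string and variable names are made up for illustration, and this is only a demonstration of the point above, not a real HTML sanitizer.

```javascript
// Made-up input: literal <script> tags hidden inside split-up tags.
var input = "<scri<script>pt>alert(1)</scri</script>pt>";
var tagRe = /<\/?script>/gi;

// One pass removes the inner tags, and the leftovers join into new tags.
var singlePass = input.replace(tagRe, "");
// singlePass -> "<script>alert(1)</script>"

// Repeating until nothing changes removes the reassembled tags too.
var looped = input;
while (looped != (looped = looped.replace(tagRe, "")));
// looped -> "alert(1)"
```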
I like your concept a lot, as it seems to perform better than my similar solution.
One bug though.
The result of “(1(ab)2(cd)3)” must be the input string itself. But your second pattern yields [(ab), 1(cd)2].
Thanks for sharing.
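One possible way to handle the sibling case reported above is to splice in every pending group instead of only the last one. The sketch below is an assumption about how the post's second snippet could be adapted, not the author's fix; the function name grabNested is made up, and like the original it assumes the input never contains the placeholder character \uFFFF.

```javascript
// Hypothetical rework of the "grab nested patterns" snippet.
function grabNested(str) {
  var re = /\([^()]*\)/;
  var output = [];
  var match, parts, inner, rebuilt, k, i;

  while ((match = re.exec(str))) {
    parts = match[0].split("\uFFFF");
    k = parts.length - 1; // number of placeholders inside this match

    // The k most recently stored groups are the ones this match encloses,
    // in left-to-right order. With k = 0 nothing is removed.
    inner = output.splice(output.length - k, k);

    rebuilt = parts[0];
    for (i = 0; i < k; i++) {
      rebuilt += inner[i] + parts[i + 1];
    }
    output.push(rebuilt);

    // Swap the matched group for a single placeholder.
    str = str.slice(0, match.index) + "\uFFFF" +
          str.slice(match.index + match[0].length);
  }
  return output;
}

var a = grabNested("(1(ab)2(cd)3)");
// a -> ["(1(ab)2(cd)3)"]
var b = grabNested("abc(d(e())f)(gh)ijk()");
// b -> ["(d(e())f)", "(gh)", "()"]
```

Because the regex always picks the leftmost innermost group, the stored groups stay in the same order as the placeholders in the string, which is why taking the last k entries is enough.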