Friday, May 30th, 2008

Remove Nested Patterns with One Line of JavaScript

Category: JavaScript, Tip

Steven Levithan has come up with a simple way to remove nested patterns using a while loop and a replace:

    var str = "abc<1<2<>3>4>def";

    while (str != (str = str.replace(/<[^<>]*>/g, "")));

    // str -> "abcdef"

Notice that the regex in this one-liner doesn’t try to deal with nested patterns at all. On each pass, the while loop’s condition replaces every instance of <…> whose contents include no angle brackets (that is, each innermost pair) with an empty string. This repeats from the inside out until the regex no longer matches. At that point, the result of the replacement is the same as the subject string, and the loop ends.
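To watch the inside-out behavior, here is a minimal sketch that records the string after each pass. The `removeNested` helper and its return shape are made up for illustration and are not part of Steven's post; it runs the same replace as the one-liner, only unrolled.

```javascript
// Hypothetical helper (not from the original post): the same loop as the
// one-liner, unrolled so that each intermediate pass can be inspected.
function removeNested(str, re) {
    var passes = [],
        prev;
    do {
        prev = str;
        str = str.replace(re, "");
        // Record the string only when this pass actually changed it.
        if (str !== prev) passes.push(str);
    } while (str !== prev);
    return { result: str, passes: passes };
}

var trace = removeNested("abc<1<2<>3>4>def", /<[^<>]*>/g);
// trace.passes -> ["abc<1<23>4>def", "abc<14>def", "abcdef"]
// trace.result -> "abcdef"
```

Each pass strips only the pairs that contain no other brackets, so three passes are needed for three levels of nesting.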

You can use a similar approach to grab nested patterns rather than delete them, as shown below.

    var str = "abc(d(e())f)(gh)ijk()",
        re = /\([^()]*\)/,
        output = [],
        match, parts, last;

    while (match = re.exec(str)) {
        parts = match[0].split("\uFFFF");
        if (parts.length < 2)
            last = output.push(match[0]) - 1;
        else
            output[last] = parts[0] + output[last] + parts[1];
        str = str.replace(re, "\uFFFF");
    }

    // output -> ["(d(e())f)", "(gh)", "()"]

8 Comments »

This is actually quite useful. It could be modified slightly to make a trim() function, or ltrim() or rtrim(). Cool.

Comment by starkraving — May 30, 2008

A function based on this trick that lets you specify the delimiters.

Problem: I’m not a regex hero, so while it works with braces and angle brackets, it fails on parentheses. So how would it be fixed to handle things like that which collide with the regex interpretation? Is there an easy way to escape the delimiters?


//Function: killTags
// Removes nested tags
//
// Parameters:
// ld - left delimiter (for example: )
// str - string to be stripped
//
// Returns:
// str - stripped string
function killTags(ld, rd, str) {
    while (str != (str = str.replace(new RegExp(ld + "[^" + ld + rd + "]*" + rd, "g"), "")));
    return str;
}

Comment by Nosredna — May 30, 2008

Heh. I don’t think the “code” tag works quite right here at ajaxian.

// Parameters:
// ld – left delimiter
// rd – right delimiter
// str – string to be stripped

Comment by Nosredna — May 30, 2008

@Nosredna, I’m not sure what the killTags function would buy you vs. a simple `str = str.replace(/<[^>]*>/g, "")`. Although the output would be different in edge cases, it wouldn’t really be any better or worse. For the most part, both would just remove all tags, and leave content within them alone.

However, this trick could indeed be used for some nested HTML handling. Let’s say you wanted to remove all div tags and their contents, accounting for nested and self-closed divs. (And for some reason you didn’t want to use the DOM to help.)

while (str != (str = str.replace(/<div\b[^>]*?\/>|<div\b[^>]*>(?:(?!<div\b[^>]*>|<\/div>)[\S\s])*<\/div>/gi, "")));

Done.
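As a quick sanity check, the loop above can be wrapped in a function and run on a couple of inputs. The `stripDivs` name and the sample strings are invented for this sketch; only the regex comes from the comment.

```javascript
// The regex is the one from the comment; everything else here is made up
// for illustration.
var divRe = /<div\b[^>]*?\/>|<div\b[^>]*>(?:(?!<div\b[^>]*>|<\/div>)[\S\s])*<\/div>/gi;

function stripDivs(str) {
    // Each pass removes self-closed divs and divs that contain no other div.
    while (str != (str = str.replace(divRe, "")));
    return str;
}

var nested = stripDivs("a<div>b<div>c</div>d</div>e");
// nested -> "ae"
var selfClosed = stripDivs("x<div class='a'/>y");
// selfClosed -> "xy"
```

The first pass removes only the inner div, because the lookahead refuses to step over another opening tag. The second pass then removes what has become an innermost div.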

> how would it be fixed to handle things like that which collide with the
> regex interpretation? Is there an easy way to escape the delimiters?

Here you go:

function escapeRegExp (str) {
    return str.replace(/[-[\]{}()*+?.\\^$|,]/g, "\\$&");
}

My XRegExp library provides something like this as XRegExp.escape, except that it also escapes “#” and whitespace because of its support for free-spacing/extended mode.

Comment by Steven Levithan — May 30, 2008
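Putting the two comments together, a killTags that also copes with parentheses might look like the sketch below. It assumes each delimiter is a single character, since both end up inside a character class. It also assumes the intended call was `new RegExp(pattern, "g")` with an empty replacement string.

```javascript
// escapeRegExp as given in the comment above.
function escapeRegExp(str) {
    return str.replace(/[-[\]{}()*+?.\\^$|,]/g, "\\$&");
}

// Sketch only: assumes ld and rd are single characters.
function killTags(ld, rd, str) {
    var l = escapeRegExp(ld),
        r = escapeRegExp(rd),
        // Matches an innermost ld...rd pair, e.g. \([^\(\)]*\) for parens.
        re = new RegExp(l + "[^" + l + r + "]*" + r, "g");
    while (str != (str = str.replace(re, "")));
    return str;
}

var parens = killTags("(", ")", "a(b(c)d)e");
// parens -> "ae"
var angles = killTags("<", ">", "abc<1<2<>3>4>def");
// angles -> "abcdef"
```

Building the regex once, outside the loop, also avoids recompiling it on every pass.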

maybe I’m missing something, but isn’t this common knowledge?

Comment by bluesmoon — June 2, 2008

Hope you’ll excuse a newbish question, but could someone shed a little light on the why and where of this?
I’ve not heard of nested patterns before, and a Google search for ‘nested patterns’ tends to bring me back here.

Comment by jentulman — June 2, 2008

jentulman, if you try to sanitize some third-party HTML code and remove all <script> tags, you may find that someone feeds your system <scri<script>pt>, where you would get rid of one of them but not both. I suppose this fix addresses this.

Comment by icoloma — June 3, 2008
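The case icoloma describes is easy to reproduce. In this sketch the input string and the tag-matching regex are made up for the demonstration, and it only shows single-pass versus repeated replacement. It is not meant as a complete HTML sanitizer.

```javascript
// Made-up input: a script tag hidden inside a split-up script tag.
var input = "<scri<script>pt>alert(1)</scri</script>pt>";
var tagRe = /<\/?script>/gi;

// A single pass removes the inner tags and leaves a working pair behind.
var once = input.replace(tagRe, "");
// once -> "<script>alert(1)</script>"

// Repeating until nothing changes removes the reassembled tags as well.
var looped = input;
while (looped != (looped = looped.replace(tagRe, "")));
// looped -> "alert(1)"
```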

I like your concept a lot, as it seems to perform better than my similar solution.

One bug though.

The result of “(1(ab)2(cd)3)” must be the input string itself. But your second pattern yields [(ab), 1(cd)2].

Thanks for sharing.

Comment by Stefan — June 4, 2008
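The second snippet trips over Stefan's input because `last` remembers only one earlier group, so a group with two sibling groups inside it cannot be rebuilt. One possible workaround, sketched here, is to give every placeholder an index. This is a variation on the post's code, not Steven's own fix; the `grabNested` name is invented, and it assumes the input never contains "\uFFFF".

```javascript
// Sketch of an indexed-placeholder variant (assumes no "\uFFFF" in the input).
function grabNested(str) {
    var re = /\([^()]*\)/,
        token = /\uFFFF(\d+)\uFFFF/g,
        stored = [],
        output = [],
        match, full;

    while (match = re.exec(str)) {
        // Expand any placeholders inside the match so stored text is complete.
        full = match[0].replace(token, function (m, i) {
            return stored[i];
        });
        // Swap the innermost group for a placeholder carrying its index.
        str = str.replace(re, "\uFFFF" + (stored.push(full) - 1) + "\uFFFF");
    }

    // Placeholders still left in str mark the outermost groups.
    str.replace(token, function (m, i) {
        output.push(stored[i]);
    });
    return output;
}

var siblings = grabNested("(1(ab)2(cd)3)");
// siblings -> ["(1(ab)2(cd)3)"]
var original = grabNested("abc(d(e())f)(gh)ijk()");
// original -> ["(d(e())f)", "(gh)", "()"]
```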
