Possessive Quantifiers and Atomic Grouping to The Rescue

In the above example, the sane thing to do is obviously to rewrite it as xx+y which eliminates the nested quantifiers entirely. Nested quantifiers are repeated or alternated tokens inside a group that is itself repeated or alternated. These almost always lead to catastrophic backtracking. About the only situation where they don't is when the start of each alternative inside the group is not optional, and mutually exclusive with the start of all the other alternatives, and mutually exclusive with the token that follows it (inside its alternative inside the group). E.g. (a+b+|c+d+)+y is safe. If anything fails, the regex engine will backtrack through the whole regex, but it will do so linearly. The reason is that all the tokens are mutually exclusive. None of them can match any characters matched by any of the others. So the match attempt at each backtracking position will fail, causing the regex engine to backtrack linearly. If you test this on aaaabbbbccccdddd, RegexBuddy needs only 13 steps rather than millions of steps to figure it out.
However, it's not always possible or easy to rewrite your regex to make everything mutually exclusive. So we need a way to tell the regex engine not to backtrack. When we've grabbed all the x's, there's no need to backtrack. There couldn't possibly be a y in anything matched by either x+. Using a possessive quantifier, our regex becomes(x+x+)++y. This fails the 21x string in merely 7 steps. That's 6 steps to match all the x's, and 1 step to figure out that y fails. Almost no backtracking is done. Using an atomic group, the regex becomes (?>(x+x+)+)y with the exact same results.

Post a Comment