Runaway Regular Expressions: Catastrophic Backtracking

Consider the regular expression (x+x+)+y. Before you scream in horror and say this contrived example should be written as xx+y to match exactly the same without those terribly nested quantifiers: just assume that each "x" represents something more complex, with certain strings being matched by both "x". See the section on HTML files below for a real example.
Let's see what happens when you apply this regex to xxxxxxxxxxy. The first x+ will match all 10 x characters. The second x+ fails. The first x+ then backtracks to 9 matches, and the second one picks up the remaining x. The group has now matched once. The group repeats, but fails at the first x+. Since one repetition was sufficient, the group matches. y matches y and an overall match is found. The regex is declared functional, the code is shipped to the customer, and his computer explodes. Almost.
The above regex turns ugly when the y is missing from the subject string. When y fails, the regex engine backtracks. The group has one iteration it can backtrack into. The second x+ matched only one x, so it can't backtrack. But the first x+ can give up one x. The second x+ promptly matches xx. The group again has one iteration, fails the next one, and the y fails. Backtracking again, the second x+ now has one backtracking position, reducing itself to match x. The group tries a second iteration. The first x+ matches but the second is stuck at the end of the string. Backtracking again, the first x+ in the group's first iteration reduces itself to 7 characters. The second x+ matches xxx. Failing y, the second x+ is reduced to xx and then x. Now, the group can match a second iteration, with one x for each x+. But this (7,1),(1,1) combination fails too. So it goes to (6,4) and then (6,2)(1,1) and then (6,1),(2,1) and then (6,1),(1,2) and then I think you start to get the drift.
If you try this regex on a 10x string in RegexBuddy's debugger, it'll take 2558 steps to figure out the final y is missing. For an 11x string, it needs 5118 steps. For 12, it takes 10238 steps. Clearly we have an exponential complexity of O(2^n) here. At 21x the debugger bows out at 2.8 million steps, diagnosing a bad case of catastrophic backtracking.
RegexBuddy is forgiving in that it detects it's going in circles, and aborts the match attempt. Other regex engines (like .NET) will keep going forever, while others will crash with a stack overflow (like Perl, before version 5.10). Stack overflows are particularly nasty on Windows, since they tend to make your application vanish without a trace or explanation. Be very careful if you run a web service that allows users to supply their own regular expressions. People with little regex experience have surprising skill at coming up with exponentially complex regular expressions.

Post a Comment