A Real Example: Matching CSV Records

Here's a real example from a technical support case I once handled. The customer was trying to find lines in a comma-delimited text file where the 12th item on a line started with a P. He was using the innocently-looking regexp ^(.*?,){11}P.
At first sight, this regex looks like it should do the job just fine. The lazy dot and comma match a single comma-delimited field, and the {11} skips the first 11 fields. Finally, the P checks if the 12th field indeed starts with P. In fact, this is exactly what will happen when the 12th field indeed starts with a P.
The problem rears its ugly head when the 12th field does not start with a P. Let's say the string is1,2,3,4,5,6,7,8,9,10,11,12,13. At that point, the regex engine will backtrack. It will backtrack to the point where ^(.*?,){11} had consumed 1,2,3,4,5,6,7,8,9,10,11, giving up the last match of the comma. The next token is again the dot. The dot matches a comma. The dot matches the comma! However, the comma does not match the 1 in the 12th field, so the dot continues until the 11th iteration of .*?, has consumed 11,12,. You can already see the root of the problem: the part of the regex (the dot) matching the contents of the field also matches the delimiter (the comma). Because of the double repetition (star inside {11}), this leads to a catastrophic amount of backtracking.
The regex engine now checks whether the 13th field starts with a P. It does not. Since there is no comma after the 13th field, the regex engine can no longer match the 11th iteration of .*?,. But it does not give up there. It backtracks to the 10th iteration, expanding the match of the 10th iteration to 10,11,. Since there is still no P, the 10th iteration is expanded to 10,11,12,. Reaching the end of the string again, the same story starts with the 9th iteration, subsequently expanding it to 9,10,9,10,11,9,10,11,12,. But between each expansion, there are more possibilities to be tried. When the 9th iteration consumes 9,10,, the 10th could match just 11, as well as11,12,. Continuously failing, the engine backtracks to the 8th iteration, again trying all possible combinations for the 9th, 10th, and 11th iterations.
You get the idea: the possible number of combinations that the regex engine will try for each line where the 12th field does not start with a P is huge. All this would take a long time if you ran this regex on a large CSV file where most rows don't have a P at the start of the 12th field.

Post a Comment