Repeating a Capturing Group vs. Capturing a Repeated Group

When creating a regular expression that needs a capturing group to grab part of the text matched, a common mistake is to repeat the capturing group instead of capturing a repeated group. The difference is that the repeated capturing group will capture only the last iteration, while a group capturing another group that's repeated will capture all iterations. An example will make this clear.
Let's say you want to match a tag like !abc! or !123!. Only these two are possible, and you want to capture theabc or 123 to figure out which tag you got. That's easy enough: !(abc|123)! will do the trick.
Now let's say that the tag can contain multiple sequences of abc and 123, like !abc123! or !123abcabc!. The quick and easy solution is !(abc|123)+!. This regular expression will indeed match these tags. However, it no longer meets our requirement to capture the tag's label into the capturing group. When this regex matches!abc123!, the capturing group stores only 123. When it matches !123abcabc!, it only stores abc.
This is easy to understand if we look at how the regex engine applies !(abc|123)+! to !abc123!. First, !matches !. The engine then enters the capturing group. It makes note that capturing group #1 was entered when the engine reached the position between the first and second character in the subject string. The first token in the group is abc, which matches abc. A match is found, so the second alternative isn't tried. (The engine does store a backtracking position, but this won't be used in this example.) The engine now leaves the capturing group. It makes note that capturing group #1 was exited when the engine reached the position between the 4th and 5th characters in the string.
After having exited from the group, the engine notices the plus. The plus is greedy, so the group is tried again. The engine enters the group again, and takes note that capturing group #1 was entered between the 4th and 5th characters in the string. It also makes note that since the plus is not possessive, it may be backtracked. That is, if the group cannot be matched a second time, that's fine. In this backtracking note, the regex engine also saves the entrance and exit positions of the group during the previous iteration of the group.
abc fails to match 123, but 123 succeeds. The group is exited again. The exit position between characters 7 and 8 is stored.
The plus allows for another iteration, so the engine tries again. Backtracking info is stored, and the new entrance position for the group is saved. But now, both abc and 123 fail to match !. The group fails, and the engine backtracks. While backtracking, the engine restores the capturing positions for the group. Namely, the group was entered between characters 4 and 5, and existed between characters 7 and 8.
The engine proceeds with !, which matches !. An overall match is found. The overall match spans the whole subject string. The capturing group spaces characters 5, 6 and 7, or 123. Backtracking information is discarded when a match is found, so there's no way to tell after the fact that the group had a previous iteration that matchedabc. (The only exception to this is the .NET regex engine, which does preserve backtracking information for capturing groups after the match attempt.)
The solution to capturing abc123 in this example should be obvious now: the regex engine should enter and leave the group only once. This means that the plus should be inside the capturing group rather than outside. Since we do need to group the two alternatives, we'll need to place a second capturing group around the repeated group:!((abc|123)+)!. When this regex matches !abc123!, capturing group #1 will store abc123, and group #2 will store 123. Since we're not interested in the inner group's match, we can optimize this regular expression by making the inner group non-capturing: !((?:abc|123)+)!.

Post a Comment