Regex Syntax versus String Syntax

Many programming languages support similar escapes for non-printable characters in their syntax for literal strings in source code. Then such escapes are translated by the compiler into their actual characters before the string is passed to the regex engine. If the regex engine does not support the same escapes, this can cause an apparent difference in behavior when a regex is specified as a literal string in source code compared with a regex that is read from a file or received from user input. For example, POSIX regular expressions do not support any of these escapes. But the C programming language does support escapes like \n and \x0A in string literals. So when developing an application in C using the POSIX library, \n is only interpreted as a newline when you add the regex as a string literal to your source code. Then the compiler interprets \n and the regex engine sees an actual newline character. If your code reads the same regex from a file, then the regex engine sees \n. Depending on the implementation, the POSIX library interprets this as a literal n or as an error. The actual POSIX standard states that the behavior of an "ordinary" character preceded by a backslash is "undefined".
A similar issue exists in Python 3.2 and prior with the Unicode escape \uFFFF. Python has supported this syntax as part of (Unicode) string literals ever since Unicode support was added to Python. But Python's re module only supports \uFFFF starting with Python 3.3. In Python 3.2 and earlier, \uFFFF works when you add your regex as a literal (Unicode) string to your Python code. But when your Python 3.2 script reads the regex from a file or user input, \uFFFF matches uFFFF literally as the regex engine sees \u as an escaped literal u.

Post a Comment