Mixing Unicode and 8-bit Character Codes

Internally, computers deal with numbers, not with characters. When you save a text file, each character is mapped to a number, and the numbers are stored on disk. When you open a text file, the numbers are read and mapped back to characters. When processing text with a regular expression, the regular expression needs to use the same mapping as you used to create the file or string you want the regex to process.
When you simply type in all the characters in your regular expression, you normally don't have anything to worry about. The application or programming library that provides the regular expression functionality will know which text encoding your subject string uses, and process it accordingly. So if you want to search for the euro currency symbol, and you have a European keyboard, just press AltGr+E. Your regex € will find all euro symbols just fine.
But you can't press AltGr+E on a US keyboard. Or perhaps you like your source code to be 7-bit clean (i.e. plain ASCII). In those cases, you'll need to use a character escape in your regular expression.
If your regular expression engine supports Unicode, simply use the Unicode escape \u20AC (most Unicode flavors) or \x{20AC} (Perl and PCRE). U+20AC is the Unicode code point for the euro symbol. It will always match the euro symbol, whether your subject string is encoded in UTF-8, UTF-16, UCS-2 or whatever. Even when your subject string is encoded with a legacy 8-bit code page, there's no confusion. You may need to tell the application or regex engine what encoding your file uses. But \u20AC is always the euro symbol.
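As a minimal sketch of this point, here is how it might look in Python, which is used purely for illustration. The sample text and the helper name count_euros are made up for this example. The regex itself stays plain ASCII, and the only encoding-specific step is telling Python how to decode the subject bytes:

```python
import re

# The pattern contains only ASCII characters; Python's re module
# resolves \u20AC to the euro code point U+20AC.
EURO = re.compile(r"\u20AC")

def count_euros(raw: bytes, encoding: str) -> int:
    """Decode raw bytes with the stated encoding, then count euro symbols."""
    text = raw.decode(encoding)
    return len(EURO.findall(text))

# Hypothetical sample text containing two euro symbols.
sample = "Price: 10 \u20ac or 12 \u20ac"

for enc in ("utf-8", "utf-16", "cp1252"):
    # The same regex is applied regardless of how the bytes were stored.
    print(enc, count_euros(sample.encode(enc), enc))
```

Each encoding stores the euro symbol as different bytes, yet the count is the same in all three cases because the match happens on decoded characters, not on raw bytes.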
Most Unicode regex engines also support the 8-bit character escape \xFF. However, its use is not recommended. For characters \x00 through \x7F, there's usually no trouble. The first 128 Unicode code points are identical to the ASCII table that most 8-bit code pages are based on.
But the interpretation of \x80 and above may vary. A pure Unicode engine will treat this identically to \u0080, which represents a Latin-1 control code. But what most people expect is that \x80 matches the euro symbol, as that occupies position 80h in most Windows code pages, including Windows 1252. And it will when using an 8-bit regex engine if your text file is encoded using such a code page.
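The difference can be sketched in Python, used here only as one example of an engine that offers both modes. Its re module treats a str pattern as Unicode and a bytes pattern as 8-bit. The variable names are illustrative:

```python
import re

euro = "\u20ac"

# Unicode (str) pattern: \x80 means code point U+0080,
# so it does not match the euro symbol.
unicode_hit = re.search(r"\x80", euro)

# The same escape does match the U+0080 control character.
control_hit = re.search(r"\x80", "\u0080")

# 8-bit (bytes) pattern against Windows 1252 bytes:
# \x80 is byte 80h, which is the encoded euro symbol.
bytes_hit = re.search(rb"\x80", euro.encode("cp1252"))

print(unicode_hit is None, control_hit is not None, bytes_hit is not None)
```

Other engines may resolve \x80 differently, so this shows one flavor's behavior rather than a general rule.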
Since most people expect \x80 to be treated as an 8-bit character rather than the Unicode code point \u0080, some Unicode regex engines do exactly that. Some are hard-wired to use a particular code page, say Windows 1252 or your computer's default code page, to interpret 8-bit character codes.
Other engines will let it depend on the input string. Just Great Software applications treat \x80 as \u0080 when searching through a Unicode text file, but as \u20AC when searching through a Windows 1252 text file. There's no magic here. It matches the character with index 80h in the text file, regardless of the text file's encoding. Unicode code point U+0080 is a Latin-1 control code, while Windows 1252 character index 80h is the euro symbol. Conversely, if you type the euro symbol in a text editor, saving it as little-endian UTF-16 will save the two bytes AC 20, while saving as Windows 1252 will give you the single byte 80.
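The byte values mentioned above can be checked with a short Python sketch. The codec names "utf-16-le", "cp1252" and "latin-1" are Python's own names for these encodings, and the variable names are illustrative:

```python
euro = "\u20ac"

# Little-endian UTF-16 stores U+20AC low byte first: AC 20.
utf16_le = euro.encode("utf-16-le")

# Windows 1252 stores the euro symbol as the single byte 80.
cp1252 = euro.encode("cp1252")

print(utf16_le.hex(), cp1252.hex())

# Reading byte 80h back gives a different character per encoding:
# the euro symbol under Windows 1252, the control code U+0080 under Latin-1.
as_cp1252 = b"\x80".decode("cp1252")
as_latin1 = b"\x80".decode("latin-1")
print(as_cp1252 == "\u20ac", as_latin1 == "\u0080")
```

This is why an engine that matches "the character with index 80h" finds different characters depending on how the file was decoded.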
If you find the above confusing, simply don't use \x80 through \xFF with a regex engine that supports Unicode.
