Shorthand Character Classes

Since certain character classes are used often, a series of shorthand character classes are available. \d is short for [0-9]. In most flavors that support Unicode, \d includes all digits from all scripts. Notable exceptions are Java,JavaScript, and PCRE. These Unicode flavors match only ASCII digits with \d.
\w stands for "word character". It always matches the ASCII characters [A-Za-z0-9_]. Notice the inclusion of the underscore and digits. In most flavors that support Unicode, \w includes many characters from other scripts. There is a lot of inconsistency about which characters are actually included. Letters and digits from alphabetic scripts and ideographs are generally included. Connector punctuation other than the underscore and numeric symbols that aren't digits may or may not be included. XML Schema and XPath even include all symbols in \w. Again, Java,JavaScript, and PCRE match only ASCII characters with \w.
\s stands for "whitespace character". Again, which characters this actually includes, depends on the regex flavor. In all flavors discussed in this tutorial, it includes [ \t\r\n\f]. That is: \s matches a space, a tab, a line break, or a form feed. Most flavors also include the vertical tab, with Perl (prior to version 5.18) and PCRE (prior to version 8.34) being notable exceptions. In flavors that support Unicode, \s normally includes all characters from the Unicode "separator" category. Java and PCRE are exceptions once again. But JavaScript does match all Unicode whitespace with \s.
Shorthand character classes can be used both inside and outside the square brackets. \s\d matches a whitespace character followed by a digit. [\s\d] matches a single character that is either whitespace or a digit. When applied to 1 + 2 = 3, the former regex matches  2 (space two), while the latter matches 1 (one).[\da-fA-F] matches a hexadecimal digit, and is equivalent to [0-9a-fA-F] if your flavor only matches ASCII characters with \d.

Post a Comment