Regular Expressions Syntax
Literals
All characters are taken literally except the following:
".", "|", "*", "?", "+", "(", ")", "{", "}", "[", "]", "^", "$" and "\".
These characters have special meaning and must be preceded by a "\" to be taken literally.
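A minimal sketch of escaping, assuming Python's re module as a stand-in (it is not the engine documented here, but it treats the same characters as special). The sample text is made up for illustration.

```python
import re

# Hypothetical sample text.
text = "Total: $5.00 (approx.)"

# Unescaped, "$" is an anchor and "." is a wildcard, so the literal price is not found.
unescaped = re.search("$5.00", text)

# Each special character is preceded by "\" to be taken literally.
escaped = re.search(r"\$5\.00", text)

print(unescaped)
print(escaped.group(0))
```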
Wildcards
The dot "." matches any single character, including the new line symbols [CR] and [LF].
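The wildcard can be tried out with Python's re module, with one caveat: in Python the dot skips [LF] unless the re.DOTALL flag is set, so the flag below is an assumption used to approximate the behaviour described here.

```python
import re

text = "a\r\nb"

# Assumption: re.DOTALL makes "." match every character, new lines included.
with_newlines = re.fullmatch(r"a..b", text, flags=re.DOTALL)

# Without the flag, Python's "." matches [CR] but refuses [LF].
python_default = re.fullmatch(r"a..b", text)

print(with_newlines is not None, python_default is not None)
```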
Repeats
An expression followed by "*" can be repeated any number of times including zero.
An expression followed by "+" can be repeated any number of times excluding zero.
An expression followed by "?" can be repeated zero or one time.
The bounds "{" "}" may be used to specify the number of repetitions:
"{N}" means that the expression must be repeated N times,
"{N,M}" means that the expression must be repeated N to M times.
Subexpressions and parentheses
Parentheses "(" ")" are used to mark subexpressions, which are counted starting from 1 from left to right.
Subexpression zero is the whole match of the expression.
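The numbering can be illustrated as follows. The sketch assumes Python's re module, where group(0) is the whole match and group(1), group(2) are the marked subexpressions; the sample text is hypothetical.

```python
import re

# Two subexpressions, numbered 1 and 2 from left to right.
match = re.search(r"(\d+)-(\d+)", "range 10-20")

whole = match.group(0)    # subexpression zero: the whole match
first = match.group(1)    # first parenthesized subexpression
second = match.group(2)   # second parenthesized subexpression

print(whole, first, second)
```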
Alternatives
Alternative expressions are separated by "|" or put on separate lines in the expression.
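A short sketch of alternatives, assuming Python's re module. Python only understands the "|" form, so the multi-line form is emulated here by joining the lines with "|"; that join is an assumption about how the two forms relate, not a description of this engine's internals.

```python
import re

# Alternatives separated by "|".
piped = re.findall(r"cat|dog", "cat, dog, bird")

# Assumption: alternatives put on separate lines are equivalent to joining them with "|".
multi_line_expression = "cat\ndog"
joined = "|".join(multi_line_expression.splitlines())
from_lines = re.findall(joined, "cat, dog, bird")

print(piped, from_lines)
```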
Line anchors
The empty string at the beginning of a line is matched by the "^" character.
The empty string at the end of a line is matched by the "$" character.
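Line anchors can be sketched with Python's re module. The re.MULTILINE flag is an assumption needed in Python so that "^" and "$" apply to every line rather than only to the whole text.

```python
import re

text = "first line\nsecond line"

# "^" anchors at the beginning of each line.
line_starts = re.findall(r"^\w+", text, flags=re.MULTILINE)

# "$" anchors at the end of each line.
line_ends = re.findall(r"\w+$", text, flags=re.MULTILINE)

print(line_starts, line_ends)
```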
Text anchors
"\`" matches the start of the whole text.
"\A" matches the start of the whole text.
"\'" matches the end of the whole text.
"\z" matches the end of the whole text.
"\Z" matches the end of the whole text, or the position before any new line characters at the end.
Character sets
The character set enclosed in brackets "[" "]" matches any symbol it contains,
for example "[abc]" matches either "a", "b" or "c".
Sets that start with "^" match any character that is not a member of the set,
for example "[^abc]" matches any character except "a", "b" and "c".
Character ranges can be specified as "[a-d]", which matches any symbol between "a" and "d".
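The three forms of set can be sketched with Python's re module, which is assumed to treat plain sets, negated sets and ranges the same way. The sample strings are hypothetical.

```python
import re

# A plain set matches any one of the listed symbols.
members = re.findall(r"[abc]", "abcdef 123")

# A set starting with "^" matches any character that is not listed.
non_members = re.findall(r"[^abc]", "abcd")

# A range matches any symbol between its two endpoints.
in_range = re.findall(r"[a-d]+", "abcdef 123")

print(members, non_members, in_range)
```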
Character classes are denoted by "[:class:]" within a set declaration.
Commonly used character classes are:
| [:alnum:] | Alpha numeric character. |
| [:alpha:] | Alphabetical character a-z and A-Z. |
| [:blank:] | Blank character, either a space or a tab. |
| [:cntrl:] | Control character. |
| [:digit:] | Digit 0-9. |
| [:graph:] | Graphical character. |
| [:lower:] | Lower case character a-z. |
| [:print:] | Printable character. |
| [:punct:] | Punctuation character. |
| [:space:] | Whitespace character. |
| [:upper:] | Upper case character A-Z. |
| [:xdigit:] | Hexadecimal digit character, 0-9, a-f and A-F. |
| [:word:] | Word character - all alphanumeric characters plus the underscore. |
| [:Unicode:] | Character whose code is greater than 255; this applies to Unicode characters only. |
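Python's re module does not understand the "[:class:]" notation, so the sketch below rewrites a few classes into explicit ASCII ranges before matching. Both the POSIX_CLASSES table and the expand_posix_classes helper are hypothetical and cover only part of the list above; an unknown class name raises KeyError.

```python
import re

# Hypothetical ASCII-only expansions for some of the classes listed above.
POSIX_CLASSES = {
    "alnum": "a-zA-Z0-9",
    "alpha": "a-zA-Z",
    "blank": " \\t",
    "digit": "0-9",
    "lower": "a-z",
    "upper": "A-Z",
    "xdigit": "0-9a-fA-F",
    "word": "a-zA-Z0-9_",
}

def expand_posix_classes(pattern):
    """Replace each [:class:] token with an explicit range (hypothetical helper)."""
    return re.sub(r"\[:(\w+):\]", lambda m: POSIX_CLASSES[m.group(1)], pattern)

# A class is used within a set declaration, hence the double brackets.
expanded = expand_posix_classes("[[:alpha:]_][[:alnum:]_]*")
identifiers = re.findall(expanded, "x1 = _tmp + 42")

print(expanded, identifiers)
```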
Character codes
The characters may be matched by octal code "\0NNN" or hexadecimal code "\xHH",
enclosed in braces "{" "}" if necessary: "\0{NNN}" "\x{HH}".
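Matching by hexadecimal code can be sketched in Python, under stated assumptions: Python's re module accepts "\xHH" with exactly two hexadecimal digits, has no "\x{HH}" form, and handles octal escapes differently, so only the hexadecimal case is shown. "\uHHHH" is Python's own notation for larger codes and is not part of the syntax documented here.

```python
import re

# "\x41" is the hexadecimal code of "A".
letter_a = re.findall(r"\x41", "ABCA")

# Character codes also work as range endpoints: 0x30 to 0x39 are the digits 0-9.
digits = re.findall(r"[\x30-\x39]+", "a12b345")

# Assumption: "\u0301" stands in for "\x{0301}" (combining acute accent).
accented = re.search(r"a\u0301", "voila\u0301")

print(letter_a, digits, accented is not None)
```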
Word operators
"\<" matches the null string at the start of a word.
"\>" matches the null string at the end of a word.
"\b" matches the null string at either the start or the end of a word.
"\B" matches the null string within a word.
The beginning of the text is a potential start of the word and the end of the text is a potential end of the word.
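The word operators can be sketched with Python's re module, which supports "\b" and "\B" but not "\<" and "\>". The two lookaround patterns below are assumed emulations of "\<" and "\>", not this engine's own implementation.

```python
import re

text = "cat concat cats"

whole_word = re.findall(r"\bcat\b", text)   # "cat" as a complete word
word_start = re.findall(r"\bcat", text)     # "cat" at the start of a word
inside = re.findall(r"\Bcat", text)         # "cat" preceded by a word character

# Assumed emulations: "\<" as a boundary followed by a word character,
# "\>" as a boundary preceded by a word character.
starts = [m.start() for m in re.finditer(r"\b(?=\w)", text)]
ends = [m.start() for m in re.finditer(r"\b(?<=\w)", text)]

print(whole_word, word_start, inside, starts, ends)
```

The last position in ends is the end of the text, which counts as a potential end of a word.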
Back references
The text matched by a subexpression can be used further in the same expression through the labels "\1" to "\9".
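A common use of back references is finding a word that is repeated immediately. The sketch below assumes Python's re module, where "\1" has the same meaning; the sample sentence is hypothetical.

```python
import re

text = "this is is a test test"

# "(\w+)" captures a word; "\1" requires the same text again after one space.
doubled = re.findall(r"\b(\w+) \1\b", text)

print(doubled)
```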
Miscellaneous escape sequences
| \w | Equivalent to [[:word:]]. |
| \W | Equivalent to [^[:word:]]. |
| \s | Equivalent to [[:space:]]. |
| \S | Equivalent to [^[:space:]]. |
| \d | Equivalent to [[:digit:]]. |
| \D | Equivalent to [^[:digit:]]. |
| \l | Equivalent to [[:lower:]]. |
| \L | Equivalent to [^[:lower:]]. |
| \u | Equivalent to [[:upper:]]. |
| \U | Equivalent to [^[:upper:]]. |
| \C | Any single character, equivalent to ".". |
| \X | Match any Unicode combining character sequence, for example "a\x{0301}" (a letter a with an acute). |
| \Q | The begin quote operator, everything that follows is treated as a literal character until a \E end quote operator is found. |
| \E | The end quote operator, terminates a sequence started with \Q. |
| \a | Bell character 0x07. |
| \f | Form feed character 0x0C. |
| \n | Newline character 0x0A. |
| \r | Carriage return character 0x0D. |
| \t | Tab character 0x09. |
| \v | Vertical tab character 0x0B. |
| \e | ASCII Escape character 0x1B. |
| \0dd | An octal character code, where dd is one or more octal digits. |
| \xXX | A hexadecimal character code, where XX is one or more hexadecimal digits. |
| \x{XX} | A hexadecimal character code, where XX is one or more hexadecimal digits, optionally a Unicode character. |
| \cZ | An ASCII escape sequence control-Z, where Z is any ASCII character greater than or equal to the character code for '@'. |
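A few of these sequences can be sketched with Python's re module. Python supports "\d", "\w" and "\s" (Unicode-aware by default), but not "\Q" and "\E", so the expand_quotes helper below is a hypothetical emulation built on re.escape; it ignores the case of an escaped backslash directly before "Q".

```python
import re

text = "Order 66 shipped\tto dock_7"

numbers = re.findall(r"\d+", text)   # runs of digits
words = re.findall(r"\w+", text)     # runs of word characters
fields = re.split(r"\s+", text)      # split on whitespace

def expand_quotes(pattern):
    # Hypothetical helper: each \Q...\E span is replaced by its escaped form.
    # An unterminated \Q runs to the end of the pattern.
    return re.sub(r"\\Q(.*?)(?:\\E|\Z)", lambda m: re.escape(m.group(1)),
                  pattern, flags=re.DOTALL)

# Inside the quoted span "+" is literal; after \E, "\?" is an ordinary escape.
quoted = expand_quotes(r"\Q1+1=2\E\s*\?")
found = re.search(quoted, "is 1+1=2 ?")

print(numbers, words, fields, found.group(0))
```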