Regular Expressions
A regular expression is a pattern that describes the format of a string.
In the ProGrammar Grammar Definition Language,
regular expressions are delimited by single quotes.
Because regular expressions are grammar terms, they may occur
anywhere in a production rule where a term is legal. For example:
s ::= '[a-z]+' | some_other_symbol ;
Regular expressions consist of a combination of "regular" characters,
which are taken literally, and "meta" characters, which have special
meaning within the expression. A regular expression that is composed of
only regular characters is equivalent to a literal; e.g., the term
'foo bar'
is equivalent to the literal term:
"foo bar".
All letters and digits are interpreted literally, whereas most punctuation
characters have special interpretations. For example, the period matches
any single character.
    '...'   matches any three characters
    'a.b'   matches strings "aab", "abb", "axb", "a2b", etc.
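The period examples above can be tried out with a short sketch. It assumes Python's `re` module as a stand-in for the ProGrammar matcher; the names `any_three`, `a_dot_b`, and `matches` are hypothetical helpers, not part of ProGrammar. Note that Python's period does not match a newline by default, so the two dialects may differ at the edges.

```python
import re

# Assumption: Python's re syntax stands in for the quoted regular-expression
# term; fullmatch() requires the entire string to match the pattern.
any_three = re.compile(r"...")   # counterpart of '...'
a_dot_b = re.compile(r"a.b")     # counterpart of 'a.b'


def matches(pattern, text):
    """Return True when the whole of text matches pattern."""
    return pattern.fullmatch(text) is not None


print(matches(any_three, "x2!"))   # True: any three characters
print(matches(any_three, "xy"))    # False: only two characters
print(matches(a_dot_b, "axb"))     # True
print(matches(a_dot_b, "a2b"))     # True
print(matches(a_dot_b, "ab"))      # False: nothing between "a" and "b"
```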
In order to match a literal period, the period character must be preceded
by a backslash, which escapes its meta-interpretation.
    '\..'   matches a literal period, followed by any character
    'a\.b'  matches string "a.b" only
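As a rough illustration of the escaping rule, the following sketch again uses Python's `re` module as an assumed stand-in for the ProGrammar matcher; `literal_then_any`, `a_literal_dot_b`, and `matches` are hypothetical names chosen for this example.

```python
import re

# Assumption: Python's re escaping rules mirror the backslash rule described
# above for the period character.
literal_then_any = re.compile(r"\..")   # counterpart of '\..'
a_literal_dot_b = re.compile(r"a\.b")   # counterpart of 'a\.b'


def matches(pattern, text):
    """Return True when the whole of text matches pattern."""
    return pattern.fullmatch(text) is not None


print(matches(literal_then_any, ".x"))   # True: period, then any character
print(matches(literal_then_any, "xx"))   # False: first character is not "."
print(matches(a_literal_dot_b, "a.b"))   # True
print(matches(a_literal_dot_b, "axb"))   # False: "x" is not a literal period
```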
The following characters have special meaning within regular expressions:
.       A period matches any single character, except for the null
        character ('\0').
' '     Single quotes delimit a regular expression term.
" "     Double quotes declare a literal string within the expression. This
        string must be matched exactly, following all the ordinary rules for
        literal strings. Note that all special characters lose their
        meta-interpretations within the literal string.
( )     Parentheses group one or more regular expressions together
        as a single expression.
*       An asterisk matches zero or more occurrences of the expression that
        precedes it. For example, 'a*' matches the strings "a", "aaaaaa", and ""
        (empty string).
+       A plus sign matches one or more occurrences of the expression that
        precedes it. For example, 'a+' matches the strings "a" and "aaaaaa", but
        not "" (the empty string), since at least one occurrence is required.
?       A question mark matches exactly zero or one occurrence of the preceding
        expression. For example, 'a?' matches the strings "a" and "", and only
        those strings.
[ ]     Square brackets delimit a character list, which matches any single
        character in the list. For example, the regular expression
        '[0123456789]' matches any single digit. Within a character list, the
        following additional meta-characters are defined:

        -       The dash indicates a range of matching character values. For
                example, '[0-9]' matches any single digit, and '[a-z]' matches
                any single lower-case letter. The dash is interpreted literally
                when it is the first or last character in the list.

        ^       When a caret is the first character in a character list,
                it is interpreted as a negation symbol, so the list matches
                any character that is not in it. For example, '[^abc]'
                matches any character except for 'a', 'b' or 'c'.

        All other meta-characters, except for the backslash, lose their
        special interpretations when included in a character list and are
        taken literally. Because the right square bracket delimits the end
        of the character list, it must be escaped by a backslash when
        included as part of the list; e.g., '[a-z\]]'.
\       The backslash is the escape character, which overrides any special
        meaning associated with the character that follows it. For example, '\['
        is interpreted as a literal "[" (left square bracket) character, not the
        beginning of a character list. Standard C escape-sequences are also
        recognized (e.g., '\n' is interpreted as a newline).
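The quantifier, grouping, character-list, and escape rules listed above can be checked with a small table-driven sketch. It assumes Python's `re` module, whose syntax happens to agree with these particular rules; the `matches` helper and the `cases` table are hypothetical names for this example. The double-quoted literal form has no direct Python counterpart and is left out.

```python
import re


def matches(pattern, text):
    """Return True when the whole of text matches pattern (Python re syntax)."""
    return re.fullmatch(pattern, text) is not None


# (pattern, candidate string, result expected from the descriptions above)
cases = [
    (r"a*", "", True),           # zero occurrences are allowed
    (r"a*", "aaaaaa", True),
    (r"a+", "aaaaaa", True),
    (r"a+", "", False),          # at least one occurrence is required
    (r"a?", "a", True),
    (r"a?", "aa", False),        # at most one occurrence
    (r"(ab)+", "ababab", True),  # parentheses group "ab" as one expression
    (r"[0-9]", "7", True),
    (r"[0-9]", "x", False),
    (r"[^abc]", "d", True),      # negated character list
    (r"[^abc]", "b", False),
    (r"[a-]", "-", True),        # trailing dash is taken literally
    (r"[a-z\]]", "]", True),     # escaped right square bracket
    (r"\[", "[", True),          # escaped left square bracket
    (r"\n", "\n", True),         # C-style escape for newline
]

for pattern, text, expected in cases:
    result = matches(pattern, text)
    print(f"{pattern!r:12} {text!r:10} -> {result} (expected {expected})")
```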
Examples of Regular Expressions
alpha        ::= '[a-zA-Z]+';
numeric      ::= '[0-9]+';
alphanumeric ::= '[a-zA-Z0-9]+';
identifier   ::= '[a-zA-Z_]+[a-zA-Z0-9_]*';
hex_number   ::= '0[xX][a-fA-F0-9]+';
octal_number ::= '0[oO][0-7]+';
real         ::= '-?(([0-9]*\.[0-9]+)([eE][-+]?[0-9]+)?|([0-9]+))';
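To see which strings a few of these example patterns accept, the sketch below copies them into Python's `re` module, which is assumed here to interpret them the same way as ProGrammar does. The `rules` dictionary and the `accepts` helper are hypothetical names used only for this illustration.

```python
import re

# Patterns copied from the example rules; Python re syntax is an assumption.
rules = {
    "alpha": r"[a-zA-Z]+",
    "numeric": r"[0-9]+",
    "identifier": r"[a-zA-Z_]+[a-zA-Z0-9_]*",
    "hex_number": r"0[xX][a-fA-F0-9]+",
    "real": r"-?(([0-9]*\.[0-9]+)([eE][-+]?[0-9]+)?|([0-9]+))",
}


def accepts(rule_name, text):
    """Return True when the whole of text matches the named rule's pattern."""
    return re.fullmatch(rules[rule_name], text) is not None


print(accepts("alpha", "abc"))          # True
print(accepts("alpha", "abc1"))         # False: digits are not letters
print(accepts("identifier", "_tmp1"))   # True
print(accepts("identifier", "9lives"))  # False: may not start with a digit
print(accepts("hex_number", "0x1F"))    # True
print(accepts("real", "42"))            # True: plain integer alternative
print(accepts("real", "-1.5e-3"))       # True: sign, fraction, and exponent
print(accepts("real", "1."))            # False: digits must follow the period
```

As written, the `real` pattern allows an exponent only after a fractional part, so a string such as "1e5" is rejected.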