GDL Tutorial - The Phonebook Example
The Phonebook grammar parses text files that contain a list of names and
phone numbers, such as:
Adams, John Q. (900) 555-1234
Smith, John Q. (800) 765-4321
Doe, John (999) 999-9999
Doe, Jane J. (000) 012-3456
A grammar is made up of a list of production rules. Each rule contains
the name of a symbol, followed by the production symbol "::=", followed by the
definition of the symbol, followed by a semi-colon. Named symbols are also
called "nonterminal" symbols, since they are defined in terms of other symbols.
The basic approach for defining a grammar is to start with a high-level
description of the input, then incrementally define each symbol in terms of
other symbols, until the lowest level symbols are defined by terminals.
Terminals are primitive types that the parser recognizes implicitly and
already "knows" how to parse. Examples include literal strings,
regular expressions,
and any of the predefined types such as "alpha" and "numeric". Once all of the
symbols in a grammar have been defined in terms of other symbols or terminal types,
the grammar definition is "complete" and may be used for parsing.
The following grammar describes the syntax of this file.
PhoneBook ::= { PhoneBookLine };
PhoneBookLine ::= Name PhoneNumber;
Name ::= Lastname
","
Firstname
[MiddleInit "."] // optional term
;
Lastname ::= alpha; // one or more letters
Firstname ::= alpha; // one or more letters
MiddleInit ::= alpha; // one or more letters
PhoneNumber ::= "(" AreaCode ")"
PhonePrefix
"-"
PhoneSuffix ;
AreaCode ::= numeric<3>; // exactly three digits
PhonePrefix ::= numeric<3>; // exactly three digits
PhoneSuffix ::= numeric<4>; // exactly four digits
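As a rough cross-check of what this grammar accepts, the sketch below expresses one PhoneBookLine as a Python regular expression. It is only an illustrative analogue, not how ProGrammar parses. It assumes that whitespace between terms is skipped and that each entry sits on its own line (the grammar itself does not require this). The names LINE_RE and parse_phonebook are invented for this example.

```python
import re

# One PhoneBookLine; group names mirror the grammar's symbols.
# Assumption: whitespace between terms is ignored.
LINE_RE = re.compile(
    r"""
    \s*(?P<Lastname>[A-Za-z]+)\s*,          # Lastname ","
    \s*(?P<Firstname>[A-Za-z]+)             # Firstname
    (?:\s*(?P<MiddleInit>[A-Za-z]+)\s*\.)?  # [MiddleInit "."]
    \s*\(\s*(?P<AreaCode>[0-9]{3})\s*\)     # "(" AreaCode ")"
    \s*(?P<PhonePrefix>[0-9]{3})            # PhonePrefix
    \s*-\s*(?P<PhoneSuffix>[0-9]{4})\s*     # "-" PhoneSuffix
    """,
    re.VERBOSE,
)


def parse_phonebook(text):
    """Analogue of PhoneBook ::= { PhoneBookLine }; one dict per entry."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        m = LINE_RE.fullmatch(line)
        if m is None:
            raise ValueError(f"not a PhoneBookLine: {line!r}")
        entries.append(m.groupdict())
    return entries


sample = """Adams, John Q. (900) 555-1234
Smith, John Q. (800) 765-4321
Doe, John (999) 999-9999
Doe, Jane J. (000) 012-3456"""

book = parse_phonebook(sample)
print(book[0]["Lastname"], book[0]["AreaCode"])  # Adams 900
print(book[2]["MiddleInit"])                     # None
```

Each dictionary key corresponds to one nonterminal that is defined directly by a predefined type.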
The first production rule:
PhoneBook ::= { PhoneBookLine };
uses a repeater construct (delimited by '{' and '}') to indicate that
multiple occurrences of PhoneBookLine may exist in the input.
PhoneBookLine is subsequently defined as a conjunction of
two other symbols:
PhoneBookLine ::= Name PhoneNumber;
Because these names
are separated by whitespace, they are considered conjunctive. Conjunction
is analogous to a logical AND, indicating that both terms occur
in the input, in the order they are listed. In this case, each occurrence
of PhoneBookLine consists
of an occurrence of Name, followed by an occurrence of PhoneNumber.
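The repeater and conjunction semantics described above can be sketched with a few small parser functions in Python. This is a simplified illustration under stated assumptions, not ProGrammar's implementation: the helpers term, conjunction, and repeater are hypothetical, leading whitespace is skipped before each term, and this repeater also accepts zero occurrences, which the text does not specify.

```python
import re


def term(pattern):
    """Build a parser for one term; skips leading whitespace (an assumption)."""
    rx = re.compile(r"\s*(" + pattern + r")")

    def parse(text, pos):
        m = rx.match(text, pos)
        return (m.group(1), m.end()) if m else None

    return parse


def conjunction(*parsers):
    """All terms must match, in the order they are listed."""

    def parse(text, pos):
        values = []
        for p in parsers:
            result = p(text, pos)
            if result is None:
                return None  # one failed term fails the whole conjunction
            value, pos = result
            values.append(value)
        return values, pos

    return parse


def repeater(parser):
    """Match successive occurrences until the term stops matching."""

    def parse(text, pos):
        values = []
        while (result := parser(text, pos)) is not None:
            value, pos = result
            values.append(value)
        return values, pos

    return parse


# Stand-ins for Name and PhoneNumber, kept trivial for the example.
word = term(r"[A-Za-z]+")
digits = term(r"[0-9]+")
pair = conjunction(word, digits)  # analogous to: Pair ::= Word Digits;

print(pair("abc 123", 0))            # (['abc', '123'], 7)
print(pair("123 abc", 0))            # None, because the order is wrong
print(repeater(pair)("a 1 b 2", 0))  # ([['a', '1'], ['b', '2']], 7)
```

Each parser returns the matched values together with the position where parsing stopped, so a conjunction can hand the remaining input to its next term.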
In the next production:
Name ::= Lastname
","
Firstname
[MiddleInit "."] // optional term
;
the two quoted strings ("," and ".") are called literals.
Literals must match characters from the input exactly. The square brackets
around the last term indicate that the enclosed conjunction of
MiddleInit and "." is optional.
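One consequence worth illustrating is that the bracketed conjunction is optional as a unit: either both MiddleInit and "." are consumed, or neither is. The Python sketch below mimics this with an optional regex group. It is an analogue only, assuming whitespace between terms is skipped. NAME_RE and parse_name are names made up for the example.

```python
import re

# Name ::= Lastname "," Firstname [MiddleInit "."]
# Assumption: whitespace between terms is skipped.
NAME_RE = re.compile(
    r"\s*([A-Za-z]+)\s*,"        # Lastname ","
    r"\s*([A-Za-z]+)"            # Firstname
    r"(?:\s*([A-Za-z]+)\s*\.)?"  # [MiddleInit "."], all or nothing
)


def parse_name(text):
    """Return (last, first, middle_or_None, unparsed_rest), or None."""
    m = NAME_RE.match(text)
    if m is None:
        return None
    return m.group(1), m.group(2), m.group(3), text[m.end():]


print(parse_name("Adams, John Q."))  # ('Adams', 'John', 'Q', '')
print(parse_name("Doe, John"))       # ('Doe', 'John', None, '')
print(parse_name("Doe, John Q"))     # ('Doe', 'John', None, ' Q')
```

In the last call the period is missing, so the optional term is skipped entirely and the "Q" is left unparsed rather than being taken as a middle initial.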
The next set of productions defines three symbols:
Lastname ::= alpha; // one or more letters
Firstname ::= alpha; // one or more letters
MiddleInit ::= alpha; // one or more letters
Each of these symbols is defined in terms of alpha.
The keyword alpha is a predefined type, meaning that the parse engine
already knows how to parse it. The predefined types in ProGrammar are:
| Predefined type | Contains characters of type... |
|---|---|
| alpha | upper- and lower-case letters |
| alpha_ | upper- and lower-case letters; the underscore ('_') |
| alphanumeric | upper- and lower-case letters; digits |
| alnumblank | upper- and lower-case letters; digits; whitespace |
| identifier | upper- and lower-case letters; digits; the underscore ('_'). The first character cannot be a digit. |
| numeric | digits |
| quotedstring | any string of characters enclosed by quotation marks |
| whitespace | spaces, tabs, newlines, carriage-returns |
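To make the character sets concrete, the predefined types can be approximated by regular expressions. The mapping below is a hedged Python sketch rather than ProGrammar's actual definitions: it assumes ASCII letters, double quotation marks with no escape sequences for quotedstring, and a default length of one or more characters. PREDEFINED and matches are names invented here.

```python
import re

# Regex approximations of the predefined types (ASCII letters assumed).
PREDEFINED = {
    "alpha":        r"[A-Za-z]+",
    "alpha_":       r"[A-Za-z_]+",
    "alphanumeric": r"[A-Za-z0-9]+",
    "alnumblank":   r"[A-Za-z0-9 \t\n\r]+",
    "identifier":   r"[A-Za-z_][A-Za-z0-9_]*",  # first character not a digit
    "numeric":      r"[0-9]+",
    "quotedstring": r'"[^"]*"',                 # assumes no escaped quotes
    "whitespace":   r"[ \t\n\r]+",
}


def matches(type_name, text):
    """True if the whole of text is a value of the named predefined type."""
    return re.fullmatch(PREDEFINED[type_name], text) is not None


print(matches("alpha", "Adams"))     # True
print(matches("alpha", "R2D2"))      # False
print(matches("identifier", "_x1"))  # True
print(matches("identifier", "1x"))   # False
```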
The predefined type numeric is used in the following production rules:
AreaCode ::= numeric<3>; // exactly three digits
PhonePrefix ::= numeric<3>; // exactly three digits
PhoneSuffix ::= numeric<4>; // exactly four digits
Note the use of length constraints, denoted by "<" and ">". This construct
limits the minimum and maximum length, in character positions, of the term
that precedes it. The generalized notation for a length constraint is:
any-term < min-length, max-length >
In the preceding production rules, the length constraints are interpreted as follows:
numeric<3> exactly three digits
numeric<4> exactly four digits
By default, the minimum length of a term is one, and there is no maximum length.
There are several usage variations for length constraints, as shown in the
following examples:
| Usage | Min Length | Max Length |
|---|---|---|
| numeric <3, 4> | 3 | 4 |
| numeric <3, > | 3 | unbounded |
| numeric <3> | 3 | 3 |
| numeric < , 4> | 0 | 4 |
| numeric < , > | 0 | unbounded |
| numeric | 1 | unbounded |
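The rules for omitted bounds can be restated as a small Python function. This is a sketch of the rules in the table, not ProGrammar code. The names length_bounds and numeric_regex are hypothetical, and None stands for "unbounded".

```python
import re


def length_bounds(constraint=None):
    """Return (min_length, max_length); max_length None means unbounded."""
    if constraint is None:
        return 1, None  # no constraint: at least one, no maximum
    inner = constraint.strip()
    if not (inner.startswith("<") and inner.endswith(">")):
        raise ValueError(f"not a length constraint: {constraint!r}")
    parts = [p.strip() for p in inner[1:-1].split(",")]
    if len(parts) == 1:  # <n> means exactly n
        n = int(parts[0])
        return n, n
    if len(parts) != 2:
        raise ValueError(f"too many bounds: {constraint!r}")
    low, high = parts
    # An omitted minimum is 0; an omitted maximum is unbounded.
    return (int(low) if low else 0), (int(high) if high else None)


def numeric_regex(constraint=None):
    """Regex analogue of 'numeric' with an optional length constraint."""
    low, high = length_bounds(constraint)
    return "[0-9]{%d,%s}" % (low, "" if high is None else high)


print(length_bounds("<3, 4>"))  # (3, 4)
print(length_bounds("<3, >"))   # (3, None)
print(length_bounds("<3>"))     # (3, 3)
print(length_bounds("< , 4>"))  # (0, 4)
print(length_bounds())          # (1, None)
print(bool(re.fullmatch(numeric_regex("<3>"), "5551")))  # False
```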
The following table summarizes the GDL constructs discussed in this example:
| GDL Construct | Notation | Description |
|---|---|---|
| Repeater | { repeat-term } | Indicates that a term may have multiple successive occurrences in the input. |
| Literal | "some string" | A value that must match the input exactly, in order to parse successfully. |
| Conjunction | implicit | Operates like a logical-AND. All terms in a conjunction must match the input, in the order they are listed, for the parse to succeed. |
| Optional Term | [ ] | Indicates that the enclosed term is optional in the input. |
| Length Constraint | < min, max > | Limits the minimum and maximum length, in character positions, of a term. |
| Predefined Type | type name | Any of the predefined types; including alpha, numeric, and alphanumeric. |