CHAPTER 17
MISCELLANEOUS TOPICS IN PARSING
This chapter has three independent parts. The first shows a way by
which syntax errors can be reported for languages in general. The
second part introduces the ~scanner, a preprocessor which deals with
white space. Finally, we consider other parsing methods.
17.1
SYNTAX ERROR REPORTING
Errors are often committed while writing programs. Some errors have
to do with syntax. A comma or semicolon might be missing or misplaced.
Other errors have to do with semantics. A semantic error might be
a misspelling of a function name, or forcing a function to combine
data of types the function isn't prepared to handle.
We concentrate here on reporting syntax errors. Semantic, or ~datatype
errors are considered in Chapter 30.
The method works for all languages, even those whose rules of grammar
are unrestricted.
17.1.1
Strategy
Syntactically correct specifications have the property that the entire
text can be reduced to one, full-spanning part-of-speech. The
following correct (ICL) specification is rewritable into one, full-
spanning occurrence of the part-of-speech <STATEMENT>:
WRITE(4); WRITE(5); WRITE(1+2*3); WRITE(6);
Sometime during the rewriting process, these four statements are
rewritten to:
<STATEMENT> <STATEMENT> <STATEMENT> <STATEMENT>
These are ultimately reduced to one <STATEMENT>, thanks to the rule:
<STATEMENT> <STATEMENT> -> <STATEMENT>
Suppose now that we introduce a syntax error, changing the "+" into
the meaningless "%":
WRITE(4); WRITE(5); WRITE(1%2*3); WRITE(6);
We can't quite interpret this as four <STATEMENT>s. Certainly, the
first, second, and fourth statements can be seen as <STATEMENT>s, but
the third can't. If we nonetheless try to complete the rewriting as
much as possible, we get:
<STATEMENT> WRITE( <EXPR> % <EXPR> ); <STATEMENT>
The first ~two statements can be reduced all the way to a single
<STATEMENT>. Similarly, the fourth can be rewritten to <STATEMENT>.
The third statement, the "WRITE(...);" visible above, can't be
rewritten into a <STATEMENT>. However, some rewrites can be carried
out. The "1" is rewritten into an <EXPR>, because it is a correct
<EXPR>. Also, the "2*3" can be rewritten to <EXPR>.
We propose that this latest line, which is reduced as far as possible,
~be the syntax error message. Upon looking at it, we see that much
valid rewriting has occurred.  The unrewritten part, the
"<EXPR> % <EXPR>" stands out. ~Erroneous ~parts ~stand ~out while
correct parts are compact. Upon seeing that the "<EXPR> % <EXPR>"
won't reduce, one might recall suddenly that "%" can't combine two
<EXPR>s. (If one comes away believing that "%" is a valid operator,
one should try to find a syntax rule of the language which involves
"%" with a pair of <EXPR>s, which one then discovers doesn't exist).
We deliver as the error message the ~shortest possible phrase that
spans the entire input text.
Here is another syntax error:
WRITE( 1+2*3 ;
The shortest possible phrase spanning this input is:
WRITE( <EXPR> ;
In this partially compressed form, the missing close parenthesis
stands out.
Unfortunately, the shortest phrase might be misleading, as in the
following example. We would expect the text:
{ ABS(5) ; 6 % 7 }
to reduce to:
{ <EXPR> ; <EXPR> % <EXPR> }
There is however a shorter phrase:
{ <STATEMENT> <EXPR> % <EXPR> }
The "ABS(5);", ~including the semicolon, can syntactically be seen as
a <STATEMENT>, as though ABS were the name of a procedure (a non-value-
returning function).
The purpose of a syntax error message is to jiggle the mind of the
programmer so he or she can see the error. It is ~not to provide a
correction. If a system can correct errors, then that small set of
correctable errors could just as well be part of the language.
Usually, proposed corrections themselves are misleading.
This form of syntax error message has worked well during ICL's use.
The experienced ICL programmer can discover his or her errors
immediately. Newer users don't fare quite as well for a while. Some
knowledge of the rules of grammar is required. For example, the error
message:
WRITE( <EXPR> ;
makes sense only if you know that calls to functions (procedures)
require a close parenthesis as well as the open parenthesis. Also,
to interpret the error message involving the "ABS(5);" requires the
user to know that a "<STATEMENT>" might be an "<EXPR>;" in disguise.
With a little experience, however, these messages are effective.
BOX: What is the syntax error message?
BOX:
BOX: How could it be misleading?
17.1.2
Implementation
How do you go about finding the shortest full-spanning phrase?
One could work directly off of the CHOICE_OF_PHRASES structure.
However, there is an easier method that involves merely the
introduction of a small grammar.
We invent a new part-of-speech called SYNTAX_ERROR. We provide a
set of new rules involving SYNTAX_ERROR. Here is the first:
1) any_character <SYNTAX_ERROR> -> <SYNTAX_ERROR>
This rule by itself will render ~any sequence of characters, like the
input text, into a full-spanning SYNTAX_ERROR.
If we GIVE SYNTAX_ERROR on the righthand end of the input text, then
this rule will first apply by consuming the single, last character.
This rule's second application will then pick up the second to last
character. By applying this rule in such a right-to-left manner, the
entire input text will rewrite ultimately into one full-spanning
SYNTAX_ERROR.
This single-rule grammar acknowledges only characters, and not parts-
of-speech like EXPR or STATEMENT. We let SYNTAX_ERROR gobble up
EXPRs and STATEMENTs as well as characters. We introduce two more
rules similar to the first:
2) <EXPR> <SYNTAX_ERROR> -> <SYNTAX_ERROR>
3) <STATEMENT> <SYNTAX_ERROR> -> <SYNTAX_ERROR>
We might include two more:
4) <ID> <SYNTAX_ERROR> -> <SYNTAX_ERROR>
5) <QUANTIFIER> <SYNTAX_ERROR> -> <SYNTAX_ERROR>
All these rules indicate that we're willing to see, to the left of
SYNTAX_ERROR, an EXPR, STATEMENT, ID, or QUANTIFIER.
Beyond the first rule, the others introduce ambiguities. For example,
figure 17.1 shows two possible parsings for the ID ALPHA when
SYNTAX_ERROR appears to its right. "ALPHA" can be seen as a single
ID, or as a sequence of five characters.
Of all possible SYNTAX_ERROR interpretations, we would like the
shortest one. For example, we want the interpretation where ALPHA
is an ID rather than the other, longer, five-character interpretation.
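As a concrete sketch, finding that shortest interpretation can be
treated as a shortest-path problem over the input text.  The following
Python fragment is only a sketch, assuming the failed parse leaves
behind a list of recognized spans (start, end, part-of-speech); the
function name and data layout here are ours, not ICL's:

    def shortest_phrase(text, spans):
        # spans: (start, end, pos) triples that the parser did manage
        # to recognize, e.g., (7, 12, '<EXPR>') meaning text[7:12]
        # reduced to <EXPR>.
        n = len(text)
        cost = [float('inf')] * (n + 1)  # cost[i]: fewest items covering text[i:]
        best = [None] * (n + 1)          # item chosen at i, and where it ends
        cost[n] = 0
        for i in range(n - 1, -1, -1):   # right to left, as rule 1 applies
            cost[i], best[i] = cost[i + 1] + 1, (text[i], i + 1)
            for (s, e, pos) in spans:    # rules 2-5: absorb a recognized phrase
                if s == i and cost[e] + 1 < cost[i]:
                    cost[i], best[i] = cost[e] + 1, (pos, e)
        items, i = [], 0
        while i < n:
            item, i = best[i]
            items.append(item)
        return ' '.join(items)

Run over "WRITE( 1+2*3 ;" with its recognized spans, this yields the
compacted message "WRITE( <EXPR> ;" (with the surviving characters
spelled out one by one).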
17.1.2.1
The Semantics Of SYNTAX_ERROR
The ambiguity shown in figure 17.1 will appear as an ~OR-block under
the one full-spanning part-of-speech SYNTAX_ERROR (Chapter 15). It is
at that OR-block that we want to choose one interpretation over the
other.
Let's establish that the semantics of SYNTAX_ERROR is an action which
sets two global variables:
VAR MESSAGE = TEXT ;
COST = INT ;
The semantic action of SYNTAX_ERROR puts into MESSAGE the (compacted)
text over which the SYNTAX_ERROR spans. We will ultimately write the
variable MESSAGE to deliver the finished syntax error message. The
variable COST is set to the length of the text MESSAGE.
Our first rule, with semantics, is:
<ANY_CHARACTER[C]:X> <SYNTAX_ERROR:E> ->
<SYNTAX_ERROR:
<*E*>; "Put the given SYNTAX_ERROR's text into
MESSAGE, and put its length into COST."
MESSAGE:= C <$ MESSAGE; "Append our character
onto MESSAGE"
COST:= COST+1;
>
The same goes for any of the other four rules, such as the second:
<EXPR:X> <SYNTAX_ERROR:E> ->
<SYNTAX_ERROR:
<*E*>;
MESSAGE:= ' <EXPR> ' $$ MESSAGE ;
COST:= COST+1;
>
The semantics of each leaves in MESSAGE the compressed text spanned
by SYNTAX_ERROR.  The latter rule augments MESSAGE not with the
original text, but with the single item "<EXPR>".  COST is left holding
the number of items in MESSAGE.
The choice to minimize COST is made at OR-blocks. Let's do the
following at OR-blocks. (This would replace the assignment to OR in
the definition of GIVE, Section 15.2.1). A and B are the two semantic
interpretations, such as the two interpretations put together under
the ~one full-spanning SYNTAX_ERROR.
OR:= //[A;B;]
BEGIN VAR M=TEXT; C=INT;
" Do the first interpretation... "
<*A*>; "Set COST and MESSAGE for one interpretation"
C:= COST; M:= MESSAGE; "Remember them"
" Do the second interpretation... "
<*B*>;
" Compare costs (COST and C). Set COST and MESSAGE
to the one, cheapest interpretation... "
IF C < COST THEN COST:= C;
MESSAGE:= M;
FI
" We have set COST and MESSAGE, as the semantics of
any SYNTAX_ERROR should. "
END \\ ;
Now, the OR-block under a SYNTAX_ERROR will choose the one
interpretation involving the ID, whose cost is 1, and not the other
interpretation whose cost is 5.
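In a language without ICL's semantic blocks, the same OR-block
discipline might be sketched in Python with thunks.  MESSAGE and COST
are globals mirroring the ICL variables; or_block is our own name:

    MESSAGE, COST = '', 0

    def or_block(a, b):
        # a and b are zero-argument actions; each, when invoked, sets
        # MESSAGE and COST for its own interpretation.
        def choose():
            global MESSAGE, COST
            a()                     # do the first interpretation
            m, c = MESSAGE, COST    # remember its results
            b()                     # do the second interpretation
            if c < COST:            # keep whichever is cheaper
                MESSAGE, COST = m, c
        return choose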
In summary, the following is performed if a given input text fails to
parse syntactically:
1) Activate the new syntax-error grammar.
2) GIVE SYNTAX_ERROR to the right of the given text.
(This causes all the SYNTAX_ERROR rules to apply
appropriately).
3) Pick the one full-spanning SYNTAX_ERROR.
4) Invoke its semantics, which sets MESSAGE (and COST).
5) Print the contents of MESSAGE.
17.1.2.2
Suppressing An Exponential Cost
Steps can be taken, as they are in Chapters 27 and 28, to avoid
invoking any SYNTAX_ERROR's semantics ~more ~than ~once. This
suppresses an otherwise exponential computing cost down to a linear
cost.
Here, we can have the semantics of each SYNTAX_ERROR rule modify
itself objectively upon first invocation, as in:
... <SYNTAX_ERROR> ->
<SYNTAX_ERROR:
BEGIN VAR M=TEXT; C=INT;
<*E*>;
MESSAGE:= ... $$ MESSAGE ;
COST:= COST+1;
M:= MESSAGE; C:= COST;
@(SELF_):= //[M;C;]
MESSAGE:= M;
COST:= C; \\ ;
END
>
The semantics here operate as before, except for the final objective
modification:
@(SELF_) := ... ;
This final statement overwrites the block representing this rule
application's semantics. That block, SELF_, is overwritten by a new
BASIC_PROCESS which sets MESSAGE and COST. Upon non-first invocations,
the new program:
MESSAGE:= M;
COST:= C;
executes.  It does not recompute; rather, it sets the two variables to
the results computed earlier, upon the ~first invocation.
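A language without objective modification can get the same one-shot
behavior by wrapping the semantic action, as in this Python sketch
(memoize_semantics is our own name, not ICL's):

    def memoize_semantics(action):
        # First call: run 'action' (which sets MESSAGE and COST) and
        # remember the results.  Later calls: just restore them,
        # skipping the recomputation entirely.
        saved = {}
        def wrapper():
            global MESSAGE, COST
            if not saved:
                action()
                saved['m'], saved['c'] = MESSAGE, COST
            else:
                MESSAGE, COST = saved['m'], saved['c']
        return wrapper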
17.2
THE SCANNER - DEALING WITH WHITE SPACE
"White space" refers to the characters space, tab, carriage-return, and
form-feed, or any combination of them. Those characters use no ink.
Our discussion of parsing up until now has ignored white space. For
example, "1+2" has no white space, and can be seen as a <FORM>.
However, "1 + 2" does contain white space, and thus far won't be seen
as a <FORM>. We will see the phrase:
<FORM> ~<SPACE> + ~<SPACE> <TERM>
but this will not trigger the "+" rule:
<FORM> + <TERM> -> <FORM>
Should we therefore introduce the rule:
<FORM> <SPACE> + <SPACE> <TERM> -> <FORM>
which does allow for white space? (We would also need to introduce
two more rules:
<FORM> <SPACE> + <TERM> -> <FORM>
and
<FORM> + <SPACE> <TERM> -> <FORM> ).
It is easier to have our program that takes user input simply ignore
white space. Section 12.4 introduced the taking of user input. We
now modify it:
C:= NIL;
FOR each input character CHAR
DO
IF CHAR is not white space THEN
LEFT:= C;
C:= NIL;
GIVE( [LEFT:LEFT POS:CHAR] );
FI
END
We GIVE only characters that are not white space.
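The same loop can be sketched in Python, assuming a give function that
records a character together with its left neighbor and returns the
new block (the names here are ours):

    WHITE = set(' \t\r\n\f')

    def scan(text, give):
        # 'give' plays the role of GIVE: it receives the left neighbor
        # and the character, and returns the block built for that
        # character.
        c = None
        for char in text:
            if char not in WHITE:      # white space is simply dropped
                left, c = c, None
                c = give(left, char)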
17.2.1
White Space Does Affect Numbers And Names
We must be careful about throwing away white space. White space can
serve to separate two names or two numbers. For example:
1234 5678 is different from 12345678
and
ABRAHAM LINCOLN is different from ABRAHAMLINCOLN
If our taking of user input suppresses white space, then those
distinctions will be lost, unless we delegate the recognition of names
and numbers to the task of taking user input.
We can afford the loss of white space if we remove from the grammar the
recognition of names and numbers, and place the responsibility instead
on the ~scanner. The ~scanner is the process which takes user input,
including the recognition of names and numbers. The program fragment
just shown is a simple scanner, omitting the recognition of names and
numbers.
Our scanner will behave like the simple program just shown, except that
it will perform additional GIVEs, to introduce occurrences of names and
numbers. An occurence of a name or a number will not replace its
underlying characters.  Rather, a name or a number will be introduced
as an alternative interpretation alongside the raw characters themselves
(figure 17.2).
For simplicity, we will say that a name, called an ID (short for
IDentifier), will consist only of the alphabetical characters A through
Z. In general, IDs are usually allowed to contain digits as well,
and even the underscore character "_". The reader can perform the
generalization. Numbers will consist only of digits. (The minus sign
in a number will not be recognized by the scanner. That will be left
to a rule of grammar).
We introduce two Boolean variables IS_DIGIT and IS_ALPHA, to
characterize the input character. Our scanner is programmed as
follows:
C:= NIL;
FOR each input character CHAR
DO
IS_DIGIT:= '0' =< CHAR & CHAR =< '9' ;
IS_ALPHA:= 'A' =< CHAR & CHAR =< 'Z' ;
worry about identifiers
worry about numbers
IF CHAR is not white space THEN
LEFT:= C;
C:= NIL;
GIVE( [LEFT:LEFT POS:CHAR] );
FI
END
At the end of each iteration, C contains the non-white input character
as a PHRASE block.
Therefore, upon the start of the next iteration, we know that
C contains the previous character (left over from the previous
iteration). C will continue to reference the previous character while
we worry about identifiers and numbers. Only at the last minute do
we set LEFT to C and GIVE the current character.
17.2.2
Worrying About IDentifiers
Here is how we worry about identifiers. (A similar program fragment
will worry about numbers). We introduce three new variables:
VAR ID_TEXT = TEXT ; "The text of the identifier
seen up to now. "
ID_LEFT = CHOICE_OF_PHRASES ;
"This will be the LEFT field for the
finished identifier. It points to the
CHOICE_OF_PHRASES residing immediately
to the left of the identifier
(figure 17.2)."
WAS_ALPHA = BOOLEAN; "Was the ~previous character
alphabetical? "
We "worry" as follows:
IF "we are beginning an identifier ..."
-WAS_ALPHA & IS_ALPHA THEN
ID_LEFT:= C; "Remember left neighbor to this new ID.
(Recall that C points to the previous,
non-alpha character)"
ID_TEXT:= {CHAR}; "The first character of the ID"
FI
IF "we are in the middle of an identifier"
WAS_ALPHA & IS_ALPHA THEN
ID_TEXT:= ID_TEXT $> CHAR;
"Right-append the next character onto
the ID's text"
FI
IF "we are beyond the end of an identifier"
WAS_ALPHA & -IS_ALPHA THEN
"GIVE the identifier..."
GIVE( [LEFT:ID_LEFT POS:<ID> SEM: ?? ] );
FI
WAS_ALPHA:= IS_ALPHA; "In preparation for the next iteration"
We've broken the capture of an ID into three parts, the beginning, the
middle, and the end. At the beginning, we set ID_LEFT to the
CHOICE_OF_PHRASES to the left of the present, first alpha character.
In the middle, we collect up the characters. At the end, we GIVE the
ID.
When we execute the GIVE, recall that C contains the previously
input character, the last character of the identifier. (We've just
taken the first character beyond the end of the identifier, a non-alpha
character). By GIVing the <ID> now, the <ID> resides on the
CHOICE_OF_PHRASES containing the <ID>'s last character, and so the <ID>
and its characters share the same span.
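Pulling the three "worry" cases together, the whole scanner might be
sketched in Python as follows.  This is only a sketch: give is assumed
to return the CHOICE_OF_PHRASES it just extended, and numbers are
omitted just as in the fragments above:

    def scan(text, give):
        c = None
        id_left, id_text, was_alpha = None, '', False
        for char in text + ' ':          # a trailing blank flushes a final ID
            is_alpha = 'A' <= char <= 'Z'
            if not was_alpha and is_alpha:    # beginning: remember left neighbor
                id_left, id_text = c, char
            elif was_alpha and is_alpha:      # middle: collect the character
                id_text += char
            elif was_alpha and not is_alpha:  # beyond the end: GIVE the <ID>
                c = give(id_left, '<ID>', sem=id_text)
            was_alpha = is_alpha
            if char not in ' \t\r\n\f':
                left, c = c, None
                c = give(left, char)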
The part-of-speech of identifiers is <ID>, which we declare now:
POS ID : TEXT ;
The semantics of an ID is text, the text of the identifier. We will
come back to the specification of the SEM field within the call to
GIVE, the "??", in a moment. For completeness, we must set WAS_ALPHA
to FALSE at the very beginning, along with our scanner's first
assignment "C:=NIL;".
The semantics of the part-of-speech <ID> is TEXT, as we have declared.
Recall from chapter 4 that although we declare:
POS ID : TEXT ;
we actually get:
POS ID : //TEXT\\ ;
The semantics of <ID> is a ~process whose invocation delivers a TEXT.
Since the variable ID_TEXT contains the ID's text, we specify the
semantics, the "??" almost as:
// ID_TEXT \\
This is a process whose invocation will deliver the text.
It is not quite legal. Recall also from chapter 2 (and Section 23.6)
that we need to retain access to the value presently in the variable
ID_TEXT, so that in the future, when these semantics are invoked,
ID_TEXT will then hold what it now holds. We retain access to the
present ID_TEXT by writing the semantics legally as:
//[ID_TEXT;] ID_TEXT \\
As a whole, the GIVE appears as:
GIVE( [LEFT: ID_LEFT POS: <ID> SEM: //[ID_TEXT;] ID_TEXT \\]);
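The same distinction arises in Python, where a closure reads its free
variables late.  A one-line illustration of each behavior:

    sem = lambda: id_text      # wrong: reads whatever id_text holds at call time
    sem = lambda t=id_text: t  # right: captures today's value, like //[ID_TEXT;]...\\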
A similar treatment GIVEs <NUMBER>s when appropriate. Figure 17.2
shows an ambiguous phrase created by our scanner.
Finally, notice how white space separates <ID>s and separates
<NUMBER>s. IS_ALPHA is FALSE when given a blank character. This will
serve to terminate any <ID> in the making. While the blank character
terminates an <ID>, the blank itself is not GIVEn, because of the IF-
statement at the end of the FOR-loop.  White space has thus been kept
around only long enough to separate identifiers and numbers, but not
to make it into the ambiguous phrase.
BOX: What service does the scanner perform?
BOX:
BOX: Where must white space be considered?
BOX:
BOX: Does the scanner introduce ambiguous phrases of its
BOX: own?
BOX:
BOX: Who consumes <ID>s?  (See Part 6 for examples).
17.3
OTHER PARSING METHODS
To round out our discussion of parsing, we present now other parsing
methods. These methods are in wide use. However, they are more
restricted than ours and require greater effort to implement because
they shun ambiguity.
17.3.1
Recursive Descent
Perhaps the most straightforward and simplest parsing method is
~recursive ~descent, which is workable for some context-free grammars.
Beyond relatively simple grammars however, this method can become
unwieldy, and might require something called ~backtracking.
Backtracking refers to the retreat of one train of thought in favor of
trying another. In general, backtracking can introduce an exponential
cost as a function of the length of the input string.
The delightful thing about recursive descent is how easy it appears to
transform rules of grammar into functions in a programming language.
We write one function for each (non-terminal) part-of-speech. For
example, consider the rules:
<FORM> ::= <FORM> + <TERM>
<FORM> ::= <TERM>
For a given part-of-speech, e.g., <FORM>, we write one function, called
FORM, which represents all the rules whose give-phrases are <FORM>.
(The give-phrase resides to the left of the "::=").  The FORM function
is to be called whenever we expect to see a <FORM> next. Similarly,
we call the function TERM whenever we expect to see a <TERM>.
For example, consider now just the first rule. We transform it into a
function that looks almost the same:
DEFINE FORM:
FORM;
IF OK & CHAR = '+' THEN
NEXT_CHAR;
TERM;
ELSE OK:= FALSE; FI
ENDDEFN
It says that in looking for a <FORM>, see first a <FORM> (the first
line), and if that succeeds (OK), then insist that the next character
(CHAR) is a "+". Given that, consume the "+" (NEXT_CHAR) and proceed
to find a <TERM>. If this scenario is impossible, then set OK to
FALSE. Each such function has the option to fail, simply by setting OK
to FALSE.
Unfortunately, this function FORM will call itself before doing
anything else. That call to FORM will again call FORM, ad infinitum.
This infinite recursion is a symptom of the rule's own ~left
~recursion. A rule is ~left ~recursive if the first part-of-speech in
the want-phrase (that which is to the right of the "::=") matches the
rule's give-phrase (left of "::="). The straightforward implementation
of left recursive rules always causes infinite recursion.
We avoid left recursion by rewriting the rule instead as:
<FORM> ::= <TERM> + <FORM>
This rule has no left recursion, although it does have right
recursion (the <FORM> on the far right). Right recursion is harmless,
as in the following rendition of this rule:
DEFINE FORM:
TERM;
IF OK & CHAR = '+' THEN
NEXT_CHAR;
FORM;
ELSE OK:= FALSE; FI
ENDDEFN
This causes no infinite recursion because the call to FORM happens
only after consuming at least one character (the "+"). Therefore,
the recursion can go only as deep as the number of characters in the
input string.
We encapsulate both rules:
<FORM> ::= <TERM> + <FORM>
<FORM> ::= <TERM>
by writing:
DEFINE FORM:
TERM;
IF OK & CHAR = '+' THEN
NEXT_CHAR;
FORM;
FI
ENDDEFN
Surprisingly, we have removed the ELSE clause and simultaneously
accommodated the second rule.  That is, if the character following the
first TERM is not a "+", then we don't consume it (NEXT_CHAR).
Instead we just succeed, having performed now the second rule.
Similarly, the rules:
<TERM> ::= <ATOM> * <TERM>
<TERM> ::= <ATOM>
map to:
DEFINE TERM:
ATOM;
IF OK & CHAR = '*' THEN
NEXT_CHAR;
TERM;
FI
ENDDEFN
Finally, the rules:
<ATOM> ::= ( <FORM> )
<ATOM> ::= <NUMBER>
map to:
DEFINE ATOM:
IF CHAR = '(' THEN
NEXT_CHAR;
FORM;
IF CHAR = ')' THEN
NEXT_CHAR; " and succeed "
ELSE OK:= FALSE; FI
ELSE NUMBER; FI
ENDDEFN
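For readers who prefer a runnable rendition, here is the same trio of
functions as a Python sketch.  Exceptions stand in for the OK flag, and
an index into the string stands in for CHAR and NEXT_CHAR; all names
here are ours:

    class Parser:
        def __init__(self, text):
            self.text, self.i = text, 0

        def char(self):
            return self.text[self.i] if self.i < len(self.text) else ''

        def next_char(self):
            self.i += 1

        def form(self):              # <FORM> ::= <TERM> + <FORM> | <TERM>
            self.term()
            if self.char() == '+':
                self.next_char()
                self.form()

        def term(self):              # <TERM> ::= <ATOM> * <TERM> | <ATOM>
            self.atom()
            if self.char() == '*':
                self.next_char()
                self.term()

        def atom(self):              # <ATOM> ::= ( <FORM> ) | <NUMBER>
            if self.char() == '(':
                self.next_char()
                self.form()
                if self.char() != ')':
                    raise SyntaxError('missing )')
                self.next_char()
            else:
                self.number()

        def number(self):            # one or more digits
            if not self.char().isdigit():
                raise SyntaxError('number expected')
            while self.char().isdigit():
                self.next_char()

Parser('1+2*(3+4)').form() consumes the whole input; a malformed input
such as '1+' raises SyntaxError instead of setting OK to FALSE.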
For recursive descent to work without backtracking, it must be possible
to decide which ~one of the <FORM> rules to pursue just by looking at
the next character (CHAR), or possibly several characters ahead. The
presence or absence of a "(" makes the decision easy for the <ATOM>
rules, in the function ATOM.
Fortunately, with the <FORM> rules (and also the <TERM> rules), there
is no need to distinguish the alternatives, because both alternatives
begin with <TERM> (or <ATOM> for the <TERM> rules). Only after the
<TERM> is consumed do we need to distinguish the alternatives. The
presence or absence of a "+" makes the distinction. We are lucky with
this grammar. Notice that the FORM function consumes ~as ~much ~as
~possible, taking in as many "+"s as possible.
Our good fortune breaks down if we introduce a rule like:
<?> ::= <FORM> + <NUMBER> %
FORM's maximal consumption of "+"s would consume the final "+<NUMBER>"
in:
<NUMBER> + <NUMBER> %
FORM's greediness will never let our latest rule apply, even when it
is theoretically necessary, as in this latest phrase.
As grammars contain more and more rules, these problems multiply.
(For example, our model language ICL has nearly 150 rules). Imagine
the difficulty in introducing a new rule using recursive descent.
With our ambiguous method, in contrast, adding a new rule is a trivial
effort.
Backtracking can be introduced so as to allow the trial of more than
one alternative. One can pursue one alternative until failure is
discovered, and then pursue another alternative. Unfortunately, while
pursuing one alternative, another occurrence of backtracking might be
necessary. Within each of those alternatives, backtracking may occur
even again.
Backtracking within backtracking, etc., will wind up considering in
total an exponential number of alternatives, and hence building up
an exponential cost.  Assuming that there is always a choice of two
alternatives, a bifurcation can occur as often as once per character
in the given input string, giving rise to 2-to-the-N'th
alternatives where N is the length of the input string.
BOX: How are rules of grammar transformed into programs
BOX: in recursive descent?
BOX:
BOX: When might it be impossible to translate successfully
BOX: rules into recursive descent programs?
BOX:
BOX: Why could backtracking have an exponential cost?
17.3.1.1
Semantics In Recursive Descent
Let's introduce semantics at least for the <FORM> rules. We assume
now that any call to FORM, TERM, ATOM, or NUMBER leaves the semantic
value in the new variable ANSWER. Here is the FORM function with
semantics:
DEFINE FORM:
TERM;
IF OK & CHAR = '+' THEN
NEXT_CHAR;
X:= ANSWER; "the TERM's value"
FORM;
ANSWER:= X + ANSWER; "the sum of TERM and
FORM"
FI
ENDDEFN
If after the TERM there is no "+", then the ANSWER as left by the TERM
acts also as the ANSWER for our <FORM>.
If we were implementing "-" instead of "+", the specification of
semantics gets more difficult. Due to our conversion to ~right
~recursion, the expression
1-2-3-4
will group as
1-(2-(3-4))
Because "-" is not associative, this answer is different from the
answer we expect:
((1-2)-3)-4
To get around this problem, we can no longer have ANSWER represent a
number.
However, by having ANSWER represent a ~list ~of ~numbers, our "-"
semantics need not perform the subtractions immediately. The list of
numbers 1,2,3,4 can be formed, and the subtractions can occur after
the parsing, since we can traverse that list in any order, an order
not dictated by our parser's execution that groups right-to-left.
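A sketch of that idea in Python, replacing the form method of the
parser above and assuming number() now returns the integer it
consumed: the right recursion merely appends to a list, and a
left-to-right fold afterwards restores the expected grouping:

    def form(self, values=None):     # <FORM> ::= <NUMBER> - <FORM> | <NUMBER>
        values = [] if values is None else values
        values.append(self.number())
        if self.char() == '-':
            self.next_char()
            self.form(values)        # right recursion extends the same list
        return values

    def evaluate(values):
        # Fold left: [1, 2, 3, 4] -> ((1 - 2) - 3) - 4, as we expect.
        total = values[0]
        for v in values[1:]:
            total -= v
        return total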
The complications of semantic specification, imposed by the massaging
of the grammar needed to fit recursive descent, make recursive descent
difficult if not impossible to use.  It is not practical for
parsing languages in general.
BOX: Are semantic specifications easily introduced into
BOX: recursive descent programs in general?
BOX:
BOX: What is left recursion?
17.3.2
LR(k) Parsing
For some context-free grammars, it is possible to render a parser which
always applies the right rule at the right time. For these parsers,
there is no need to maintain multiple possibilities during parsing.
The most general of these parsers are called LR(k) parsers[].
At any point in the input string, the parser chooses what rule to
apply, or to apply none, based on the characters seen up to this point
and by looking ahead to the right by k characters or tokens.  In
practice, k is zero or one.  These parsers have the advantage of
always parsing in linear time.
LR(k) parsing never works for ambiguous grammars, and hence would
forbid ICL expressions like:
A+B#C
where the grouping is determined only after syntax processing is done.
Also forbidden would be a ~datatype grammar which allows "3" to be
seen as either an integer or a real. Not only are ambiguous grammars
forbidden, many unambiguous grammars can't be handled by LR(k) either.
It is often useful to form new languages by taking a combination of
existing languages. Even if two given languages are each LR(k), (that
is, they can each be parsed by an LR(k) parser), their union, the
language that accepts notations in either language, may not be LR(k).
There is no easy way to find out if a language can be implemented by
LR(k), other than by feeding it to an LR(k) parser and seeing if it
complains upon receipt of the grammar. Modifications to the grammar
may be necessary, and that may complicate the specification of
semantics, leading to less clear language definitions.
The LR(k) parsing method automatically generates tables to represent
the grammar and to provide for the certain application of rules. These
tables may grow large, and to combat that, less powerful but smaller
variations of LR(k) have been developed. Felt to be the most
practical is the variation called LALR(k), which is used by the popular
YACC (Yet Another Compiler Compiler) from Bell Labs.
Some people feel that the limitations imposed on languages by LR(k)
parsing are insignificant with regards to ~programming ~languages.
This attitude may reflect an assumption that programming languages
can't evolve far from where they are now.
It may be that progress in the development of new programming
languages has been stunted by this assumption. We certainly can't
know for sure until unrestricted methods are employed for some time,
for the easy and flexible specification of programming languages that
take innovative leaps towards human convenience (and hence the
reduction of bugs).
17.3.3
Earley's Efficient Context-Free Parser
Earley's parser[] works for all context-free grammars as does our
general parser. Given LR(k) grammars, Earley's parser works in linear
time, like the LR(k) method itself. Furthermore, the ~union of two
(or any finite number of) LR(k) grammars, which itself might not be
LR(k), will still be parsed in linear time. In general, Earley's
parser can consume as much as N^3 time and memory given an input
string of length N. Our general parser may consume N^4, but handles
some general rewrite rules. As shown earlier (Section 14.4.6), our
parser can be modified to work in N^3 time if we limit ourselves to
only context-free grammars.
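For concreteness, a minimal Earley ~recognizer (no parse trees, none
of the usual refinements) can be sketched in Python; the grammar
encoding here is ours:

    def earley(tokens, grammar, start):
        # grammar: nonterminal -> list of right-hand sides (tuples of
        # symbols); a symbol is a nonterminal (a grammar key) or a
        # terminal token.  A state is (lhs, rhs, dot, origin).
        n = len(tokens)
        chart = [set() for _ in range(n + 1)]
        chart[0] = {(start, rhs, 0, 0) for rhs in grammar[start]}
        for i in range(n + 1):
            changed = True
            while changed:              # iterate chart[i] to a fixed point
                changed = False
                for (lhs, rhs, dot, origin) in list(chart[i]):
                    if dot < len(rhs) and rhs[dot] in grammar:    # predict
                        for r in grammar[rhs[dot]]:
                            st = (rhs[dot], r, 0, i)
                            if st not in chart[i]:
                                chart[i].add(st); changed = True
                    elif dot < len(rhs):                          # scan
                        if i < n and tokens[i] == rhs[dot]:
                            chart[i + 1].add((lhs, rhs, dot + 1, origin))
                    else:                                         # complete
                        for (l2, r2, d2, o2) in list(chart[origin]):
                            if d2 < len(r2) and r2[d2] == lhs:
                                st = (l2, r2, d2 + 1, o2)
                                if st not in chart[i]:
                                    chart[i].add(st); changed = True
        return any((start, rhs, len(rhs), 0) in chart[n]
                   for rhs in grammar[start])

With the grammar {'FORM': [('TERM', '+', 'FORM'), ('TERM',)],
'TERM': [('n',)]} and tokens ['n', '+', 'n'], earley returns True.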
Earley's parser supports ambiguity as ours does, but it does so in a
different way.  Where we maintain all possible ~rewritten ~strings
upon the input string, Earley instead maintains the set of possible
~rules that could conceivably be applying at each point in the input
string. This makes Earley's parser ~top-down, as it keeps track of
rules, whereas ours is ~bottom-up in that it keeps track of variations
on the input string.
We will employ our general parser, with its capability for general
rewrite rules, in processing semantics as well as syntax.
As we shall see in Part 7, semantics will proceed by generating
phrases in another language, phrases of length greater than one,
as though they were the results of general rewrite rules (e.g., figure 25.9).
Also, Section 25.5 will show that syntactical ambiguities in the
(first) language will give rise to ambiguous ~phrases in the second
language. Further ambiguities may be introduced within the second
language itself. To have ~both kinds of ambiguity working together,
the ambiguity must be associated with the ~phrases and not the possible
rules that may be applicable. Thus, to process ambiguity in semantics,
our general parser applies whereas Earley's might not.