CHAPTER 16
EFFICIENCY ENHANCEMENTS
This chapter introduces two independent ways to enhance the efficiency
of parsing. The first, which introduces the notion of ~deterministic
~rules, makes little change in our parser, except to introduce another
GIVE procedure called D_GIVE. Deterministic rules serve to ~save
~memory, which can be important for very long input strings. (However,
it is unclear how important deterministic rules are in practice: no
input ever given to ICL has caused a memory overflow, even in the
absence of deterministic rules.)
The second way to enhance efficiency reduces CPU consumption. It
requires more extensive modifications to the parser. The increased
speed is most significant for large grammars, that is, grammars having
many rules.
16.1
DETERMINISTIC RULES
We consider here a way to reduce the memory consumption from parsing
very long strings.
Program specification usually consists of a set of function
declarations, and declarations for new global variables and new
types. The bulk of the specification declares functions, like:
DEFINE Function1 ...
...
ENDDEFN
DEFINE Function2 ...
...
ENDDEFN
...
The function definitions may together form a long specification, but
individual function definitions may be of small or moderate size.
We notice that our syntax for function definition:
DEFINE ... ENDDEFN -> <DECLARATION>
has keywords (DEFINE and ENDDEFN) that clearly delineate the start
and finish of the function definition. We may feel that whenever this
rule applies, its application was surely intended. That is, we don't
expect the text under this want-phrase to be part of any other
construction.
We can assert the strength of this rule by making it ~deterministic.
When the rule applies, we'll have it actually erase its want-phrase,
and all the phrases over which it spans. Rather than have the rule
propose its give-phrase as one more alternative, we have it be the
~only alternative. This saves memory.
Figure 16.1(a) shows this rule applying naturally, where the
<DECLARATION> is an alternative. As a deterministic rule, it would
leave things as shown in figure 16.1(b). It would also transform
figure 16.1(c) into (d).
It may be alarming to see us throw away all the parsing activity
between the DEFINE and ENDDEFN. Don't worry: the ~meaning associated
with the resulting <DECLARATION> is still intact.
Some of the semantics associated with all the phrases we let go may
still be referenced within the semantics of the remaining
<DECLARATION>. All that erased parsing thus continues to benefit us,
via the semantics that it built up.
Deterministic rules save memory and possibly execution time. A
deterministic rule releases all the memory used for sub-phrases.
Imagine how much less memory is consumed in figure 16.1(d) relative
to (c). (A periodic process called garbage collection will actually
recycle the released memory for other uses (Part 8)).
ICL has this function definition rule be deterministic. It reduces
the memory consumption when parsing an input consisting
of many function definitions. As each one is seen, it collapses to
a single occurrence of <DECLARATION>. Thus, only the one function
inside of which we currently stand consumes significant memory at all.
16.1.1
Pitfalls
Consider the following program specification. It uses a variable
named ENDDEFN:
DEFINE FUNCTION1...
...
I:= I+1;
ENDDEFN:= I;
...
You may notice what looks like a complete function definition within
this text (formatted now slightly differently):
DEFINE FUNCTION1...
...
I:= I+1;
ENDDEFN
:= I;
...
If the function declaration rule is deterministic, then it will see
this inner function definition first and will leave our ambiguous
phrase as:
<DECLARATION>
:= I;
...
This phrase will never be part of a complete program specification.
The deterministic rule has thus misled us. If the rule were not
deterministic, it would still apply upon this inner function
definition, but it would leave intact all other interpretations.
Still available would be the whole function definition interpretation:
DEFINE FUNCTION1...
...
I:= I+1;
ENDDEFN:= I;
...
ENDDEFN
The final ENDDEFN would participate in a second application of this
function definition rule. Both applications would be available,
and only the second (whole) one would become part of an overall
interpretation.
These pitfalls are minor and rare, and can be removed easily by
choosing a different variable name, or by enclosing the ENDDEFN
variable in parentheses. (The parentheses would block the inner
application of the function definition rule). Some programming
languages solve this problem by forbidding the use of such "reserved
words".
BOX: When might you render a rule deterministic?
BOX:
BOX: What can go wrong?
16.1.2
Implementation
We implement deterministic rules by introducing a variation of GIVE,
called D_GIVE, short for ~deterministic ~GIVE:
DEFINE D_GIVE( P: PHRASE ):
C:= D;
GIVE(P);
D:= C;
ENDDEFN
D_GIVE is to be called instead of GIVE, inside deterministic rules.
We've introduced a global variable named D, beyond the global variable
C. D, like C, is a CHOICE_OF_PHRASES. D_GIVE ignores entirely the
value in C. It uses the value in D instead.
Think of D and C as queues. D is the high priority queue and C
is the low priority queue. Deterministic rules introduce their
phrases on the high-priority queue D instead of the low-priority
queue C, where all other rules introduce their phrases.
To make D have priority over C, we should impose the following:
If you read C, then read D instead, if D is non-empty.
D dominates C.
We read C inside general rewrite rules and while taking user input,
via:
LEFT:= C;
We need to change that assignment to be:
LEFT:= IF DEFINED(D) THEN D ELSE C FI ;
This means that we take D as our left neighbor if D is defined at all
(non-empty). Otherwise we do as before, taking C as our left neighbor.
We need another change. We must set D to NIL whenever we set C
to NIL. That is, our old act of:
LEFT:= C;
C:= NIL;
GIVE( [LEFT:LEFT POS:<pos>] );
is now updated to be:
LEFT:= IF DEFINED(D) THEN D ELSE C FI ;
C:= NIL;
D:= NIL;
GIVE( [LEFT:LEFT POS:<pos>] );
Figure 16.1(e) shows the values in C and D after GIVing the ENDDEFN.
Since our function definition rule is deterministic, it puts its
give-phrase onto D instead of C.
The value in D will be taken when we step rightward, when we execute:
LEFT:= IF DEFINED(D) THEN D ELSE C FI ;
Since D is non-empty, LEFT now points to what D pointed to. Only the
<DECLARATION> is in D. Everything that was in C is lost. Notice that
after stepping right once, we find only the <DECLARATION> to our left,
as shown in figure 16.1(f).
BOX: Which is the high-priority queue, C or D?
BOX:
BOX: What do we do with C if D is non-empty?
16.1.2.1
Do Deterministic Rules Preserve "Fully Ambiguated"?
We might argue that D is always fully ambiguated, like C was, before
introducing deterministic rules. If only ~one deterministic rule
applies, then D will be fully ambiguated:
D is initially NIL, which means it's initially fully
ambiguated. D_GIVE calls GIVE, first putting into C the
contents of D, which are fully ambiguated. The call to GIVE
leaves C still fully ambiguated, and that value is put back
into D.
D is fully ambiguated. However, C is no longer fully ambiguated,
because we haven't put ~everything onto C. Things in D are not in C.
However, when we perform:
LEFT:= IF DEFINED(D) THEN D ELSE C FI ;
LEFT will always be fully ambiguated. If LEFT takes on D, then we
already know that D is fully ambiguated. However, if D is empty,
D_GIVE was never called, and hence C got everything as before,
via all non-deterministic rules. Thus, C, and hence LEFT, is fully
ambiguated even if LEFT takes on C's value.
What happens if more than one deterministic rule applies? That is,
what if D_GIVE calls GIVE, and within that GIVE, D_GIVE gets called
again? This happens if the results of one deterministic rule
application participate in another deterministic rule application.
The second (nested) call to D_GIVE naturally ignores what's on C,
because it performs first:
C:= D;
However, what was on C at this moment? The outer (original) call to
D_GIVE was executing:
C:= D;
GIVE( P ); "we are waiting here now"
D:= C;
It was still at the GIVE, waiting for its completion. At that time,
the first deterministic rule's give-phrase was on C, as we still
hadn't yet moved that result into D (the last assignment).
Thus, among the losses in C is the give-phrase given by the first
deterministic rule! The first deterministic rule thus fails to be
represented.
Our argument that D is always fully ambiguated falls apart. We had
assumed that D_GIVE's call to GIVE meant a call to the old GIVE,
without the possibility of further deterministic rule applications.
We therefore conclude that D_GIVE, as it stands, doesn't preserve the
"fully ambiguated" property.
16.1.2.2
An Updated D_GIVE Function
We recover the "fully ambiguated" property for D_GIVE by making D_GIVE
behave exactly like GIVE when inside an (outer) call to D_GIVE.
We introduce a global variable named WORKING_ON_D:
VAR WORKING_ON_D = BOOL;
WORKING_ON_D:= FALSE;
D_GIVE will set this variable to TRUE. D_GIVE is after all
concentrating on D now.
If we enter D_GIVE while WORKING_ON_D is true,
we know that this call to D_GIVE arose indirectly from within an outer
call to D_GIVE. In this case, we act exactly like GIVE (in the THEN-
clause):
DEFINE D_GIVE( P:PHRASE ):
IF WORKING_ON_D THEN GIVE(P); "(D_GIVE acts exactly
like the old GIVE)"
ELSE C:= D;
HOLDING WORKING_ON_D:= TRUE;
DO
GIVE(P);
ENDHOLD
D:= C;
FI
ENDDEFN
This new definition assures that within D_GIVE, further calls to
D_GIVE behave exactly like GIVE.
Now we know that the call to GIVE within the ELSE-clause behaves
exactly like GIVE in the absence of deterministic rules. This assures
that D, which is set to the result of GIVE, is fully ambiguated.
BOX: If a deterministic rule applies and this causes another
BOX: deterministic rule to apply, what does D_GIVE perform
BOX: like for the second rule application?
16.1.3
Summary
Deterministic rules are helpful with very long inputs. They
correspond to the language designer's feelings that those
deterministic rule applications are certain events.
The only restrictions that arise concern avoiding use of a name like
ENDDEFN, which is a mild enough restriction. Even if the user uses
the name ENDDEFN, any very rare errors caused by the deterministic
rule will show up clearly in the syntax error message.
Deterministic rules make it impossible to prove the parser's
completeness, and rightly so: deterministic rules really do remove
some options. Even portions of the given input string are removed
by deterministic rules.
With heavy use of ICL since 1977, we have never encountered a file of
ICL text that was big enough to make significant the savings provided
by deterministic rules. However, we keep the deterministic rule for
robustness, for as yet unseen large files.
16.2
SPEEDING UP THE PARSER
We now embark on some refinements of the parser. They don't reduce
the theoretical polynomial upper bound, but they do reduce the
coefficients, particularly when given large grammars. The following
are independent optimizations.
16.2.1
Columns
Consider the type CHOICE_OF_PHRASES. Our figures have represented
that type as an array, as in figure 16.2(a). Part (b) shows how ICL
actually represents the type CHOICE_OF_PHRASES. Instead of a single
block, ICL maintains a list, shown vertically in the figure. That is,
the type declaration:
TYPE CHOICE_OF_PHRASES = { PHRASE } ;
makes CHOICE_OF_PHRASES be a linked list, and not an array.
This choice is appropriate for two reasons. First, we access a
CHOICE_OF_PHRASES only sequentially, via the FOR-quantifier. Arrays
are truly needed only if random access is important, which it isn't
here. Second, the right-append operator "$>", which appears in
GIVE's assignment:
C := C $> P ;
is defined only for lists, and not for arrays. The ~number of
elements in C increases.
If we replace this assignment by:
C := P <$ C ;
the new PHRASE block P will be inserted at the front of the list, as
shown in figure 16.2(c). This is a very inexpensive operation.
Although ICL implements both of these assignments equally quickly, it
is easier to understand things if we use this latest left-append "<$"
instead of the former right-append "$>".
Now that the actual representation of CHOICE_OF_PHRASES is shown, we
go one step further. We introduce a new field into the type PHRASE,
called the ALTernate:
TYPE PHRASE = [ LEFT: CHOICE_OF_PHRASES POS: INT
SEM: BASIC_PROCESS
ALT: PHRASE ] ;
We now make CHOICE_OF_PHRASES be simply a PHRASE:
TYPE CHOICE_OF_PHRASES = PHRASE ;
Figure 16.2(d) shows our old CHOICE_OF_PHRASES rendered this new way.
That list of PHRASE blocks, linked together via the ALT field, is
called a ~column, and it plays exactly the same role that
CHOICE_OF_PHRASES has played up to now. The introduction of a new
element onto such a column is as easy as in figure 16.2(c), and is
shown in figure 16.2(e).
This slight modification to our datatypes speeds things up a little.
We will extend this modification shortly.
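A minimal Python sketch of such a column follows. It is illustrative only: phrase records are reduced to the POS, LEFT, and ALT fields (SEM omitted), and the helper names are ours.

```python
# Illustrative sketch of a column: PHRASE blocks chained via their
# ALT field.  Field names follow the text; the SEM field is omitted.

class Phrase:
    def __init__(self, pos, left=None, alt=None):
        self.POS = pos      # part-of-speech (an integer in ICL)
        self.LEFT = left    # the column to our left
        self.ALT = alt      # next alternative phrase in this column

def prepend(phrase, column):
    """Left-append, P <$ C: O(1), no copying of the existing column."""
    phrase.ALT = column
    return phrase

def parts_of_speech(column):
    """Scan the column sequentially via ALT, as a FOR-quantifier would."""
    out = []
    while column is not None:
        out.append(column.POS)
        column = column.ALT
    return out
```

Prepending touches only the new block's ALT field, which is why introducing a new element is so inexpensive.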
BOX: Why is it quicker to introduce a new element onto a
BOX: list than onto an array?
16.2.2
Want-Trees
Consider the two rules:
<FORM> ::= <FORM> + <TERM>
and
<FORM> ::= <TERM>
(We are using the "::=" notation, which was introduced at the
beginning of Chapter 1).
These rules' want-phrases (the righthand sides) end in
the same thing, <TERM>. Written separately, these rules appear as:
P1:= P;
IF P1.POS = <TERM> THEN
FOR P2 $E P1.LEFT; WITH P2.POS = '+' ;
!!
FOR P3 $E P2.LEFT; WITH P3.POS = <FORM> ;
DO
give a <FORM>
END
FI
P1:= P;
IF P1.POS = <TERM> THEN
give a <FORM>
FI
The IF statement, which matches the want-phrases' rightmost element,
the <TERM>, appears twice, and it will be executed twice. We optimize
by combining those two rules so as to share one occurrence of the IF
statement:
P1:= P;
IF P1.POS = <TERM> THEN
" First rule... "
FOR P2 $E P1.LEFT; WITH P2.POS = '+' ;
!!
FOR P3 $E P2.LEFT; WITH P3.POS = <FORM> ;
DO
give a <FORM>
END
" Second rule... "
give a <FORM>
FI
The one IF statement serves both rules. The actions of the two rules
remain unchanged.
We write this factoring symbolically for the two rules as:
"give a <FORM>" <-- <FORM> <--- + <--- <TERM>
/
"give a <FORM>" <---/
The rightmost <TERM> is shared in common. We call such a combined
representation of want-phrases a ~want-tree. The root of the tree
appears at the far right.
Consider the following two rules:
?1 ::= <FORM> + <TERM>
?2 ::= <FUM> + <TERM>
(We've made up the part-of-speech <FUM> for this example). The two
rules share more in common than did our previous example. Looking at
the want-phrases from right-to-left, both the <TERM> and the "+" are
shared in common. We represent the sharing by writing the want-phrases
as:
do ?1 <--- <FORM> <--- + <--- <TERM>
/
do ?2 <--- <FUM> <-/
This sharing can appear in the program for the two rules:
P1:= P;
IF P1.POS = <TERM> THEN
FOR P2 $E P1.LEFT; WITH P2.POS = '+' ;
DO
" First rule... "
FOR P3 $E P2.LEFT; WITH P3.POS = <FORM> ;
DO
?1
END
" Second rule... "
FOR P3 $E P2.LEFT; WITH P3.POS = <FUM> ;
DO
?2
END
END
FI
Each setting of the variables P1 and P2 applies to both rules. Only
the variable P3 is treated differently by the two rules.
If we put all three rules together:
?1 ::= <FORM> + <TERM>
?2 ::= <FUM> + <TERM>
?3 ::= <TERM>
we can share portions of their want-phrases as follows:
do ?1 <--- <FORM> <--- + <--- <TERM>
/ /
do ?2 <--- <FUM> <-/ /
/
do ?3 <---/
Efficiency is gained because rules' want-phrases can be shared.
In fact, ~two occurrences of the ~same rule wind up costing
only as much as one occurrence, as far as phrase matching is concerned.
That is, the two rules:
?1 ::= <FORM> + <TERM>
?2 ::= <FORM> + <TERM>
will look like:
do ?1 <--- <FORM> <--- + <--- <TERM>
/
do ?2 <-/
These two rules will behave like one except that two give-
phrases (?1 and ?2) will be generated.
If the (context-free) give-phrases are identical, then the second of
the two calls to GIVE will find a duplicate. Hence, only one of the
applications of these two identical rules will cause the grammar to be
called. Two identical rules therefore cost only the price of one,
except for the tiny cost within GIVE, as it finds the duplicate.
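The right-to-left factoring can be sketched as a trie built over reversed want-phrases. This Python sketch is illustrative only: a real implementation compiles the tree into matching code like that shown above rather than walking a dictionary, and the "$actions" key is our own convention for holding the rules' actions.

```python
# Illustrative sketch: a want-tree is a trie over want-phrases read
# right-to-left.  Leaves carry the rule actions ("?1", "?2", ...).

def build_want_tree(rules):
    """rules: list of (want_phrase, action) pairs."""
    tree = {}
    for want, action in rules:
        node = tree
        for symbol in reversed(want):       # rightmost element first
            node = node.setdefault(symbol, {})
        node.setdefault("$actions", []).append(action)
    return tree

rules = [
    (["<FORM>", "+", "<TERM>"], "?1"),
    (["<FUM>",  "+", "<TERM>"], "?2"),
    (["<TERM>"],                "?3"),
]
tree = build_want_tree(rules)
```

All three rules share the single <TERM> node at the root, and the two "+" rules additionally share the "+" node, exactly mirroring the diagram above.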
BOX: What savings are afforded by want-trees?
BOX:
BOX: With our optimizations, what is the difference in cost
BOX: between one rule vs. two identical occurrences of that
BOX: rule?
16.2.3
Scanning A Column Only Once For Several Rules
Consider the want-tree:
do ?1 <--- <FORM> <--- + <--- <TERM>
/
do ?2 <--- <FUM> <--/
To the left of the "+", a <FORM> and a <FUM> are desired. As just
shown, after the right-to-left matching of the <TERM> and the "+",
the two rules separate. Each sets the variable P3 independently:
" First rule... "
FOR P3 $E P2.LEFT; WITH P3.POS = <FORM> ;
DO
?1
END
" Second rule... "
FOR P3 $E P2.LEFT; WITH P3.POS = <FUM> ;
DO
?2
END
Here we go thru two scans of P2.LEFT, examining the same column
(CHOICE_OF_PHRASES) once for each rule.
It is more efficient to scan that column only once. We scan P2.LEFT
only once in the following equivalent rendition:
FOR P3 $E P2.LEFT;
DO
" First rule... "
IF P3.POS = <FORM> THEN ?1 FI
" Second rule... "
IF P3.POS = <FUM> THEN ?2 FI
END
In general, a point in a want-tree may have more than two alternatives.
There may be K, as in:
?1 <--- <POS(1)> <----------
/
?2 <--- <POS(2)> <-------/
/
... ...
/
?K <--- <POS(K)> <---/
This is programmed efficiently by:
FOR P3 $E P2.LEFT;
DO
IF P3.POS = <POS(1)> THEN ?1 FI
IF P3.POS = <POS(2)> THEN ?2 FI
...
IF P3.POS = <POS(K)> THEN ?K FI
END
(In the notation <POS(1)>, the parenthesized number denotes a
subscript, indexing the K parts-of-speech.)
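This one-scan matching can be sketched in Python. The representation is ours, for illustration: the column is a list of (POS, phrase) pairs and the alternatives at this level of the want-tree are (POS, action) pairs.

```python
# Illustrative sketch: one scan of the column serves all K
# alternatives at this level of the want-tree.

def match_level(column, alternatives):
    """column: list of (pos, phrase); alternatives: list of
    (wanted_pos, action).  Returns the (action, phrase) pairs fired."""
    fired = []
    for phrase_pos, phrase in column:            # the single scan (N steps)
        for wanted_pos, action in alternatives:  # the K IF-tests
            if phrase_pos == wanted_pos:
                fired.append((action, phrase))
    return fired
```

Each column element is still tested against each of the K parts-of-speech; the next section reduces that N*K cost.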
Each of the ?1, ?2, ... ?K may be give actions, or further want-trees
to the left. For example, our ?K might itself be a want-tree, as
shown in:
?1 <--- <POS(1)> <---------
/
?2 <--- <POS(2)> <------/
... ...
/
?K1 <--- <POS(K1)> <--- <POS(K)> <---/
/
?K2 <--- <POS(K2)> <-/
... ...
?KJ <--- <POS(KJ)> </
The ?K (that which is to the left of <POS(K)>) is another want-tree.
Naturally, this is programmed as before via:
FOR P3 $E P2.LEFT;
DO
IF P3.POS = <POS(1)> THEN ?1 FI
IF P3.POS = <POS(2)> THEN ?2 FI
...
IF P3.POS = <POS(K)> THEN
" ?K ... "
FOR P4 $E P3.LEFT;
DO
IF P4.POS = <POS(K1)> THEN ?K1 FI
...
IF P4.POS = <POS(KJ)> THEN ?KJ FI
END
FI
END
16.2.4
Ordering In Columns And Want-Trees
Let's concentrate on one level in the want-tree:
FOR P3 $E P2.LEFT;
DO
IF P3.POS = <POS(1)> THEN ?1 FI
IF P3.POS = <POS(2)> THEN ?2 FI
...
IF P3.POS = <POS(K)> THEN ?K FI
END
If the FOR-quantifier generates N iterations, then in total, we perform
N*K comparisons. We can reduce this multiplication down to an addition
so as to arrive at only N+K comparisons.
Let's rephrase this program fragment as:
FOR P3 taking on elements in P2.LEFT
DO
IF P3.POS = <POS(1)> THEN ?1 FI
IF P3.POS = <POS(2)> THEN ?2 FI
...
IF P3.POS = <POS(K)> THEN ?K FI
END
Each part-of-speech is an integer. Let's put the IF-statements in
order, so that <POS(1)> is the smallest of the parts-of-speech, and
<POS(K)> is the largest.
Suppose that the column (CHOICE_OF_PHRASES) P2.LEFT likewise has its
elements appearing in order from the smallest to the largest part-of-
speech.
We achieve the same behavior more efficiently by writing the program
as:
FOR P3 taking on elements in P2.LEFT
~while P3.POS =< <POS(1)>
DO
IF P3.POS = <POS(1)> THEN ?1 FI
END
FOR P3 taking on ~subsequent elements in P2.LEFT
~while P3.POS =< <POS(2)>
DO
IF P3.POS = <POS(2)> THEN ?2 FI
END
...
FOR P3 taking on ~subsequent elements in P2.LEFT
~while P3.POS =< <POS(K)>
DO
IF P3.POS = <POS(K)> THEN ?K FI
END
Here, we scan the column P2.LEFT only far enough to see <POS(1)>.
For each of those iterations, we perform the IF-statement for <POS(1)>.
If <POS(1)> exists at all in P2.LEFT, then we will see it in at least
one of these iterations.
Having considered <POS(1)>, we continue scanning ~where ~we ~just
~left ~off. That is, since <POS(2)> is greater than <POS(1)>, and
because the column P2.LEFT is ordered increasingly, any occurrence of
<POS(2)> in the column appears ~after all occurrences of <POS(1)>.
Therefore, we don't have to restart scanning at the beginning of
P2.LEFT in our effort to find <POS(2)>.
Since we always continue scanning from where we had left off, we
make a total of one complete scan thru the column. On each of those
N iterations, we execute only one IF-statement.
This overall program has length K. Therefore, one scan thru the
column of length N along with the execution of this program of length
K costs only N+K time. Depending on how the comparisons are
implemented, we will perform a total of N+K comparisons, or twice that
at most. This is better than N*K, particularly for large grammars,
where K may be large.
This behavior is like that of a zipper. One side of the zipper is the
column, and the other is the program. We travel only forward in both.
Figure 16.3 illustrates this, showing each comparison by a line
connecting the two columns together. Only at the values 2 and 70 is
a match found. However, we've scanned each column only once in total.
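The zipper scan can be sketched in Python, assuming both the column and the wanted parts-of-speech are sorted increasingly; the representation as lists of pairs is ours.

```python
# Illustrative sketch of the "zipper" scan.  Both sides are sorted
# increasingly by part-of-speech, so each side is traversed forward
# exactly once: about N+K comparisons in all.

def zipper_match(column, alternatives):
    """column: sorted list of (pos, phrase); alternatives: sorted
    list of (wanted_pos, action).  Returns (action, phrase) pairs."""
    fired = []
    i = 0
    for wanted_pos, action in alternatives:    # advance the program side
        # advance the column side past smaller parts-of-speech
        while i < len(column) and column[i][0] < wanted_pos:
            i += 1
        # fire the rule for every phrase matching this part-of-speech
        k = i
        while k < len(column) and column[k][0] == wanted_pos:
            fired.append((action, column[k][1]))
            k += 1
    return fired
```

With the column holding parts-of-speech 2, 5, and 70, and the program wanting 2, 9, and 70, matches fire only at 2 and 70, as in figure 16.3.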
Since a cost of N+K is less than that of N*K, let's in fact always
compile grammars as just shown, and let's also insist that any column
(CHOICE_OF_PHRASES) be ordered increasingly. Of course, GIVE must be
changed so as not to always append its new element at the front.
GIVE must search the column, as it does anyway in its effort to
suppress duplicate phrases. GIVE should insert the new element
somewhere in the column so as to preserve the new ordering requirement.
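Such an order-preserving GIVE might be sketched as follows. The column is a sorted Python list, a phrase is reduced to a (POS, LEFT) pair, and the function name is ours.

```python
# Illustrative sketch: GIVE inserts into a column kept sorted by
# part-of-speech, suppressing duplicates in the same search that
# finds the insertion point.

def give_ordered(column, phrase):
    pos, left = phrase
    for i, (p, l) in enumerate(column):
        if (p, l) == (pos, left):
            return column                  # duplicate phrase: suppress it
        if p > pos:                        # passed pos: insert here
            return column[:i] + [phrase] + column[i:]
    return column + [phrase]               # largest pos: append at the end
```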
BOX: What savings are offered by ordering both columns and
BOX: want-trees?
16.2.5
More Factoring
Another optimization involves collecting together all PHRASE blocks
having the same part-of-speech. Figure 16.4(a) shows a column having
three <FORM> PHRASE blocks. The parts-of-speech are the same, but
the LEFT fields differ. The number of different LEFT fields can be
large, possibly as large as the number of characters in the input
string (worst case). If a rule's want-phrase is looking for an <ATOM>,
it will still have to see three <FORM>s before arriving at the <ATOM>.
Figure 16.4(b) shows an equivalent representation. Each part-of-speech
is represented only once, and each has a sub-list containing all the
different LEFT values for that part-of-speech. In each PHRASE block,
we have replaced the two fields SEM and LEFT with a single field
SEM_AND_LEFT which references a list of SEM/LEFT pairs.
The part-of-speech (POS) is now represented separately from the
possible SEM/LEFT pairs. This figure is built in terms of the new
types:
TYPE NEW_PHRASE = [ SEM_AND_LEFT: SEM_LEFT POS: INT
ALT: NEW_PHRASE ] ;
SEM_LEFT = [ LEFT: NEW_PHRASE SEM: BASIC_PROCESS
ALT1: SEM_LEFT ] ;
The part-of-speech need be compared only once, and all of its SEM/LEFT
pairs can be accessed with no further compares.
This arrangement also speeds up GIVE's search for duplicate phrases.
Given the new part-of-speech, GIVE merely searches a list of length
P (the number of possible parts-of-speech) to find a matching part-of-
speech, and then searches a list of length N (the number of input
characters) in order to find a LEFT field that matches the given LEFT
field. This cost is N+P, as opposed to our old cost of N*P (the total
length of the list in figure 16.4(a)).
Also, if someone wants to determine whether a given part-of-speech
exists in a column, such a determination costs only a search thru
a list of length P, which is independent of N.
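The factored representation can be sketched in Python, with each column entry holding one part-of-speech and the list of its LEFT values (SEM omitted for brevity; the representation and function names are ours).

```python
# Illustrative sketch of the factored column: each part-of-speech
# appears once, carrying the list of its LEFT values.

def factor_column(phrases):
    """phrases: list of (pos, left) pairs, possibly with repeated
    parts-of-speech.  Returns [(pos, [left, ...]), ...]."""
    factored = []
    for pos, left in phrases:
        for entry in factored:
            if entry[0] == pos:
                entry[1].append(left)      # same POS: extend its pair list
                break
        else:
            factored.append((pos, [left])) # new POS: one new entry
    return factored

def has_pos(factored, pos):
    """Membership now scans only P entries, independent of N."""
    return any(p == pos for p, _ in factored)
```

Three <FORM> phrases collapse into one <FORM> entry with three LEFT values, so a rule wanting an <ATOM> skips all of them with a single comparison.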
BOX: What advantages are provided by factoring out the
BOX: part-of-speech from the SEM and LEFT fields?