CHAPTER 14

ACHIEVING TERMINATION AND AN UPPER BOUND FOR ALL CONTEXT-FREE GRAMMARS

Consider two observations about the parser.

14.1 Infinite Execution

If the grammar contains a rule like:

    <A> -> <A>

or a set of rules like:

    <A> -> <B>
    <B> -> <A>

then our parser will never finish execution. Whenever you pass an <A> to GIVE, GIVE will call the grammar. The grammar will see that <A> in the IF-statement for the rule:

    <A> -> <B>

The matched rule then calls GIVE with <B>. GIVE calls the grammar again, and this time the rule <B> -> <A> applies, calling GIVE once more with an <A>. GIVE calls the grammar and the grammar calls GIVE forever. Neither ever has a chance to return. GIVE is called recursively with the sequence of parts-of-speech:

    <A>, <B>, <A>, <B>, <A>, ...

One could imagine banning such rules. However, they come up in the domain of datatypes, e.g.,

    int -> real
    real -> int

This particular pair of rules might be objectionable to some people, particularly the rule:

    real -> int

where the fractional part of the real is lost. There are, however, many such pairs of rules that are very useful (e.g., Section 23.5.3.2). Upward compatibility is often achieved by introducing the pair of rules:

    old_type -> new_type
    new_type -> old_type

The "old_type" is used presently in the program, but newer programs need to use the "new_type" instead. This pair of rules allows that to occur without changing any of the old system. Any place the old_type is required, you can supply either the old_type or the new_type. Similarly, any place the new_type is required, you can once again supply either the old_type or the new_type. More about this later.

14.2 Exponential Execution

Even if we were to ban such coercion rules that lead to infinite parser execution, our parser could still consume exponential time and memory as a function of the length of the input string. Consider figure 14.1. It shows three occurrences where there are two identical PHRASE blocks in a CHOICE_OF_PHRASES.
The PHRASE blocks are identical in part-of-speech and in the contents of their LEFT fields. Suppose you have the rule:

    <B> <A> -> <X>

This rule will see ~four occurrences of its want-phrase emanating from the rightmost CHOICE_OF_PHRASES. (There are two A's, and to the left of each, there are two B's). Figure 14.1(b) shows the result of those four matches. There are now four <X> phrases. Now suppose we have another rule:

    <C> <X> -> <Y>

This rule's want-phrase has ~eight matches. Each of the two <C>s can participate with four <X>s. Figure 14.1(c) shows the eight phrases produced by this rule.

In general, a sequence of N duplicates will give rise to 2^N PHRASE blocks. (We've just seen the cases for N=2 and N=3).

BOX: How can our parser presently get into an infinite loop?
BOX:
BOX: How can exponential time be consumed?

14.3 Duplicate PHRASE Blocks Cause Only Redundancies

When you call GIVE with a PHRASE block, GIVE behaves in a way that depends only on that PHRASE block alone. GIVE reads no other variables. (Technically, it does read C in the operation:

    C:= C $> P ;

but the contents of C in no way affect GIVE's behavior beyond this one statement). Since GIVE depends only on the PHRASE block, you can predict GIVE's behavior. If you pass to GIVE that same block, GIVE will behave exactly as before. GIVE behaves as before when given a PHRASE block whose LEFT and POS fields match those of the original.

Suppose we find on C a PHRASE block whose LEFT and POS match those of P, the PHRASE block passed to GIVE. GIVE will behave ~now exactly as it did before, when GIVE received the first one. GIVE's execution, including its calls to the grammar, will now generate exactly the same phrases as it did before. If we have GIVE suppress its execution when given a duplicate PHRASE block, there will be no shortage of phrases. Any phrases dependent on the PHRASE block were already generated, upon first encounter.
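The doubling just described, which duplicate suppression will prevent, can be tallied concretely. Here is a minimal Python sketch; the block names B1, A1, C1, and so on are hypothetical stand-ins for the duplicated PHRASE blocks of figure 14.1, not part of the parser itself:

```python
# Figure 14.1's situation: two identical PHRASE blocks at each level.
Bs = ["B1", "B2"]                  # two duplicate <B> blocks
As = ["A1", "A2"]                  # two duplicate <A> blocks to their right

# Rule <B> <A> -> <X>: one <X> phrase per (B, A) pairing.
Xs = [(b, a) for b in Bs for a in As]
print(len(Xs))                     # 4 <X> phrases

# Rule <C> <X> -> <Y>: two duplicate <C>s pair with every <X>.
Cs = ["C1", "C2"]
Ys = [(c, x) for c in Cs for x in Xs]
print(len(Ys))                     # 8 <Y> phrases, i.e., 2^3
```

Each additional layer of duplicates multiplies the phrase count by two, which is the 2^N growth claimed in section 14.2.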
We redefine GIVE now to suppress duplicates:

    DEFINE  GIVE( P: PHRASE ):
        BEGIN  VAR  P1 = PHRASE;
            IF  FOR P1 $E C;  NEVER  (P1.POS = P.POS) & (P1.LEFT = P.LEFT)
            THEN  C:= C $> P ;
                  the_grammar(P);
            FI
        END
    ENDDEFN

Paraphrased, this reads as:

    DEFINE  GIVE( P: PHRASE ):
        IF    for all P1 in C, ~never is there a P1 which matches P
              (where P1 and P have identical POS's and LEFT's)
        THEN  do as before
    ENDDEFN

GIVE now ignores a given PHRASE block if C already contains an identical one.

We will return to this modification of GIVE when we consider semantics, in Chapter 15. (The old and the new identical PHRASE blocks might differ in semantics. Presently, we keep only the semantics associated with the ~first occurrence of a duplicate PHRASE block, as all subsequent duplicates are ignored entirely).

BOX: Why does GIVE behave identically the second time,
BOX: when given a duplicate PHRASE block?

14.4 Termination And Polynomial Upper Bound For Context-Free Grammars

With this modification to GIVE, our parser will now finish execution in finite time for all context-free grammars, including those containing cyclic coercion rules, like:

    <A> -> <B>
    <B> -> <A>

In fact, not only can we prove that the parser terminates, but also that it consumes at most polynomial time and memory as a function of N, the length of the input string. What we show here is only ~worst ~case behavior. In practice, things rarely behave in the worst-case manner. (There may be ~local areas of worst-case behavior, but the locality keeps the effective N tolerably small).

14.4.1 Any CHOICE_OF_PHRASES Has At Most N*P Elements

The following is an essential observation that is true only for context-free grammars:

All calls to GIVE from the grammar program pass in a PHRASE block whose LEFT field is taken from an ~existing PHRASE block. (The LEFT field is taken from the leftmost PHRASE block in the want-phrase). Thus, the context-free grammar never generates a new CHOICE_OF_PHRASES. All LEFT values are taken from existing PHRASE blocks.
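The duplicate-suppressing GIVE can be transcribed into a short Python sketch. The names give, grammar, and the two-field Phrase class are hypothetical simplifications of the parser's actual data structures, kept just small enough to show why the cyclic coercion grammar of section 14.1 now terminates:

```python
class Phrase:
    """A PHRASE block, reduced to the two fields GIVE examines."""
    def __init__(self, pos, left):
        self.pos = pos      # part-of-speech
        self.left = left    # the CHOICE_OF_PHRASES to this phrase's left

C = []                      # the current CHOICE_OF_PHRASES

def give(p, grammar):
    """Suppress p if C already holds a duplicate (same POS, same LEFT);
    otherwise append p onto C and call the grammar, as before."""
    if any(p1.pos == p.pos and p1.left is p.left for p1 in C):
        return              # duplicate: its phrases were already generated
    C.append(p)
    grammar(p)

# The cyclic coercion grammar  <A> -> <B>  and  <B> -> <A>.  Without
# duplicate suppression, give() and grammar() would recurse forever.
def grammar(p):
    if p.pos == "A":
        give(Phrase("B", p.left), grammar)
    elif p.pos == "B":
        give(Phrase("A", p.left), grammar)

left = []                   # a hypothetical (empty) left context
give(Phrase("A", left), grammar)
print([p.pos for p in C])   # ['A', 'B'] -- execution terminates
```

The second <A> is recognized as a duplicate of the first (identical POS and LEFT), so the recursion bottoms out after producing each part-of-speech once.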
There is only one place where a LEFT field is ~not taken from an existing PHRASE block. While taking user input, we GIVE a PHRASE block whose LEFT is the CHOICE_OF_PHRASES built up for the previously input character. That left CHOICE_OF_PHRASES is new: it does not yet come from some PHRASE block's LEFT field. Since the taking of user input creates N CHOICE_OF_PHRASES when given N characters, there are N CHOICE_OF_PHRASES total. (None are generated by context-free grammars). Thus, given any PHRASE block, its LEFT field can take on at most N distinct values.

Consider the N'th CHOICE_OF_PHRASES, which contains the N'th input character. How many PHRASE blocks can it contain? All its PHRASE blocks can have at most N distinct LEFT fields. However, two PHRASE blocks with identical LEFTs can both exist if they have ~different parts-of-speech. We suppress PHRASE blocks only if they are identical in both the LEFT and POS fields. If the grammar has only P parts-of-speech throughout its rules, then we know that the maximum number of distinct PHRASE blocks in the CHOICE_OF_PHRASES is

    N*P

That is, each PHRASE block is a pair (LEFT,POS), where there are N possible LEFTs and P possible parts-of-speech.

BOX: Why does a CHOICE_OF_PHRASES have at most N*P PHRASE
BOX: blocks for context-free grammars?
BOX:
BOX: Why is this still true even if the grammar includes
BOX: the pair of rules:
BOX:
BOX:     <A> -> <B>
BOX: and
BOX:     <B> -> <A> ?

14.4.2 How Many Times Can The Grammar Be Called Upon Taking In The N'th Character?

Given the N*P upper bound on the number of PHRASEs in a CHOICE_OF_PHRASES, we now discover how many times the grammar can be called. GIVE calls the grammar only when given a PHRASE block that has no duplicate in C. ~There ~are ~at ~most ~N*P ~such ~PHRASE ~blocks. Suppose we discover that the grammar is called N*P+1 times. Since each call to the grammar corresponds to the introduction of another element onto C, C must now contain N*P+1 PHRASE blocks.
This is impossible, since there can be at most N*P elements on C. Thus, upon taking in the N'th character, the grammar will be called at most N*P times.

BOX: Why can the grammar be called only N*P times, while
BOX: processing the N'th character?

14.4.3 How Many Times Can The Grammar Call GIVE?

For each call to the grammar, the grammar can itself call GIVE only a certain number of times. The grammar calls GIVE from within the big FOR-quantifiers that represent the rules' want-phrases. How many iterations can those FOR-quantifiers generate?

Consider a rule whose want-phrase is of length L. That rule is represented by (the parenthesized indices, e.g., the "l" in P(l), denote subscripts):

    P(l):= P;
    IF  P(l).POS = <POS(l)>  THEN
        FOR P(l-1) $E P(l).LEFT;    WITH ... ;  !!
        FOR P(l-2) $E P(l-1).LEFT;  WITH ... ;  !!
        ...                                     !!
        FOR P(1)   $E P(2).LEFT;    WITH ... ;
        DO  call GIVE  END
    FI

Let's ignore the WITH-clauses. They serve only to reduce the number of iterations. How many iterations can be caused by one FOR-quantifier, e.g.:

    FOR P(i-1) $E P(i).LEFT;

P(i).LEFT is a CHOICE_OF_PHRASES. Any CHOICE_OF_PHRASES can contain at most N*P elements, and so this quantifier causes at most N*P iterations. (P(i).LEFT is a CHOICE_OF_PHRASES to our left, and so if it contains the K'th character, then K < N. This quantifier can actually cause only K*P iterations, which is less than our pessimistic N*P).

For a rule whose want-phrase has length L, there are L-1 quantifiers nested within one another. Since each causes at most N*P iterations, the whole grand quantifier can cause (N*P)^(L-1) iterations. Let L be the length of the ~longest want-phrase. Each rule thus causes ~at ~most (N*P)^(L-1) iterations. If there are R rules, then one call to the grammar can cause

    R * (N*P)^(L-1)

iterations.
Each iteration may call GIVE once. We now know the following upon taking in the N'th character:

    1)  The grammar is called at most N*P times.
    2)  Each call to the grammar can generate at most R*(N*P)^(L-1)
        iterations, or calls to GIVE.

We also know:

    a)  Each call to the grammar, ~excluding ~its ~calls ~to ~GIVE, can
        consume at most this amount of time (by counting the number of
        iterations).
    b)  Each call to the grammar can issue at most this many calls to
        GIVE (one per iteration).

Since the grammar itself can be called at most N*P times, the ~total time taken in all calls to the grammar, excluding its calls to GIVE, is at most:

    (N*P) * R*(N*P)^(L-1)

This is also the maximum number of times total that GIVE can be called from all executions of the grammar. Simplified, this number is:

    R * (N*P)^L

BOX: How many iterations, or calls to GIVE, can occur upon
BOX: calling the grammar only once?
BOX:
BOX: There is an exponential here, but that exponent is
BOX: ~not N. What is the exponent?

14.4.4 How Much Time Is Spent In GIVE Excluding Its Call To The Grammar?

We know how many times GIVE can be called in total. How much time is spent within each call to GIVE? GIVE must look thru C, which can have at most N*P elements in it. The total time taken inside the GIVE program, for all calls to GIVE, is thus N*P times the total number of calls to GIVE:

    R * (N*P)^(L+1)

At most this much time is spent within the GIVE program. As noted earlier, the total time taken inside the grammar program is at most:

    R * (N*P)^L

The dominating factor between these two, GIVE and the grammar, is the time taken inside GIVE:

    (N*P)^(L+1)

(We omit the R because we will understand that the actual time spent is ~proportional to (N*P)^(L+1)).

14.4.5 Time Taken To Process All N Characters

We now know that the time taken to process the N'th character is at most:

    (N*P)^(L+1)

How long does it take to process all characters, one thru N?
Each one takes this much time, and so the processing of N characters takes N times that amount:

    N^(L+2)

(We've dropped the P because P, the number of parts-of-speech, is constant no matter what N is. We'll make it part of the proportionality factor). This number represents the most time it could take to parse N characters, for any context-free grammar. L is the length of the grammar's ~longest want-phrase.

BOX: Suppose you ~double the number of parts-of-speech
BOX: in the grammar, how will ~worst-case parsing time
BOX: be affected?

14.4.6 What The Upper Bound Means

We've seen that the processing of N input characters is proportional to:

    N^(L+2)

This is a polynomial upper bound as a function of N. It applies not only to execution time, but to memory as well. The creation of each block consumes some time, say T. Given only N^(L+2) execution time, the greatest number of memory blocks that can be created is:

    N^(L+2) / T

T is independent of N, and so the total amount of memory used is proportional to N^(L+2).

If you wish, you can reform the grammar, automatically, so that the longest want-phrase has length L=2. Given a rule whose want-phrase is longer than two, say of length L (the parenthesized indices denote subscripts):

    <POS(1)> <POS(2)> ... <POS(L)>  ->  <GIVE_POS>

you can replace it by ~two rules, which involve a brand new part-of-speech, say <X>:

    <POS(1)> <POS(2)>  ->  <X>
    <X> <POS(3)> <POS(4)> ... <POS(L)>  ->  <GIVE_POS>

The rule whose want-phrase contains <X> now has length L-1. This process can be repeated in finite time for all rules. In the resulting grammar, the longest want-phrase, L, has length 2. A grammar rendered this way is said to be in ~Chomsky ~Normal ~Form. With such a grammar, our parsing time is N^4.

In practice, however, rules with long want-phrases typically involve some terminal characters. The FOR-quantifier for a terminal character causes only one iteration.
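The rule-splitting step just described can be sketched in a few lines of Python. The representation of a rule as a (want-phrase, give-part-of-speech) pair, and the fresh names <X1>, <X2>, ..., are hypothetical stand-ins for the grammar's actual encoding:

```python
def to_length_two(rules):
    """Repeatedly split any rule whose want-phrase is longer than two,
    introducing a brand-new part-of-speech each time, until every
    want-phrase has length at most two (Chomsky-Normal-Form style)."""
    out, fresh = [], 0
    work = list(rules)             # each rule: (want_phrase, give_pos)
    while work:
        want, give = work.pop()
        if len(want) <= 2:
            out.append((want, give))
        else:
            fresh += 1
            x = f"<X{fresh}>"                    # brand-new part-of-speech
            out.append((want[:2], x))            # <POS1> <POS2> -> <X>
            work.append(([x] + want[2:], give))  # <X> <POS3> ... -> give
    return out

rules = [(["<A>", "<B>", "<C>", "<D>"], "<S>")]
for want, give in to_length_two(rules):
    print(" ".join(want), "->", give)
# Every resulting want-phrase now has length at most two.
```

A length-4 rule becomes three length-2 rules, matching the text's claim that each split shortens the remaining want-phrase by one and therefore finishes in finite time.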
While using this parser in a production environment for over ten years, we've never reduced grammars to Chomsky Normal Form, and we've had acceptable performance throughout.

Still, if you want to introduce more efficiency into this parser, you can reduce the N^4 to N^3, if you forbid general rewrite rules. Recall that one of the factors of N came from within GIVE, as it searched C for duplicate PHRASE blocks. We can remove that factor of N by constructing a one-dimensional array to go along with C. The array has N entries. Each entry corresponds to one of the N possible values that can appear in a LEFT field. Each entry in the array contains a list of PHRASE blocks, all of which have the ~same ~value in LEFT. There are at most P elements in such an entry. Thus, given a new PHRASE block, we use its LEFT to index into the array, at which point the chosen entry contains at most P list elements. Those P must be searched, but that time is independent of N. The factor of N has disappeared by indexing into the array of length N and searching a list of length P, instead of searching a single list of length N*P.

We've experimented with long strings, contrived ones to be sure, that would cause the parser to behave in the worst-case manner. The N^4 part of the polynomial had insignificant effect on execution time, because its coefficient is very small. When we observed long execution times, we ~doubled the length of the input string, and yet observed not 16 times as much execution time (as would be the case if the N^4 were significant), but only 7 times as much execution time. Even the N^3 component wasn't completely dominating the execution time. In other words, the N^3 component made things slow before the N^4 component ever had a chance to contribute to the slow behavior. These experiments used ICL's syntax grammar, which is not in Chomsky Normal Form.

Languages designed to maximize human convenience benefit by what is delivered for nonlinear costs.
Allowing the programmer to specify

    A + B # C

without parentheses is a nonlinearity. However, people rarely make long expressions that involve nonlinearities. For example, this nonlinearity may be contained in an assignment statement like:

    P:= A + B # C ;

N is small for the "A+B#C".

Another, richer example of valuable non-linearity concerns the expression:

    X#Y \BOX U#V \PAINTED GREEN \ROTATED 90 \DEGREES  \UNION  B \BLOATED 10

Here, we are specifying a box that is painted green and rotated by 90 degrees, along with (\UNION) the box B bloated by 10 units. The grouping parentheses can be known only after syntax processing, when the types of X, Y, U, V, and B become known. This example takes advantage of ICL's syntax, which allows functions to be called with an ~infix notation (\function_name). ICL's infix notation provides for the very readable expression. If X, Y, U, and V are REALs, and B is a box, then the grouping becomes:

    ((((X#Y) \BOX (U#V)) \PAINTED GREEN) \ROTATED (90 \DEGREES))
    \UNION  (B \BLOATED 10)

In the absence of the infix function-calling notation, this expression would have to be specified in the usual, prefix form:

    UNION( ROTATED( PAINTED( BOX( #(X,Y) , #(U,V) ), GREEN ),
                    DEGREES(90) ) ,
           BLOATED( B , 10 ) )

People tend to group their thoughts in a hierarchical fashion, e.g., like sentences in a paragraph, paragraphs in a section, sections in a chapter, etc. At each level in the hierarchy, N is usually small, perhaps about 7. (N is rarely, if ever, more than 50). Thus, languages designed specifically for human convenience rarely involve large N at any one place. People often cannot understand a very long sentence, and the speaker may be asked to rephrase it. For a computer to ask for rephrasing would be very natural and acceptable, if the language is convenient for humans. In contrast, if the person had to convolute his or her thinking so as to talk in an awkward language, he or she would be understandably frustrated at the computer's rejection.
Language designed for convenient human use must often involve non-linearities, but the human can express his or her thoughts so briefly that the non-linearities are never significant.

When computers talk among themselves, their common language need not be convenient for people. One can make very simple grammars for the computers to use. Computers can generate and understand long expressions in such simple grammars in linear time. Such a language might be postfix Polish, where

    1 2 +

represents what we would write as 1+2. Similarly, 1+2*3 would be written as:

    1 2 3 * +

These kinds of languages will cause our parser to consume only linear time.

BOX: Why is nonlinearity tolerable when dealing with human
BOX: languages?
BOX:
BOX: Programming is a human activity. Should new
BOX: programming languages support briefer and clearer
BOX: expression?
BOX:
BOX: Given a programming language that is more human
BOX: oriented than today's languages, why can we then
BOX: tolerate nonlinearities?
BOX:
BOX: Which of the three forms of the long programming
BOX: expression is easiest for you to read?
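The postfix Polish language described above needs only a single left-to-right pass and a stack, which is why it can be handled in linear time. A minimal sketch in Python (the space-separated token format is a hypothetical choice for illustration):

```python
def eval_postfix(tokens):
    """Evaluate a postfix (reverse Polish) expression in one linear
    pass: operands are pushed; each operator pops its two arguments."""
    stack = []
    for t in tokens:
        if t == "+":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif t == "*":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        else:
            stack.append(int(t))   # an operand
    return stack.pop()

print(eval_postfix("1 2 +".split()))      # 3, i.e., 1+2
print(eval_postfix("1 2 3 * +".split()))  # 7, i.e., 1+2*3
```

Each token is examined exactly once and no backtracking or ambiguity arises, so the work grows linearly with the length of the expression.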