CHAPTER 17
MISCELLANEOUS TOPICS IN PARSING
This chapter has three independent parts. The first shows a way by
which syntax errors can be reported for languages in general. The
second part introduces the ~scanner, a preprocessor which deals with
white space. Finally, we consider other parsing methods.
17.1
SYNTAX ERROR REPORTING
Errors are often committed while writing programs. Some errors have
to do with syntax. A comma or semicolon might be missing or misplaced.
Other errors have to do with semantics. A semantic error might be
a misspelling of a function name, or forcing a function to combine
data of types the function isn't prepared to handle.
We concentrate here on reporting syntax errors. Semantic, or ~datatype
errors are considered in Chapter 30.
The method works for all languages, even those whose rules of grammar
are unrestricted.
17.1.1
Strategy
Syntactically correct specifications have the property that the entire
text can be reduced to one, full-spanning part-of-speech. The
following correct (ICL) specification is rewritable into one, full-
spanning occurrence of the part-of-speech <STATEMENT>:
WRITE(4); WRITE(5); WRITE(1+2*3); WRITE(6);
Sometime during the rewriting process, these four statements are
rewritten to:
<STATEMENT> <STATEMENT> <STATEMENT> <STATEMENT>
These are ultimately reduced to one <STATEMENT>, thanks to the rule:
<STATEMENT> <STATEMENT> -> <STATEMENT>
Suppose now that we introduce a syntax error, changing the "+" into
the meaningless "%":
WRITE(4); WRITE(5); WRITE(1%2*3); WRITE(6);
We can't quite interpret this as four <STATEMENT>s. Certainly, the
first, second, and fourth statements can be seen as <STATEMENT>s, but
the third can't. If we nonetheless try to complete the rewriting as
much as possible, we get:
<STATEMENT> WRITE( <EXPR> % <EXPR> ); <STATEMENT>
The first ~two statements can be reduced all the way to a single
<STATEMENT>. Similarly, the fourth can be rewritten to <STATEMENT>.
The third statement, the "WRITE(...);" visible above, can't be
rewritten into a <STATEMENT>. However, some rewrites can be carried
out. The "1" is rewritten into an <EXPR>, because it is a correct
<EXPR>. Also, the "2*3" can be rewritten to <EXPR>.
We propose that this latest line, which is reduced as far as possible,
~be the syntax error message. Upon looking at it, we see that much
valid rewriting has occurred.  The unrewritten part, the
"<EXPR> % <EXPR>" stands out. ~Erroneous ~parts ~stand ~out while
correct parts are compact. Upon seeing that the "<EXPR> % <EXPR>"
won't reduce, one might recall suddenly that "%" can't combine two
<EXPR>s. (If one comes away believing that "%" is a valid operator,
one should try to find a syntax rule of the language which involves
"%" with a pair of <EXPR>s, which one then discovers doesn't exist).
We deliver as the error message the ~shortest possible phrase that
spans the entire input text.
Here is another syntax error:
WRITE( 1+2*3 ;
The shortest possible phrase spanning this input is:
WRITE( <EXPR> ;
In this partially compressed form, the missing close parenthesis
stands out.
Unfortunately, the shortest phrase might be misleading, as in the
following example. We would expect the text:
{ ABS(5) ; 6 % 7 }
to reduce to:
{ <EXPR> ; <EXPR> % <EXPR> }
There is however a shorter phrase:
{ <STATEMENT> <EXPR> % <EXPR> }
The "ABS(5);", ~including the semicolon, can syntactically be seen as
a <STATEMENT>, as though ABS were the name of a procedure (a non-value-
returning function).
The purpose of a syntax error message is to jiggle the mind of the
programmer so he or she can see the error. It is ~not to provide a
correction. If a system can correct errors, then that small set of
correctable errors could just as well be part of the language.
Usually, proposed corrections themselves are misleading.
This form of syntax error message has worked well during ICL's use.
The experienced ICL programmer can discover his or her errors
immediately. Newer users don't fare quite as well for a while. Some
knowledge of the rules of grammar is required. For example, the error
message:
WRITE( <EXPR> ;
makes sense only if you know that calls to functions (procedures)
require a close parenthesis as well as the open parenthesis. Also,
to interpret the error message involving the "ABS(5);" requires the
user to know that a "<STATEMENT>" might be an "<EXPR>;" in disguise.
With a little experience, however, these messages are effective.
BOX: What is the syntax error message?
BOX:
BOX: How could it be misleading?
17.1.2
Implementation
How do you go about finding the shortest full-spanning phrase?
One could work directly off of the CHOICE_OF_PHRASES structure.
However, there is an easier method that involves merely the
introduction of a small grammar.
We invent a new part-of-speech called SYNTAX_ERROR. We provide a
set of new rules involving SYNTAX_ERROR. Here is the first:
1) any_character <SYNTAX_ERROR> -> <SYNTAX_ERROR>
This rule by itself will render ~any sequence of characters, like the
input text, into a full-spanning SYNTAX_ERROR.
If we GIVE SYNTAX_ERROR on the righthand end of the input text, then
this rule will first apply by consuming the single, last character.
This rule's second application will then pick up the second to last
character. By applying this rule in such a right-to-left manner, the
entire input text will rewrite ultimately into one full-spanning
SYNTAX_ERROR.
This single-rule grammar acknowledges only characters, and not parts-
of-speech like EXPR or STATEMENT. We let SYNTAX_ERROR gobble up
EXPRs and STATEMENTs as well as characters. We introduce two more
rules similar to the first:
2) <EXPR> <SYNTAX_ERROR> -> <SYNTAX_ERROR>
3) <STATEMENT> <SYNTAX_ERROR> -> <SYNTAX_ERROR>
We might include two more:
4) <ID> <SYNTAX_ERROR> -> <SYNTAX_ERROR>
5) <QUANTIFIER> <SYNTAX_ERROR> -> <SYNTAX_ERROR>
All these rules indicate that we're willing to see, to the left of
SYNTAX_ERROR, an EXPR, STATEMENT, ID, or QUANTIFIER.
Beyond the first rule, the others introduce ambiguities. For example,
figure 17.1 shows two possible parsings for the ID ALPHA when
SYNTAX_ERROR appears to its right. "ALPHA" can be seen as a single
ID, or as a sequence of five characters.
Of all possible SYNTAX_ERROR interpretations, we would like the
shortest one. For example, we want the interpretation where ALPHA
is an ID rather than the other, longer, five-character interpretation.
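As a concrete sketch, finding that shortest interpretation can be
treated as a shortest-path problem over the input text.  The following
Python fragment is only a sketch, assuming the failed parse leaves
behind a list of recognized spans (start, end, part-of-speech); the
function name and data layout here are ours, not ICL's:

    def shortest_phrase(text, spans):
        # spans: (start, end, pos) triples that the parser did manage
        # to recognize, e.g., (7, 12, '<EXPR>') meaning text[7:12]
        # reduced to <EXPR>.
        n = len(text)
        cost = [float('inf')] * (n + 1)  # cost[i]: fewest items covering text[i:]
        best = [None] * (n + 1)          # item chosen at i, and where it ends
        cost[n] = 0
        for i in range(n - 1, -1, -1):   # right to left, as rule 1 applies
            cost[i], best[i] = cost[i + 1] + 1, (text[i], i + 1)
            for (s, e, pos) in spans:    # rules 2-5: absorb a recognized phrase
                if s == i and cost[e] + 1 < cost[i]:
                    cost[i], best[i] = cost[e] + 1, (pos, e)
        items, i = [], 0
        while i < n:
            item, i = best[i]
            items.append(item)
        return ' '.join(items)

Run over "WRITE( 1+2*3 ;" with its recognized spans, this yields the
compacted message "WRITE( <EXPR> ;" (with the surviving characters
spelled out one by one).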
17.1.2.1
The Semantics Of SYNTAX_ERROR
The ambiguity shown in figure 17.1 will appear as an ~OR-block under
the one full-spanning part-of-speech SYNTAX_ERROR (Chapter 15). It is
at that OR-block that we want to choose one interpretation over the
other.
Let's establish that the semantics of SYNTAX_ERROR is an action which
sets two global variables:
VAR MESSAGE = TEXT ;
COST = INT ;
The semantic action of SYNTAX_ERROR puts into MESSAGE the (compacted)
text over which the SYNTAX_ERROR spans. We will ultimately write the
variable MESSAGE to deliver the finished syntax error message. The
variable COST is set to the length of the text MESSAGE.
Our first rule, with semantics, is:
<ANY_CHARACTER[C]:X> <SYNTAX_ERROR:E> ->
<SYNTAX_ERROR:
<*E*>; "Put the given SYNTAX_ERROR's text into
MESSAGE, and put its length into COST."
MESSAGE:= C <$ MESSAGE; "Append our character
onto MESSAGE"
COST:= COST+1;
>
The same goes for any of the other four rules, such as the second:
<EXPR:X> <SYNTAX_ERROR:E> ->
<SYNTAX_ERROR:
<*E*>;
MESSAGE:= ' <EXPR> ' $$ MESSAGE ;
COST:= COST+1;
>
The semantics of each leaves in MESSAGE the compressed text spanned
by SYNTAX_ERROR.  The latter rule augments MESSAGE not with the
original text, but with the single item "<EXPR>".  COST is left holding
the number of items in MESSAGE.
The choice to minimize COST is made at OR-blocks. Let's do the
following at OR-blocks. (This would replace the assignment to OR in
the definition of GIVE, Section 15.2.1). A and B are the two semantic
interpretations, such as the two interpretations put together under
the ~one full-spanning SYNTAX_ERROR.
OR:= //[A;B;]
BEGIN VAR M=TEXT; C=INT;
" Do the first interpretation... "
<*A*>; "Set COST and MESSAGE for one interpretation"
C:= COST; M:= MESSAGE; "Remember them"
" Do the second interpretation... "
<*B*>;
" Compare costs (COST and C). Set COST and MESSAGE
to the one, cheapest interpretation... "
IF C < COST THEN COST:= C;
MESSAGE:= M;
FI
" We have set COST and MESSAGE, as the semantics of
any SYNTAX_ERROR should. "
END \\ ;
Now, the OR-block under a SYNTAX_ERROR will choose the one
interpretation involving the ID, whose cost is 1, and not the other
interpretation whose cost is 5.
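In a language without ICL's semantic blocks, the same OR-block
discipline might be sketched in Python with thunks.  MESSAGE and COST
are globals mirroring the ICL variables; or_block is our own name:

    MESSAGE, COST = '', 0

    def or_block(a, b):
        # a and b are zero-argument actions; each, when invoked, sets
        # MESSAGE and COST for its own interpretation.
        def choose():
            global MESSAGE, COST
            a()                     # do the first interpretation
            m, c = MESSAGE, COST    # remember its results
            b()                     # do the second interpretation
            if c < COST:            # keep whichever is cheaper
                MESSAGE, COST = m, c
        return choose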
In summary, the following is performed if a given input text fails to
parse syntactically:
1) Activate the new syntax-error grammar.
2) GIVE SYNTAX_ERROR to the right of the given text.
(This causes all the SYNTAX_ERROR rules to apply
appropriately).
3) Pick the one full-spanning SYNTAX_ERROR.
4) Invoke its semantics, which sets MESSAGE (and COST).
5) Print the contents of MESSAGE.
17.1.2.2
Suppressing An Exponential Cost
Steps can be taken, as they are in Chapters 27 and 28, to avoid
invoking any SYNTAX_ERROR's semantics ~more ~than ~once. This
suppresses an otherwise exponential computing cost down to a linear
cost.
Here, we can have the semantics of each SYNTAX_ERROR rule modify
itself objectively upon first invocation, as in:
... <SYNTAX_ERROR> ->
<SYNTAX_ERROR:
BEGIN VAR M=TEXT; C=INT;
<*E*>;
MESSAGE:= ... $$ MESSAGE ;
COST:= COST+1;
M:= MESSAGE; C:= COST;
@(SELF_):= //[M;C;]
MESSAGE:= M;
COST:= C; \\ ;
END
>
The semantics here operate as before, except for the final objective
modification:
@(SELF_) := ... ;
This final statement overwrites the block representing this rule
application's semantics. That block, SELF_, is overwritten by a new
BASIC_PROCESS which sets MESSAGE and COST. Upon non-first invocations,
the new program:
MESSAGE:= M;
COST:= C;
executes.  It does not recompute; rather, it sets the two variables to
the results computed earlier, upon the ~first invocation.
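A language without objective modification can get the same one-shot
behavior by wrapping the semantic action, as in this Python sketch
(memoize_semantics is our own name, not ICL's):

    def memoize_semantics(action):
        # First call: run 'action' (which sets MESSAGE and COST) and
        # remember the results.  Later calls: just restore them,
        # skipping the recomputation entirely.
        saved = {}
        def wrapper():
            global MESSAGE, COST
            if not saved:
                action()
                saved['m'], saved['c'] = MESSAGE, COST
            else:
                MESSAGE, COST = saved['m'], saved['c']
        return wrapper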
17.2
THE SCANNER - DEALING WITH WHITE SPACE
"White space" refers to the characters space, tab, carriage-return, and
form-feed, or any combination of them. Those characters use no ink.
Our discussion of parsing up until now has ignored white space. For
example, "1+2" has no white space, and can be seen as a <FORM>.
However, "1 + 2" does contain white space, and thus far won't be seen
as a <FORM>. We will see the phrase:
<FORM> ~<SPACE> + ~<SPACE> <TERM>
but this will not trigger the "+" rule:
<FORM> + <TERM> -> <FORM>
Should we therefore introduce the rule:
<FORM> <SPACE> + <SPACE> <TERM> -> <FORM>
which does allow for white space? (We would also need to introduce
two more rules:
<FORM> <SPACE> + <TERM> -> <FORM>
and
<FORM> + <SPACE> <TERM> -> <FORM> ).
It is easier to have our program that takes user input simply ignore
white space. Section 12.4 introduced the taking of user input. We
now modify it:
C:= NIL;
FOR each input character CHAR
DO
IF CHAR is not white space THEN
LEFT:= C;
C:= NIL;
GIVE( [LEFT:LEFT POS:CHAR] );
FI
END
We GIVE only characters that are not white space.
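The same loop can be sketched in Python, assuming a give function that
records a character together with its left neighbor and returns the
new block (the names here are ours):

    WHITE = set(' \t\r\n\f')

    def scan(text, give):
        # 'give' plays the role of GIVE: it receives the left neighbor
        # and the character, and returns the block built for that
        # character.
        c = None
        for char in text:
            if char not in WHITE:      # white space is simply dropped
                left, c = c, None
                c = give(left, char)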
17.2.1
White Space Does Affect Numbers And Names
We must be careful about throwing away white space. White space can
serve to separate two names or two numbers. For example:
1234 5678 is different from 12345678
and
ABRAHAM LINCOLN is different from ABRAHAMLINCOLN
If our taking of user input suppresses white space, then those
distinctions will be lost, unless we delegate the recognition of names
and numbers to the task of taking user input.
We can afford the loss of white space if we remove from the grammar the
recognition of names and numbers, and place the responsibility instead
on the ~scanner. The ~scanner is the process which takes user input,
including the recognition of names and numbers. The program fragment
just shown is a simple scanner, omitting the recognition of names and
numbers.
Our scanner will behave like the simple program just shown, except that
it will perform additional GIVEs, to introduce occurrences of names and
numbers. An occurence of a name or a number will not replace its
underlying characters.  Rather, a name or a number will be introduced
as an alternative interpretation alongside the raw characters themselves
(figure 17.2).
For simplicity, we will say that a name, called an ID (short for
IDentifier), will consist only of the alphabetical characters A through
Z. In general, IDs are usually allowed to contain digits as well,
and even the underscore character "_". The reader can perform the
generalization. Numbers will consist only of digits. (The minus sign
in a number will not be recognized by the scanner. That will be left
to a rule of grammar).
We introduce two Boolean variables IS_DIGIT and IS_ALPHA, to
characterize the input character. Our scanner is programmed as
follows:
C:= NIL;
FOR each input character CHAR
DO
IS_DIGIT:= '0' =< CHAR & CHAR =< '9' ;
IS_ALPHA:= 'A' =< CHAR & CHAR =< 'Z' ;
worry about identifiers
worry about numbers
IF CHAR is not white space THEN
LEFT:= C;
C:= NIL;
GIVE( [LEFT:LEFT POS:CHAR] );
FI
END
At the end of each iteration, C contains the non-white input character
as a PHRASE block.
Therefore, upon the start of the next iteration, we know that
C contains the previous character (left over from the previous
iteration). C will continue to reference the previous character while
we worry about identifiers and numbers. Only at the last minute do
we set LEFT to C and GIVE the current character.
17.2.2
Worrying About IDentifiers
Here is how we worry about identifiers. (A similar program fragment
will worry about numbers). We introduce three new variables:
VAR ID_TEXT = TEXT ; "The text of the identifier
seen up to now. "
ID_LEFT = CHOICE_OF_PHRASES ;
"This will be the LEFT field for the
finished identifier. It points to the
CHOICE_OF_PHRASES residing immediately
to the left of the identifier
(figure 17.2)."
WAS_ALPHA = BOOLEAN; "Was the ~previous character
alphabetical? "
We "worry" as follows:
IF "we are beginning an identifier ..."
-WAS_ALPHA & IS_ALPHA THEN
ID_LEFT:= C; "Remember left neighbor to this new ID.
(Recall that C points to the previous,
non-alpha character)"
ID_TEXT:= {CHAR}; "The first character of the ID"
FI
IF "we are in the middle of an identifier"
WAS_ALPHA & IS_ALPHA THEN
ID_TEXT:= ID_TEXT $> CHAR;
"Right-append the next character onto
the ID's text"
FI
IF "we are beyond the end of an identifier"
WAS_ALPHA & -IS_ALPHA THEN
"GIVE the identifier..."
GIVE( [LEFT:ID_LEFT POS:<ID> SEM: ?? ] );
FI
WAS_ALPHA:= IS_ALPHA; "In preparation for the next iteration"
We've broken the capture of an ID into three parts, the beginning, the
middle, and the end. At the beginning, we set ID_LEFT to the
CHOICE_OF_PHRASES to the left of the present, first alpha character.
In the middle, we collect up the characters. At the end, we GIVE the
ID.
When we execute the GIVE, recall that C contains the previously
input character, the last character of the identifier. (We've just
taken the first character beyond the end of the identifier, a non-alpha
character). By GIVing the <ID> now, the <ID> resides on the
CHOICE_OF_PHRASES containing the <ID>'s last character, and so the <ID>
and its characters share the same span.
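Pulling the three "worry" cases together, the whole scanner might be
sketched in Python as follows.  This is only a sketch: give is assumed
to return the CHOICE_OF_PHRASES it just extended, and numbers are
omitted just as in the fragments above:

    def scan(text, give):
        c = None
        id_left, id_text, was_alpha = None, '', False
        for char in text + ' ':          # a trailing blank flushes a final ID
            is_alpha = 'A' <= char <= 'Z'
            if not was_alpha and is_alpha:    # beginning: remember left neighbor
                id_left, id_text = c, char
            elif was_alpha and is_alpha:      # middle: collect the character
                id_text += char
            elif was_alpha and not is_alpha:  # beyond the end: GIVE the <ID>
                c = give(id_left, '<ID>', sem=id_text)
            was_alpha = is_alpha
            if char not in ' \t\r\n\f':
                left, c = c, None
                c = give(left, char)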
The part-of-speech of identifiers is <ID>, which we declare now:
POS ID : TEXT ;
The semantics of an ID is text, the text of the identifier. We will
come back to the specification of the SEM field within the call to
GIVE, the "??", in a moment. For completeness, we must set WAS_ALPHA
to FALSE at the very beginning, along with our scanner's first
assignment "C:=NIL;".
The semantics of the part-of-speech <ID> is TEXT, as we have declared.
Recall from chapter 4 that although we declare:
POS ID : TEXT ;
we actually get:
POS ID : //TEXT\\ ;
The semantics of <ID> is a ~process whose invocation delivers a TEXT.
Since the variable ID_TEXT contains the ID's text, we specify the
semantics, the "??" almost as:
// ID_TEXT \\
This is a process whose invocation will deliver the text.
It is not quite legal. Recall also from chapter 2 (and Section 23.6)
that we need to retain access to the value presently in the variable
ID_TEXT, so that in the future, when these semantics are invoked,
ID_TEXT will then hold what it now holds. We retain access to the
present ID_TEXT by writing the semantics legally as:
//[ID_TEXT;] ID_TEXT \\
As a whole, the GIVE appears as:
GIVE( [LEFT: ID_LEFT POS: <ID> SEM: //[ID_TEXT;] ID_TEXT \\]);
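The same distinction arises in Python, where a closure reads its free
variables late.  A one-line illustration of each behavior:

    sem = lambda: id_text      # wrong: reads whatever id_text holds at call time
    sem = lambda t=id_text: t  # right: captures today's value, like //[ID_TEXT;]...\\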
A similar treatment GIVEs <NUMBER>s when appropriate. Figure 17.2
shows an ambiguous phrase created by our scanner.
Finally, notice how white space separates <ID>s and separates
<NUMBER>s. IS_ALPHA is FALSE when given a blank character. This will
serve to terminate any <ID> in the making. While the blank character
terminates an <ID>, the blank itself is not GIVEn, because of the IF-
statement at the end of the FOR-loop.  White space has thus been kept
around only long enough to separate identifiers and numbers, but not
to make it into the ambiguous phrase.
BOX: What service does the scanner perform?
BOX:
BOX: Where must white space be considered?
BOX:
BOX: Does the scanner introduce ambiguous phrases of its
BOX: own?
BOX:
BOX: Who consumes <ID>s?  (See Part 6 for examples).
17.3
OTHER PARSING METHODS
To round out our discussion of parsing, we present now other parsing
methods. These methods are in wide use. However, they are more
restricted than ours and require greater effort to implement because
they shun ambiguity.
17.3.1
Recursive Descent
Perhaps the most straightforward and simplest parsing method is
~recursive ~descent, which is workable for some context-free grammars.
Beyond relatively simple grammars however, this method can become
unwieldy, and might require something called ~backtracking.
Backtracking refers to the retreat of one train of thought in favor of
trying another. In general, backtracking can introduce an exponential
cost as a function of the length of the input string.
The delightful thing about recursive descent is how easy it appears to
transform rules of grammar into functions in a programming language.
We write one function for each (non-terminal) part-of-speech. For
example, consider the rules:
<FORM> ::= <FORM> + <TERM>
<FORM> ::= <TERM>
For a given part-of-speech, e.g., <FORM>, we write one function, called
FORM, which represents all the rules whose give-phrases are <FORM>.
(The give-phrase resides to the left of the "::=").  The FORM function
is to be called whenever we expect to see a <FORM> next. Similarly,
we call the function TERM whenever we expect to see a <TERM>.
For example, consider now just the first rule. We transform it into a
function that looks almost the same:
DEFINE FORM:
FORM;
IF OK & CHAR = '+' THEN
NEXT_CHAR;
TERM;
ELSE OK:= FALSE; FI
ENDDEFN
It says that in looking for a <FORM>, see first a <FORM> (the first
line), and if that succeeds (OK), then insist that the next character
(CHAR) is a "+". Given that, consume the "+" (NEXT_CHAR) and proceed
to find a <TERM>. If this scenario is impossible, then set OK to
FALSE. Each such function has the option to fail, simply by setting OK
to FALSE.
Unfortunately, this function FORM will call itself before doing
anything else. That call to FORM will again call FORM, ad infinitum.
This infinite recursion is a symptom of the rule's own ~left
~recursion. A rule is ~left ~recursive if the first part-of-speech in
the want-phrase (that which is to the right of the "::=") matches the
rule's give-phrase (left of "::="). The straightforward implementation
of left recursive rules always causes infinite recursion.
We avoid left recursion by rewriting the rule instead as:
<FORM> ::= <TERM> + <FORM>
This rule has no left recursion, although it does have right
recursion (the <FORM> on the far right). Right recursion is harmless,
as in the following rendition of this rule:
DEFINE FORM:
TERM;
IF OK & CHAR = '+' THEN
NEXT_CHAR;
FORM;
ELSE OK:= FALSE; FI
ENDDEFN
This causes no infinite recursion because the call to FORM happens
only after consuming at least one character (the "+"). Therefore,
the recursion can go only as deep as the number of characters in the
input string.
We encapsulate both rules:
<FORM> ::= <TERM> + <FORM>
<FORM> ::= <TERM>
by writing:
DEFINE FORM:
TERM;
IF OK & CHAR = '+' THEN
NEXT_CHAR;
FORM;
FI
ENDDEFN
Surprisingly, we have removed the ELSE clause and simultaneously
accommodated the second rule.  That is, if the character following the
first TERM is not a "+", then we don't consume it (NEXT_CHAR).
Instead we just succeed, having performed now the second rule.
Similarly, the rules:
<TERM> ::= <ATOM> * <TERM>
<TERM> ::= <ATOM>
map to:
DEFINE TERM:
ATOM;
IF OK & CHAR = '*' THEN
NEXT_CHAR;
TERM;
FI
ENDDEFN
Finally, the rules:
<ATOM> ::= ( <FORM> )
<ATOM> ::= <NUMBER>
map to:
DEFINE ATOM:
IF CHAR = '(' THEN
NEXT_CHAR;
FORM;
IF CHAR = ')' THEN
NEXT_CHAR; " and succeed "
ELSE OK:= FALSE; FI
ELSE NUMBER; FI
ENDDEFN
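For readers who prefer a runnable rendition, here is the same trio of
functions as a Python sketch.  Exceptions stand in for the OK flag, and
an index into the string stands in for CHAR and NEXT_CHAR; all names
here are ours:

    class Parser:
        def __init__(self, text):
            self.text, self.i = text, 0

        def char(self):
            return self.text[self.i] if self.i < len(self.text) else ''

        def next_char(self):
            self.i += 1

        def form(self):              # <FORM> ::= <TERM> + <FORM> | <TERM>
            self.term()
            if self.char() == '+':
                self.next_char()
                self.form()

        def term(self):              # <TERM> ::= <ATOM> * <TERM> | <ATOM>
            self.atom()
            if self.char() == '*':
                self.next_char()
                self.term()

        def atom(self):              # <ATOM> ::= ( <FORM> ) | <NUMBER>
            if self.char() == '(':
                self.next_char()
                self.form()
                if self.char() != ')':
                    raise SyntaxError('missing )')
                self.next_char()
            else:
                self.number()

        def number(self):            # one or more digits
            if not self.char().isdigit():
                raise SyntaxError('number expected')
            while self.char().isdigit():
                self.next_char()

Parser('1+2*(3+4)').form() consumes the whole input; a malformed input
such as '1+' raises SyntaxError instead of setting OK to FALSE.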
For recursive descent to work without backtracking, it must be possible
to decide which ~one of the <FORM> rules to pursue just by looking at
the next character (CHAR), or possibly several characters ahead. The
presence or absence of a "(" makes the decision easy for the <ATOM>
rules, in the function ATOM.
Fortunately, with the <FORM> rules (and also the <TERM> rules), there
is no need to distinguish the alternatives, because both alternatives
begin with <TERM> (or <ATOM> for the <TERM> rules). Only after the
<TERM> is consumed do we need to distinguish the alternatives. The
presence or absence of a "+" makes the distinction. We are lucky with
this grammar. Notice that the FORM function consumes ~as ~much ~as
~possible, taking in as many "+"s as possible.
Our good fortune breaks down if we introduce a rule like:
<?> ::= <FORM> + <NUMBER> %
FORM's maximal consumption of "+"s would consume the final "+<NUMBER>"
in:
<NUMBER> + <NUMBER> %
FORM's greediness will never let our latest rule apply, even when it
is theoretically necessary, as in this latest phrase.
As grammars contain more and more rules, these problems multiply.
(For example, our model language ICL has nearly 150 rules). Imagine
the difficulty in introducing a new rule using recursive descent.
With our ambiguous method, in contrast, adding a new rule is a trivial
effort.
Backtracking can be introduced so as to allow the trial of more than
one alternative. One can pursue one alternative until failure is
discovered, and then pursue another alternative. Unfortunately, while
pursuing one alternative, another occurrence of backtracking might be
necessary. Within each of those alternatives, backtracking may occur
even again.
Backtracking within backtracking, etc., will wind up considering in
total an exponential number of alternatives, and hence building up
an exponential cost.  Assuming that there is always a choice of two
alternatives, a bifurcation can occur as often as once per character
in the given input string, giving rise to 2-to-the-N'th
alternatives where N is the length of the input string.
BOX: How are rules of grammar transformed into programs
BOX: in recursive descent?
BOX:
BOX: When might it be impossible to translate successfully
BOX: rules into recursive descent programs?
BOX:
BOX: Why could backtracking have an exponential cost?
17.3.1.1
Semantics In Recursive Descent
Let's introduce semantics at least for the <FORM> rules. We assume
now that any call to FORM, TERM, ATOM, or NUMBER leaves the semantic
value in the new variable ANSWER. Here is the FORM function with
semantics:
DEFINE FORM:
TERM;
IF OK & CHAR = '+' THEN
NEXT_CHAR;
X:= ANSWER; "the TERM's value"
FORM;
ANSWER:= X + ANSWER; "the sum of TERM and
FORM"
FI
ENDDEFN
If after the TERM there is no "+", then the ANSWER as left by the TERM
acts also as the ANSWER for our <FORM>.
If we were implementing "-" instead of "+", the specification of
semantics gets more difficult. Due to our conversion to ~right
~recursion, the expression
1-2-3-4
will group as
1-(2-(3-4))
Because "-" is not associative, this answer is different from the
answer we expect:
((1-2)-3)-4
To get around this problem, we can no longer have ANSWER represent a
number.
However, by having ANSWER represent a ~list ~of ~numbers, our "-"
semantics need not perform the subtractions immediately. The list of
numbers 1,2,3,4 can be formed, and the subtractions can occur after
the parsing, since we can traverse that list in any order, an order
not dictated by our parser's execution that groups right-to-left.
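A sketch of that idea in Python, replacing the form method of the
parser above and assuming number() now returns the integer it
consumed: the right recursion merely appends to a list, and a
left-to-right fold afterwards restores the expected grouping:

    def form(self, values=None):     # <FORM> ::= <NUMBER> - <FORM> | <NUMBER>
        values = [] if values is None else values
        values.append(self.number())
        if self.char() == '-':
            self.next_char()
            self.form(values)        # right recursion extends the same list
        return values

    def evaluate(values):
        # Fold left: [1, 2, 3, 4] -> ((1 - 2) - 3) - 4, as we expect.
        total = values[0]
        for v in values[1:]:
            total -= v
        return total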
The complications of semantic specification, imposed by the massaging
of the grammar needed to fit recursive descent, make recursive descent
difficult if not impossible to use.  It is not practical for
parsing languages in general.
BOX: Are semantic specifications easily introduced into
BOX: recursive descent programs in general?
BOX:
BOX: What is left recursion?
17.3.2
LR(k) Parsing
For some context-free grammars, it is possible to render a parser which
always applies the right rule at the right time. For these parsers,
there is no need to maintain multiple possibilities during parsing.
The most general of these parsers are called LR(k) parsers[].
At any point in the input string, the parser chooses what rule to
apply, or to apply none, based on the characters seen up to this point
and by looking ahead to the right by k characters or tokens.  In
practice, k is zero or one.  These parsers have the advantage of
always parsing in linear time.
LR(k) parsing never works for ambiguous grammars, and hence would
forbid ICL expressions like:
A+B#C
where the grouping is determined only after syntax processing is done.
Also forbidden would be a ~datatype grammar which allows "3" to be
seen as either an integer or a real. Not only are ambiguous grammars
forbidden, many unambiguous grammars can't be handled by LR(k) either.
It is often useful to form new languages by taking a combination of
existing languages. Even if two given languages are each LR(k), (that
is, they can each be parsed by an LR(k) parser), their union, the
language that accepts notations in either language, may not be LR(k).
There is no easy way to find out if a language can be implemented by
LR(k), other than by feeding it to an LR(k) parser and seeing if it
complains upon receipt of the grammar. Modifications to the grammar
may be necessary, and that may complicate the specification of
semantics, leading to less clear language definitions.
The LR(k) parsing method automatically generates tables to represent
the grammar and to provide for the certain application of rules. These
tables may grow large, and to combat that, less powerful but smaller
variations of LR(k) have been developed. Felt to be the most
practical is the variation called LALR(k), which is used by the popular
YACC (Yet Another Compiler Compiler) from Bell Labs.
Some people feel that the limitations imposed on languages by LR(k)
parsing are insignificant with regards to ~programming ~languages.
This attitude may reflect an assumption that programming languages
can't evolve far from where they are now.
It may be that progress in the development of new programming
languages has been stunted by this assumption. We certainly can't
know for sure until unrestricted methods are employed for some time,
for the easy and flexible specification of programming languages that
take innovative leaps towards human convenience (and hence the
reduction of bugs).
17.3.3
Earley's Efficient Context-Free Parser
Earley's parser[] works for all context-free grammars as does our
general parser. Given LR(k) grammars, Earley's parser works in linear
time, like the LR(k) method itself. Furthermore, the ~union of two
(or any finite number of) LR(k) grammars, which itself might not be
LR(k), will still be parsed in linear time. In general, Earley's
parser can consume as much as N^3 time and memory given an input
string of length N. Our general parser may consume N^4, but handles
some general rewrite rules. As shown earlier (Section 14.4.6), our
parser can be modified to work in N^3 time if we limit ourselves to
only context-free grammars.
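For concreteness, a minimal Earley ~recognizer (no parse trees, none
of the usual refinements) can be sketched in Python; the grammar
encoding here is ours:

    def earley(tokens, grammar, start):
        # grammar: nonterminal -> list of right-hand sides (tuples of
        # symbols); a symbol is a nonterminal (a grammar key) or a
        # terminal token.  A state is (lhs, rhs, dot, origin).
        n = len(tokens)
        chart = [set() for _ in range(n + 1)]
        chart[0] = {(start, rhs, 0, 0) for rhs in grammar[start]}
        for i in range(n + 1):
            changed = True
            while changed:              # iterate chart[i] to a fixed point
                changed = False
                for (lhs, rhs, dot, origin) in list(chart[i]):
                    if dot < len(rhs) and rhs[dot] in grammar:    # predict
                        for r in grammar[rhs[dot]]:
                            st = (rhs[dot], r, 0, i)
                            if st not in chart[i]:
                                chart[i].add(st); changed = True
                    elif dot < len(rhs):                          # scan
                        if i < n and tokens[i] == rhs[dot]:
                            chart[i + 1].add((lhs, rhs, dot + 1, origin))
                    else:                                         # complete
                        for (l2, r2, d2, o2) in list(chart[origin]):
                            if d2 < len(r2) and r2[d2] == lhs:
                                st = (l2, r2, d2 + 1, o2)
                                if st not in chart[i]:
                                    chart[i].add(st); changed = True
        return any((start, rhs, len(rhs), 0) in chart[n]
                   for rhs in grammar[start])

With the grammar {'FORM': [('TERM', '+', 'FORM'), ('TERM',)],
'TERM': [('n',)]} and tokens ['n', '+', 'n'], earley returns True.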
Earley's parser supports ambiguity as ours does, but it does so in a
different way.  Where we maintain all possible ~rewritten ~strings
upon the input string, Earley instead maintains the set of possible
~rules that could conceivably be applying at each point in the input
string. This makes Earley's parser ~top-down, as it keeps track of
rules, whereas ours is ~bottom-up in that it keeps track of variations
on the input string.
We will employ our general parser, with its capability for general
rewrite rules, in processing semantics as well as syntax.
As we shall see in Part 7, semantics will proceed by generating
phrases in another language, phrases of length greater than one,
as though they were the results of general rewrite rules (e.g., figure 25.9).
Also, Section 25.5 will show that syntactical ambiguities in the
(first) language will give rise to ambiguous ~phrases in the second
language. Further ambiguities may be introduced within the second
language itself. To have ~both kinds of ambiguity working together,
the ambiguity must be associated with the ~phrases and not the possible
rules that may be applicable. Thus, to process ambiguity in semantics,
our general parser applies whereas Earley's might not.