Recursive Descent Parser

Compilers need to recover the structure of a program from its textual representation. This process is called parsing, meanwhile the algorithm used for this is called parser. Parser will read the program and then convert it to a tree structure. Some cases will store the tree explicitly, while the others will left it implicitly thus proceeding the compilation process on the fly and process it based on the tree traversal.

Parsing in language is based on context-free language. Context free language is an attempt to describe natural language mathematically. One of its usage is to define a formal notation of a language since natural language is a subject for many ambiguity.

There’s two major different parsing methods: top-down and bottom-up. Top down approach can be analogous to expanding a parse tree or derivation from the sentence symbol until the frontier matches the desired input string (or until it becomes clear that there is no way to make them match). The tree that is expanded depends on which productions are used to replace the non terminals in the frontier of the tree at each step of the expansion.The differences among top-down parsing methods are in the methods used to choose this production.

One of many branch of top-down parsing is LL(1) parsing or leftmost(parse) lookahead 1 symbol. This parsing will construct a table where the proper action can be looked up, based on the symbols on the top of the stack and at the beginning of the input. This parsing style is also one of the simple approach of doing top-down parsing. Its general rule is “don’t choose a production unless it at least has a chance of matching the next input symbol”. However there’s a trade-off for this approach: Not all Context Free Grammar (CFGs) can be handled by LL(1) parsing. In the LL(1) table generation algorithm, it will build the table successfully, or otherwise will said if it’s not a LL(1) language and what’s the reason.

The basic idea behind the LL(1) parse table is to find the set of terminal symbols that can appear at the beginning of a string expanded from a production. If the non terminal on top of the stack is A and the next input is a, and there are two productions:

  • A→α
  • A→β,

We choose A→α if α can expand to something beginning with a. The choice must be unique; if β also expands to a string beginning with a, the grammar is not LL(1) (LL(1) parsing won’t work, so we either have to change the grammar or use a different parsing algorithm). Constructing the LL(1) parse table requires some preliminary definitions and computations.The first is the set of nullable non terminals. Thus we will need more general approach to compute nullable string:

A string α in (V ∪Σ)∗ is nullable if α∗ ⇒ ε

We will give a rule set that can be applied iteratively. The order of the execution does not matter so long as it eventually applied. Suppose we call the function as Nullable with boolean return value. The domain of the function is all the strings consisting of single non terminal symbols, or right-hand-sides of productions. Initially, we assume that Nullable is false for all strings; the rules can be applied to set it to true for some strings, whereupon other rules can set it to true for other strings. The process terminates when no rule can change Nullable from false to true for any string.

Nullable(X1X2…Xn) = true if, for all 1≤i≤n, Nullable(Xi).

Nullable(A) =true if there is a production in the CFGA→α and Nullable(α).

In particular, rule 1 implies that ε is nullable, since for all i≤i≤n applies to no i, so all (zero) symbols are nullable.

The next function that needs to be computed is the First No ε Set (FNE set). FNE(α) is the set of terminal symbols which can appear at the beginning of a terminal string derived from α. The domain of FNE consists of the set of all non terminals and suffixes(i.e., tails) of strings appearing on the right-hand sides of productions, but the co-domain of FNE consists of the subsets of Σ (i.e., FNE(α) is a set of terminal symbols).

The last set that needs to be defined before constructing the LL(1) parse table is the FOLLOW set for each non terminal. The FOLLOW sets are only used for nullable productions.Recall that the LL(1) parse table selects a production based on the non terminal on top of the stack and the next input symbol. We want to choose a production that is consistent with the next input. FNE sets tell us what we want to know if the next terminal is going to be derived from the top non terminal on the stack (we just pick the production whose right-hand side has the next input in it’s FNE set). But, suppose the non terminal on top of the stack“expands” to ε! Then the next non terminal is going to come from some symbol after that non terminal. The FOLLOW sets say exactly which terminals can occur immediately after a non terminal.

The LL(1) parse table is a two-dimensional array, which we will call Table. The rows are indexed by non terminals (for what is on the top of the stack) and the columns by terminals(the next input symbol). LL(1) parsing demands that we be able to choose exactly the next production to expand based solely on whether it is consistent with the next input symbol. So the LL(1) parse table construction not only builds parse tables, it is a test for whether a CFG is LL(1) or not.

LL(1) parsing can be used with an automatic parser generator. It is also appropriate as a basis for a simple hand-written parsing style, called Recursive Descent. The idea behind recursive descent parsing is to write a collection of recursive functions, one associated with each non terminal symbol in the grammar. The function for a non terminal is responsible for reading and parsing the part of the input string that that non terminal expands to.

Recursive descent parsing is very widely used, because it requires no special parser generation tools, it can be extended in ad-hoc manner (for example, looking ahead to several inputs, or looking at other context, when the next input does not uniquely determine the production choice), and it allows the user great freedom in generating error messages and doing error recovery.

Let’s said we have a productions from a grammar which function started with FUNC reserved word.

program –> function_list

function_list –> function_list function | function

function –>FUNC identifier ( parameter_list ) statements

Thus we expect to first find the token FUNC, then which in turn expect an identifier (the name of the function), followed by an opening parenthesis, and so on. As it pulls each token from the parser, it must ensure that it matches the expected, and if not, will halt with an error. For each non terminal, this function calls the associated function to handle its part of the parsing.

So first, we will introduce a utility function used to verify the next expected token.

And then we will create a function based on above utility function:

Parse Tree

Production for an if­ statement in this language:

if_statement –> IF expression THEN statement ENDIF | IF expression THEN statement ELSE statement ENDIF

Then we must left­ factor to share the common parts:

if_statement –> IF expression THEN statement close_if close_if –> ENDIF | ELSE statement ENDIF

When parsing the closing portion of the if, we have to decide which of the two right­ hand side options to expand. In this case, it isn’t too difficult. We try to match the first token again ENDIF and on non-­match, we try to match the ELSE clause and if that doesn’t match, it will report an error. This is quite simple, how if we have many alternatives on the right hand side?

statement –> assg_statement | return_statement | print_statement | null_statement | if_statement | while_statement | block_of_statements

What about left ­recursive productions? Now we see why these are such a problem in a predictive parser. Consider this left­ recursive production that matches a list of one or more functions.

function_list –> function_list function | function // first rule

function –>FUNC identifier ( parameter_list ) statement // second rule

This will causing infinite loop… Thus we will need to remove the left-recursion. The first rule will be simplified as:

function_list –> function function_list | function

And then we will need to create functions for common part:

function_list –> function more_functions

more_functions –> function more_functions | ε

C’est tout. J’espère que vous en profiterez. 📚

Machine Learning, Natural Language Processing, and Open Source.