3. Parsing terminology

30 May 2016 (edited Aug 2019)

In this article series I will teach not just how to use my parser generator, but also broader knowledge: what kinds of parser generators are out there? What’s ambiguity, and what do we do about it? What’s a terminal? So this article is a general discussion of parsing terminology and parsing techniques.

Parsing terminology

I’ll start with a short glossary of standard parsing terminology.

First of all, grammars consist of terminals and nonterminals.

A terminal is an item from the input. When you are defining a lexer, a terminal is a single character; when you are defining a parser, a terminal is a token from the lexer. More specifically, the grammar is concerned only with the “type” of the token, not its value. For example, one of your token types might be Number, and a parser cannot treat a particular number specially; so if you ever need to treat the number “0” differently than other numbers, your lexer would have to create a special token type for it (e.g. Zero), because the grammar cannot make decisions based on a token’s value. Note: in LLLPG you can circumvent this rule if you need to, but it’s not pretty.
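For example, a lexer that needed to single out zero might define token types something like this (a sketch; the token names and ranges are invented for illustration):

// Hypothetical token definitions: '0' gets its own token type so
// that a parser rule can treat zero differently from other numbers.
token Zero   @{ '0' };
token Number @{ '1'..'9' '0'..'9'* };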

A nonterminal is a rule in the grammar. So in this grammar:

token Spaces @{ (' '|'\t')+ };
token Id @{
  ('a'..'z'|'A'..'Z'|'_')
  ('a'..'z'|'A'..'Z'|'_'|'0'..'9')*
};
token Int @{ '0'..'9'+ };
token Token @{ Spaces | Id | Int };

The nonterminals are Spaces, Id, Int and Token, while the terminals are input characters like ' ', '\t', 's', '9', and so forth.

Traditional literature about parsing assumes that there is a single “Start Rule” that represents the entire language. For example, if you want to parse C#, then the start rule would say something like “zero or more using statements, followed by zero or more namespaces and/or classes”:

rule Start @{ UsingStmt* TopLevelDecl* };
rule TopLevelDecl @{ NamespaceStmt | TypeDecl };
...

However, LLLPG is more flexible than that. It doesn’t limit you to one start rule; instead, LLLPG assumes you might start with any rule that isn’t marked private. After all, you might want to parse just a subset of the language, such as a method declaration, a single statement, or a single expression.

LLLPG rules express grammars in a somewhat non-standard way. Firstly, LLLPG notation is based on ANTLR notation, which itself is mildly different from EBNF (Extended Backus–Naur Form) notation, which, in some circles, is considered more standard. Also, because LLLPG is embedded inside another programming language (LES or EC#), it uses the @{...} notation, which means “token literal”. A token literal is a sequence of tokens that is not interpreted by the host language; the host language merely figures out the boundaries of the “token literal” and saves the tokens so that LLLPG can use them later. So while a rule in EBNF might look like this:

Atom = ['-'], (num | id);

the same rule in LLLPG looks like this:

rule Atom @{ ['-']? (num | id) };

LLLPG also has an ANTLR-flavored input mode, and if you’re using that then you’d write:

Atom : ('-')? (num | id);

In EBNF, [Foo] means Foo is optional, whereas LLLPG expects Foo?, to express the same idea. Some versions of EBNF use {Foo} to represent a list of zero or more Foos, while LLLPG uses Foo* for the same thing. In LLLPG, like ANTLR, {...} represents a block of normal code (C#), which is why braces can’t represent a list.
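
For example, here is a sketch (the rule and its counter are invented for illustration) in which * expresses a list while the braces hold ordinary C# actions:

rule IntList @{
    {int count = 0;}        // {...} is plain C#, run during parsing
    Int {count++;}
    (',' Int {count++;})*   // the list comes from '*', not braces
};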

In LLLPG 1.0 I decided to support a compromise between the ANTLR and EBNF styles: you are allowed to write [...]? or [...]* instead of (...)? or (...)*; the ? or * suffix is still required. Thus, square brackets indicate nullability: the idea that the region in square brackets might consume no input.

Alternatives and grouping work the same way in LLLPG and EBNF (e.g. (a | b)), although LLLPG also has the notation (a / b) which I’ll explain later.

In addition, some grammar representations do not allow loops, or even optional items. For example, formal “four-tuple” grammars are defined this way. I discuss this further in my blog post, Grammars: theory vs practice.

A “language” is a different concept than a “grammar”. A grammar represents some kind of language, but generally there are many possible grammars that could represent the same language. The word “language” refers to the set of sentences that are considered valid; two different grammars represent the same language if they accept and reject the same inputs (or, looking at the matter in reverse, if you can generate the same set of “sentences” from both grammars). For example, the following four rules all represent a list of digits:

rule Digits1 @{ '0'..'9'+ };
rule Digits2 @{ '0'..'9'* '0'..'9' };
rule Digits3 @{ '0'..'9' | Digits3 '0'..'9' };
rule Digits4 @{ ('0'..'9'+)* };

If we consider each rule to be a separate grammar, the four grammars all represent the same language (a list of digits). But the grammars are of different types: Digits1 is an LL(1) grammar, Digits2 is LL(2), Digits3 is LALR(1), and I don’t know what the heck to call Digits4 (it’s highly ambiguous, and weird). Since LLLPG is an LL(k) parser generator, it supports the first two grammars but can’t handle the other two; it will print warnings about “ambiguity” for Digits3 and Digits4, then generate code that doesn’t work properly. While Digits4 is truly ambiguous, Digits3 is in fact unambiguous; it is merely “ambiguous in LL(k)”, meaning that it looks ambiguous from the top-down LL(k) perspective, which is LLLPG’s perspective.

The word nullable means “can match nothing”. For example, '0'..'9'* is nullable because it successfully “matches” an input like “hello, world” by doing nothing; but '0'..'9'+ is not nullable, and will only match something that starts with at least one digit.

LL(k) versus the competition

LLLPG is in the LL(k) family of parser generators. It is suitable for writing both lexers (also known as tokenizers or scanners) and parsers, but not for writing one-stage parsers that combine lexing and parsing into one step. It is more powerful than LL(1) parser generators such as Coco/R.

LL(k) parsers, both generated and hand-written, are very popular. Personally, I like the LL(k) class of parsers because I find them intuitive, and they are intuitive because when you write a parser by hand, it is natural to end up with a mostly-LL(k) parser. But there are two other popular families of parser generators for computer languages, based on LALR(1) and on PEGs (Parsing Expression Grammars).

Of course, I should also mention regular expressions, which are probably the most popular parsing tool of all. However, while you can use regular expressions for simple parsing tasks, they are worthless for “full-scale” parsing, such as parsing an entire source file. The reason is that regular expressions do not support recursion; for example, the following rule is impossible to represent with a regular expression:

rule PairsOfParens @{ '(' PairsOfParens? ')' };
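
By contrast, a short recursive function handles nesting easily. Here is a minimal hand-written C# sketch (not LLLPG output; the names are invented) that matches the same language as the rule above:

// Matches one "pair of parens", possibly containing a nested pair,
// starting at index i. Returns the index just past the match, or -1.
static int MatchParens(string input, int i)
{
    if (i >= input.Length || input[i] != '(')
        return -1;
    int j = MatchParens(input, i + 1);  // optional nested pair
    if (j < 0)
        j = i + 1;                      // no nested pair: that's fine
    if (j < input.Length && input[j] == ')')
        return j + 1;
    return -1;                          // missing ')'
}

MatchParens("((()))", 0) returns 6, but no finite regular expression can recognize arbitrarily deep nesting.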

Because of this limitation, I don’t think of regexes as a “serious” parsing tool.

Other kinds of parser generators also exist, but are less popular. Note that I’m only talking about parsers for computer languages; Natural Language Processing (e.g. to parse English) typically relies on different kinds of parsers that can handle ambiguity in more flexible ways. LLLPG is not really suitable for NLP.

As I was saying, the main difference between LL(k) and its closest cousin, the PEG, is that LL(k) parsers use prediction, and LL(k) grammars usually suffer from ambiguity, while PEGs do not use prediction, and the PEG formalism pretends that ambiguity does not exist, because it has a clearly-defined system of prioritization (the first alternative that can match, wins).

“Prediction” means figuring out which branch to take before it is taken. In a “plain” LL(k) parser (without and-predicates), the parser makes a decision and “never looks back”. For example, when parsing the following LL(1) grammar:

public rule Tokens @{ Token* };
public rule Token  @{ Float | Id | ' ' };
token Float        @{ '0'..'9'* '.' '0'..'9'+ };
token Id           @{ IdStart IdCont* };
rule  IdStart      @{ 'a'..'z' | 'A'..'Z' | '_' };
rule  IdCont       @{ IdStart | '0'..'9' };

The Token method will get the next input character (known as LA0, or “lookahead zero”), check whether it is a digit or ‘.’, and call Float if so; otherwise it calls Id or consumes a space. If the input is something like “42”, which does not match the definition of Float, the problem will be detected by the Float method, not by Token, and the parser cannot back up and try something else.
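
The code LLLPG generates for Token is shaped roughly like this (a simplified sketch, not LLLPG’s actual output; LA0 and Match are placeholder helpers):

void Token()
{
    // Prediction: inspect one character of lookahead, then commit
    // to a branch; a "plain" LL(k) parser never backs up afterward.
    char la0 = LA0;
    if (la0 == '.' || (la0 >= '0' && la0 <= '9'))
        Float();
    else if (la0 == ' ')
        Match(' ');
    else
        Id();
}

If you add a new Int rule: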

...
public rule Token @{ Float | Int | Id };
token Float       @{ '0'..'9'* '.' '0'..'9'+ };
token Int         @{ '0'..'9'+ };
token Id          @{ IdStart IdCont* };
...

Now you have a problem, because the parser potentially requires infinite lookahead to distinguish between Float and Int. By default, LLLPG uses LL(2), meaning it allows at most two characters of lookahead. With two characters of lookahead, it is possible to tell that input like “1.5” is Float, but it is not possible to tell whether “42” is a Float or an Int without looking at the third character. Thus, this grammar is ambiguous in LL(2), even though it is unambiguous when you have infinite lookahead. The parser will handle single-digit integers fine, but given a two-digit integer it will call Float and then produce an error because the expected ‘.’ was missing.
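
To see why two characters cannot settle the question, consider what the prediction step sees (a simplified sketch, ignoring Id and using invented helpers such as IsDigit):

// With k = 2, prediction sees only LA0 and LA(1). For input "42;",
// la0 = '4' and la1 = '2': both Float and Int can begin with two
// digits, so the parser picks Float (listed first) and fails later
// at the missing '.'. A third lookahead character would settle it.
char la0 = LA0, la1 = LA(1);
if (la0 == '.' || la1 == '.' || IsDigit(la1))
    Float();   // wrong guess for two-digit integers like "42"
else
    Int();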

A PEG parser does not have this problem; it will “try out” Float first and if that fails, the parser backs up and tries Int next. There’s a performance tradeoff, though, as the input will be scanned twice for the two rules.
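
In hand-written form, PEG-style ordered choice looks roughly like this (a conceptual C# sketch; InputPosition, TryFloat and TryInt are invented for illustration):

// PEG-style ordered choice: speculatively try Float; on failure,
// rewind to the saved position and try Int on the same input.
bool Number()
{
    int start = InputPosition;   // save position before speculating
    if (TryFloat())
        return true;             // Float matched; input consumed
    InputPosition = start;       // backtrack
    return TryInt();
}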

Although LLLPG is designed to parse LL(k) grammars, it handles ambiguity similarly to a PEG: if A|B is ambiguous, the parser will choose A by default because it came first, but it will also warn you about the ambiguity.

Since the number of leading digits is unlimited, LLLPG will consider this grammar ambiguous no matter how high your maximum lookahead k (as in LL(k)) is. You can resolve the conflict by combining Float and Int into a single rule:

public rule Tokens @{ Token* };
public rule Token  @{ Number | Id };
token Number       @{ '.' '0'..'9'+
                    | '0'..'9'+ ('.' '0'..'9'+)? };
token Id           @{ IdStart IdCont* };
...

Unfortunately, it is sometimes a little tricky to merge rules correctly. In this case, the problem is that Int always starts with a digit but Float does not. My solution here was to separate out the case of “no leading digits” into a separate “alternative” from the “has leading digits” case. There are a few other solutions you could use, which I’ll discuss later in this article.

I mentioned that PEGs can combine lexing and parsing in a single grammar because they effectively support unlimited lookahead. To demonstrate why LL(k) parsers usually can’t combine lexing and parsing, imagine that you want to parse a program that supports variable assignments like x = 0 and function calls like x(0), something like this:

rule Expr    @{ Assign | Call | ... };
rule Assign  @{ Id Equals Expr };
rule Call    @{ Id LParen ArgList };
rule ArgList ...
...

If the input is received in the form of tokens, then this grammar only requires LL(2): the Expr parser just has to look at the second token to find out whether it is Equals (‘=’) or LParen (‘(’), in order to decide whether to call Assign or Call.
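
In token form, the prediction is a simple two-token check, something like this sketch (the token API is invented for illustration):

void Expr()
{
    // The first token is Id either way; the second token decides.
    if (LA(1).Type == TokenType.Equals)
        Assign();
    else if (LA(1).Type == TokenType.LParen)
        Call();
    else
        OtherExpr();   // remaining alternatives of Expr, not shown
}

However, if the input is received in the form of characters, no amount of lookahead is enough! The input could be something like: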

this_name_is_31_characters_long = 42;

To parse this directly from characters, 33 characters of lookahead would be required (LL(33)), and of course, in principle, there is no limit to the amount of lookahead that might be needed. Besides, LLLPG is designed for small amounts of lookahead like LL(2) or maybe LL(4); a double-digit value is almost always a mistake. LL(33) could produce a ridiculously large and inefficient parser (I’m too afraid to even try it).

In summary, LL(k) parsers are not as flexible as PEG parsers, because they are normally limited to k characters or tokens of lookahead, and k is usually small; PEGs, in contrast, can always “back up” and try another alternative when parsing fails. LLLPG makes up for this limitation with “syntactic predicates”, which allow unlimited lookahead, but you must insert them yourself, so there is slightly more work involved and you have to pay some attention to the lookahead issue. In exchange for this extra effort, though, your parsers are likely to have better performance, because you are explicitly aware of when you are doing something expensive (large lookahead). I say “likely” because I haven’t been able to find any benchmarks comparing LL(k) parsers to PEG parsers, but I’ve heard rumors that PEGs are slower, and intuitively it seems to me that the memoization and retrying required by PEGs must have some cost; it can’t be free. Prediction is not free either, but since lookahead has a strict limit, the costs usually don’t get very high. Please note, however, that using syntactic predicates excessively could certainly cause an LLLPG parser to be slower than a PEG parser, especially given that LLLPG does not use memoization.
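
For example, the Float/Int conflict from earlier could be resolved with a syntactic predicate, roughly like this (a sketch of the idea using LLLPG’s &(...) zero-width assertion; treat the details as illustrative rather than definitive):

// &(...) scans ahead without consuming input: commit to Float only
// if the leading digits are really followed by '.', else try Int.
public rule Token @{ &('0'..'9'* '.') Float | Int | Id };
token Float       @{ '0'..'9'* '.' '0'..'9'+ };
token Int         @{ '0'..'9'+ };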

Let’s talk briefly about LL(k) versus LALR(1). Sadly, my memory of LALR(1) has faded since university, and while Googling around, I didn’t find any explanation of LALR(1) that I found really satisfying. Basically, LALR(1) is neither “better” nor “worse” than LL(k); it’s just different. As Wikipedia says:

The LALR(k) parsers are incomparable with LL(k) parsers – for any j and k both greater than 0, there are LALR(j) grammars that are not LL(k) grammars and conversely. In fact, it is undecidable whether a given LL(1) grammar is LALR(k) for any k >= 0.

But I have a sense that most LALR parser generators in widespread use support only LALR(1), not LALR(k); thus, roughly speaking, LLLPG is more powerful than an LALR(1) parser generator because it can use multiple lookahead tokens. However, since LALR(k) and LL(k) are incomparable, you would have to alter an LALR(1) grammar to make it work in an LL(k) parser generator, and vice versa.

Here are some key points about the three classes of parser generators discussed above, plus regular expressions.

LL(k)

Top-down parsers that use prediction: they pick a branch using at most k symbols of lookahead and then commit, never backing up. They are the most intuitive to write and read by hand. LLLPG is in this family, and supplements the fixed lookahead limit with syntactic predicates.

LALR(1)

Bottom-up parsers with one token of lookahead. LALR(k) and LL(k) are incomparable: neither is strictly more powerful than the other, and a grammar written for one family generally must be altered to work with the other.

PEGs

No prediction: alternatives are prioritized and tried in order, backtracking on failure, so lookahead is effectively unlimited and ambiguity is defined away. PEGs can combine lexing and parsing in a single grammar, but the backtracking and memoization have a cost.

Regular expressions

Probably the most popular parsing tool of all, and fine for simple tasks, but they do not support recursion, so they can’t handle “full-scale” parsing such as an entire source file.

Next up

In the next article you’ll learn about the interesting features you can use in your LLLPG grammars.