# Appendix: FullLLk versus "approximate" LL(k)

30 May 2016

First, the short version: try adding `[FullLLk(true)]` to your grammar if you suspect prediction isn’t working perfectly.

Now, it’s a bit difficult to explain how LLLPG generates a prediction tree without invoking all sorts of math-speak that, if you are like me, would make your head hurt. It is easier to explain with examples. Let’s start simple:

``````rule Comparison @{ '>' '=' | '<' '=' | '=' '=' | '>' | '<' };

void Comparison()
{
int la0, la1;
la0 = LA0;
if (la0 == '>') {
la1 = LA(1);
if (la1 == '=') {
Skip();
Skip();
} else
Skip();
} else if (la0 == '<') {
la1 = LA(1);
if (la1 == '=') {
Skip();
Skip();
} else
Skip();
} else {
Match('=');
Match('=');
}
}
``````

Roughly what happens here is that LLLPG

1. Finds the first set for each arm: {‘>’} for 1 and 4, {‘<’} for 2 and 5, {‘=’} for 3.
2. Finds a common subset between arm 1 and the others. In this case it finds {‘>’}, common between 1 and 4.
3. Generates the `if (la0 == '>') {...}` statement and then generates an inner prediction tree based on the knowledge that la0 is in the set {‘>’}, which excludes arms 2, 3 and 5.
4. Knowing that `la0 != '>'`, it excludes arms 1 and 4, then looks for another common subset and finds {‘<’}, common between 2 and 5.
5. Generates the `if (la0 == '<') {...}` statement and then generates an inner prediction tree based on the knowledge that la0 is in the set {‘<’}, which additionally excludes arm 3.
6. Only one arm is left, arm 3, and this becomes the else branch.

Note: the generated code is correct, although this example is unusual because arm 3 ends up acting as if it were the `default` branch. The code will change if you explicitly mark the last arm as the `default`, or if you add an `error` branch.

Here’s another example:

``````rule ABCD @{ (A B | C D) {} | A D };

void ABCD()
{
int la0, la1;
do {
la0 = LA0;
if (la0 == A) {
la1 = LA(1);
if (la1 == B)
goto match1;
else {
Skip();
Match(D);
}
} else
goto match1;
break;
match1: { ... } // omitted for brevity
} while (false);
}
``````

Note: `{}` forces LLLPG to create two prediction trees instead of one, see “A random fact” from the previous article.

I’m using this example because it was mentioned by Terrance Parr as something that ANTLR 2 couldn’t handle. LLLPG has no problem; to generate the outer prediction tree, LLLPG

1. Finds the first set for each arm: {`A`,`B`} for 1, {`A`} for 2.
2. Finds a common subset between arm 1 and the others. In this case it finds {`A`}.
3. Generates the `if (la0 == A) {...}` statement and then generates an inner prediction tree (for `LA(1)`) based on the knowledge that la0 is in the set {`A`}, which excludes the inner arm `C D` of `(A B | C D)`.
4. Knowing that `la0 != A`, it excludes arms 2, leaving only arm 1, and this becomes the `else` branch.

Note: I’ve been speaking as though LLLPG generates code during prediction, but it doesn’t. Instead there is an abstract intermediate representation for prediction trees, and the C# code is only generated after analysis and prediction is complete.

I didn’t realize it at first, but LLLPG’s technique doesn’t support all LL(k) grammars. It is more powerful than the Linear Approximate Lookahead of ANTLR 2, but some cases still don’t work, like this one:

``````LLLPG (lexer)
{
[LL(3)]
token Token    @{ Number | Operator | ' ' };
token Operator @{ '+'|'-'|'*'|'/'|'.' };
token Number   @{ '-'? '.'? '0'..'9'+ };
}
``````

After (correctly) warning that `Alternatives (1, 2) are ambiguous for input such as «'-' '.' 0» ([\-.], [.0-9], ~())`, LLLPG generates this slightly incorrect code for `Token`:

``````void Token()
{
int la1;
switch (LA0) {
case '-': case '.':
{
la1 = LA(1);
if (la1 == '.' || la1 >= '0' && la1 <= '9')
Number();
else
Operator();
}
break;
case '0': case '1': case '2': case '3': case '4':
case '5': case '6': case '7': case '8': case '9':
Number();
break;
case '*': case '+': case '/':
Operator();
break;
default:
Match(' ');
break;
}
}
``````

To choose this code, LLLPG

1. Finds the first set for each arm: {`'+','-','*','/','.'`} for 1, {`'-','.','0'..'9'`} for 2 and {`' '`} for 3.
2. Finds a common subset between arm 1 and the others. In this case it finds {`'-','.'`}.
3. Generates the `case '-': case '.': {...}` block and then generates an inner prediction tree (for `LA(1)`) based on the knowledge that `LA0` is in the set {`'-','.'`}, which excludes only the possibility of arm 3 (`' '`).
• LLLPG computes the second sets which (keeping in mind that `LA0` is `'-'` or `'.'`) are {`'.','0'..'9'`} for arm 1 and `_` (any character) for arm 2.
• It finds the common subset, {`'.','0'..'9'`}
• It generates the `if (la1 == '.' || la1 >= '0' && la1 <= '9')` and generates an inner prediction tree for `LA(2)`.
• `LA(2)` can be anything `_` for both rules, so LLLPG reports an ambiguity between 1 and 2 and chooses 1 (`Number()`) as it has higher priority because it is listed first. - Knowing that LA(1) is not in the set {`'.','0'..'9'`}, it excludes arms 1, leaving only arm 2, and this becomes the `else` branch.
4. It generates the other cases, which are easy to understand so I’ll skip them.

Now, why is the generated code wrong? It’s wrong in the case of the input string “`-. `”, which should match `Operator` but instead matches `Number`. To fix this, I added a finer-grained analysis that is enabled by the `[FullLLk]` option.

``````[LL(3)] [FullLLk]
token Token    @{ Number | Operator | ' ' };
``````

This analysis realizes that, due to the relatively complex substructure of `Number`, it should split `'-'` and `'.'` into two separate cases.

When analyzing the case `la0 == '-'`, the set for `LA(1)` for arm 1 is still {`'.','0'..'9'`}, but LLLPG further figures out that it should split the analysis of `LA(1)` into separate subtrees for `'.'` and `'0'..'9'`. In the subtree where `la0 == '-'` and `la1 == '.'`, LLLPG is able to figure out that `Number` should only be invoked if `LA(2)` is `'0'..'9'`. It is able to figure this out now because the information `la0 == '-' && la1 == '.'` is more specific than the information it had without `[FullLLk]` (i.e. `(la0 == '-' || la0 == '.') && (la1 == '.' || la1 >= '0' && la1 <= '9')`).

So after the more detailed analysis of `[FullLLk]`, the output code becomes

``````void Token()
{
int la1, la2;
switch (LA0) {
case '-':
{
la1 = LA(1);
if (la1 == '.') {
la2 = LA(2);
if (la2 >= '0' && la2 <= '9')
Number();
else
Operator();
} else if (la1 >= '0' && la1 <= '9')
Number();
else
Operator();
}
break;
case '.':
{
la1 = LA(1);
if (la1 >= '0' && la1 <= '9')
Number();
else
Operator();
}
break;
case '0': case '1': case '2': case '3': case '4':
case '5': case '6': case '7': case '8': case '9':
Number();
break;
case '*': case '+': case '/':
Operator();
break;
default:
Match(' ');
break;
}
}
``````

You still get the ambiguity warning, though. Use a slash to suppress the warning: `Number / Operator | ' '`.

Full LL(k) mode doesn’t always work perfectly, and may make LLLPG run slower, which is why it is not enabled by default. But usually, it works fine and you can safely apply it to your entire grammar.

In certain cases, LLLPG reports an ambiguity that doesn’t actually exist in a grammar without the `[FullLLk]` option. One example is given by this blog post that I wrote while writing the EC# grammar. So if you can’t figure out where an ambiguity comes from, try `[FullLLk]`. If you still get the same ambiguity warning after enabling Full LL(k), check over your grammar carefully, because it is probably genuinely ambiguous.