This page has moved to ecsharp.net.
GitHub doesn't support HTTP redirects, so you'll be redirected via JavaScript in 5 seconds.LLLPG Part 4: Managing Ambiguity + API reference
25 Feb 2014 (updated 22 May 2016)This page is obsolete
The LLLPG manual has been reorganized. These old articles may be deleted in the future.
Welcome to part 4
New to LLLPG? Start at part 1.
Part 4 is all about nitty-gritty details: how prediction works, a discussion of ambiguity (what it is, common ambiguous situations, and how to deal with them), and a list of APIs that LLLPG calls in generated code.
FullLLk versus “approximate” LL(k)
First, the short version: try adding [FullLLk(true)]
to your grammar if you suspect prediction isn’t working perfectly.
Now, it’s a bit difficult to explain how LLLPG generates a prediction tree without invoking all sorts of math-speak that, if you are like me, would make your head hurt. It is easier to explain with examples. Let’s start simple:
rule Comparison @[ '>' '=' | '<' '=' | '=' '=' | '>' | '<' ];
void Comparison()
{
int la0, la1;
la0 = LA0;
if (la0 == '>') {
la1 = LA(1);
if (la1 == '=') {
Skip();
Skip();
} else
Skip();
} else if (la0 == '<') {
la1 = LA(1);
if (la1 == '=') {
Skip();
Skip();
} else
Skip();
} else {
Match('=');
Match('=');
}
}
Roughly what happens here is that LLLPG
- Finds the first set for each arm: {‘>’} for 1 and 4, {‘<’} for 2 and 5, {‘=’} for 3.
- Finds a common subset between arm 1 and the others. In this case it finds {‘>’}, common between 1 and 4.
- Generates the
if (la0 == '>') {...}
statement and then generates an inner prediction tree based on the knowledge that la0 is in the set {‘>’}, which excludes arms 2, 3 and 5. - Knowing that
la0 != '>'
, it excludes arms 1 and 4, then looks for another common subset and finds {‘<’}, common between 2 and 5. - Generates the
if (la0 == '<') {...}
statement and then generates an inner prediction tree based on the knowledge that la0 is in the set {‘<’}, which additionally excludes arm 3. - Only one arm is left, arm 3, and this becomes the else branch.
Note: the generated code is correct, although this example is unusual because arm 3 ends up acting as if it were the default
branch. The code will change if you explicitly mark the last arm as the default
, or if you add an error
branch.
Here’s another example:
rule ABCD @[ (A B | C D) {} | A D ];
void ABCD()
{
int la0, la1;
do {
la0 = LA0;
if (la0 == A) {
la1 = LA(1);
if (la1 == B)
goto match1;
else {
Skip();
Match(D);
}
} else
goto match1;
break;
match1: { ... } // omitted for brevity
} while (false);
}
Note: {}
forces LLLPG to create two prediction trees instead of one, see “A random fact” from the previous article.
I’m using this example because it was mentioned by Terrance Parr as something that ANTLR 2 couldn’t handle. LLLPG has no problem; to generate the outer prediction tree, LLLPG
- Finds the first set for each arm: {
A
,B
} for 1, {A
} for 2. - Finds a common subset between arm 1 and the others. In this case it finds {
A
}. - Generates the
if (la0 == A) {...}
statement and then generates an inner prediction tree (forLA(1)
) based on the knowledge that la0 is in the set {A
}, which excludes the inner armC D
of(A B | C D)
. - Knowing that
la0 != A
, it excludes arms 2, leaving only arm 1, and this becomes theelse
branch.
Note: I’ve been speaking as though LLLPG generates code during prediction, but it doesn’t. Instead there is an abstract intermediate representation for prediction trees, and the C# code is only generated after analysis and prediction is complete.
I didn’t realize it at first, but LLLPG’s technique doesn’t support all LL(k) grammars. It is more powerful than the Linear Approximate Lookahead of ANTLR 2, but some cases still don’t work, like this one:
LLLPG (lexer)
{
[LL(3)]
token Token @[ Number | Operator | ' ' ];
token Operator @[ '+'|'-'|'*'|'/'|'.' ];
token Number @[ '-'? '.'? '0'..'9'+ ];
}
After (correctly) warning that Alternatives (1, 2) are ambiguous for input such as «'-' '.' 0» ([\-.], [.0-9], ~())
, LLLPG generates this slightly incorrect code for Token
:
void Token()
{
int la1;
switch (LA0) {
case '-': case '.':
{
la1 = LA(1);
if (la1 == '.' || la1 >= '0' && la1 <= '9')
Number();
else
Operator();
}
break;
case '0': case '1': case '2': case '3': case '4':
case '5': case '6': case '7': case '8': case '9':
Number();
break;
case '*': case '+': case '/':
Operator();
break;
default:
Match(' ');
break;
}
}
To choose this code, LLLPG
- Finds the first set for each arm: {
'+','-','*','/','.'
} for 1, {'-','.','0'..'9'
} for 2 and {' '
} for 3. - Finds a common subset between arm 1 and the others. In this case it finds {
'-','.'
}. - Generates the
case '-': case '.': {...}
block and then generates an inner prediction tree (forLA(1)
) based on the knowledge thatLA0
is in the set {'-','.'
}, which excludes only the possibility of arm 3 (' '
).- LLLPG computes the second sets which (keeping in mind that
LA0
is'-'
or'.'
) are {'.','0'..'9'
} for arm 1 and_
(any character) for arm 2. - It finds the common subset, {
'.','0'..'9'
} - It generates the
if (la1 == '.' || la1 >= '0' && la1 <= '9')
and generates an inner prediction tree forLA(2)
.
LA(2)
can be anything_
for both rules, so LLLPG reports an ambiguity between 1 and 2 and chooses 1 (Number()
) as it has higher priority because it is listed first. - Knowing that LA(1) is not in the set {'.','0'..'9'
}, it excludes arms 1, leaving only arm 2, and this becomes theelse
branch.
- LLLPG computes the second sets which (keeping in mind that
- It generates the other cases, which are easy to understand so I’ll skip them.
Now, why is the generated code wrong? It’s wrong in the case of the input string “-.
”, which should match Operator
but instead matches Number
. To fix this, I added a finer-grained analysis that is enabled by the [FullLLk]
option.
[LL(3)] [FullLLk]
token Token @[ Number | Operator | ' ' ];
This analysis realizes that, due to the relatively complex substructure of Number
, it should split '-'
and '.'
into two separate cases.
When analyzing the case la0 == '-'
, the set for LA(1)
for arm 1 is still {'.','0'..'9'
}, but LLLPG further figures out that it should split the analysis of LA(1)
into separate subtrees for '.'
and '0'..'9'
. In the subtree where la0 == '-'
and la1 == '.'
, LLLPG is able to figure out that Number
should only be invoked if LA(2)
is '0'..'9'
. It is able to figure this out now because the information la0 == '-' && la1 == '.'
is more specific than the information it had without [FullLLk]
(i.e. (la0 == '-' || la0 == '.') && (la1 == '.' || la1 >= '0' && la1 <= '9')
).
So after the more detailed analysis of [FullLLk]
, the output code becomes
void Token()
{
int la1, la2;
switch (LA0) {
case '-':
{
la1 = LA(1);
if (la1 == '.') {
la2 = LA(2);
if (la2 >= '0' && la2 <= '9')
Number();
else
Operator();
} else if (la1 >= '0' && la1 <= '9')
Number();
else
Operator();
}
break;
case '.':
{
la1 = LA(1);
if (la1 >= '0' && la1 <= '9')
Number();
else
Operator();
}
break;
case '0': case '1': case '2': case '3': case '4':
case '5': case '6': case '7': case '8': case '9':
Number();
break;
case '*': case '+': case '/':
Operator();
break;
default:
Match(' ');
break;
}
}
You still get the ambiguity warning, though. Use a slash to suppress the warning: Number / Operator | ' '
.
Full LL(k) mode doesn’t always work perfectly, and may make LLLPG run slower, which is why it is not enabled by default. But usually, it works fine and you can safely apply it to your entire grammar.
In certain cases, LLLPG reports an ambiguity that doesn’t actually exist in a grammar without the [FullLLk]
option. One example is given by this blog post that I wrote while writing the EC# grammar. So if you can’t figure out where an ambiguity comes from, try [FullLLk]
. If you still get the same ambiguity warning after enabling Full LL(k), check over your grammar carefully, because it is probably genuinely ambiguous.
Ambiguity: introduction
In the context of a parser generator, Ambiguity refers to the situation in which, for a particular grammar, there can exist more than one potential parse tree for an input. Programming languages are designed to be less ambiguous than human languages, but their grammars generally do have ambiguities anyway. Some ambiguities are normal and expected and there is a standard “solution” to the problem, while others may be unique to the language you are parsing.
There are various kinds of ambiguities that I will illustrate by way of actual published newspaper headlines…
“Milk Drinkers are Turning to Powder.” “Red tape holds up new bridge.” “Kids Make Nutritious Snacks.” “Prostitutes Appeal to Pope.” “Stolen painting Found by Tree.” These sentences are ambiguous because of particular words or phrases that have multiple meanings: “turning”, “holds up”, “red tape”, “make”, “appeal” and “by”. Ambiguities of this sort are easily avoided by giving words and symbols only one meaning. You can safely define multiple meanings, though, if you define a unique structure or context for each meaning, e.g. in C# there are two using
statements, but in one case using
is always followed by (
and in the other case it is never followed by (
. Plus, one only appears at the beginning of a file and the other never appears at the beginning of a file.
“Squad Helps Dog Bite Victim.” “Complaints About NBA Referees Growing Ugly” “Quarter of a Million Chinese Live on Water” A lot of ambiguities in newspaper headlines are caused by the omission of “is” or “are” or a definite or indefinite article (the, a, an), or by the reader’s expectation of an omitted word. The problem exists only in certain languages such as English, and is easily avoided in programming languages by including sufficient redundancy to prevent ambiguity; for example, if the semicolons between the clauses of a “for” loop were not required, a loop like for (i = j + 1; -i > k; ++i)
would become ambiguous because for (i = j + 1 -i > k ++i)
could be parsed in several ways.
“Hospitals are Sued by 7 Foot Doctors.” “Hershey Bars Protest.” In these cases, the sentences permit different parts of speech for certain words, which allows different sentences structures to emerge, causing ambiguity. In programming languages, an example of a similar ambiguity arises if you define
- An operator that can be prefix or infix, e.g.
-
:-x
orx-y
- A different operator that can be suffix or infix, e.g.
*
:x*
orx*y
These operators are perfectly unambiguous by themselves; a problem arises only when you combine them in a certain way. Specifically, the expression x * - y
is ambiguous, as it could be parsed x * (- y)
or (x *) - y
. For this reason, if a real-life programming language contains a suffix operator, that operator is usually not an infix operator also. C++, somewhat famously, violates this rule with the pointer/multiplication operator *
: X * Y;
could be interpreted in two ways: it may define a variable of type X*
named Y
, or it may invoke an overloaded operator*
function (just as cout >> Y;
is a valid statement that calls operator>>
). C# doesn’t have exactly the same problem because, for example, cout >> Y;
is always illegal in C#; however, you can tell that the C# parser understands the expression because it reports the following error:
error CS0201: Only assignment, call, increment, decrement, await, and new object expressions can be used as a statement
In order to report this error and only this error, the C# parser still detects that cout >> Y
is a valid expression. You can see this because the error messages change for other invalid input like x */ Y;
or << hello there >>;
. Thus, in C# the same ambiguity exists at the parser level and is solved in a similar manner as in C++, by giving higher priority to the “variable declaration” interpretation than the “expression” interpretation.
“Include your Children When Baking Cookies.” Here, the sentence is unambiguous from a parsing perspective, but has incomplete information: there are two different ways the children could be “included” in the process and I’m not sure I would trust a computer to choose an interpretation! This is not a parsing issue; the sentence can be parsed unambiguously, yet its meaning is ambiguous. I suppose an analogy in computer languages would be this C# code:
class A { public virtual void F(Foo x) {...} }
class B : A {
public override void F(Foo x) {...} // first
public void F(Bar x) {...} // second
}
If I write new B().F(new Foo())
, will the compiler call the first or second method? In fact, under certain circumstances, a C# compiler will call the second method (brownie points go to the first person to explain why).
“Enraged Cow Injures Farmer With an Axe.” Here, the issue is that the “With” clause can attach to either “Farmer” or “Injures” (or is it “Cow”?) In computer languages, this type of problem is generally solved precedence rules and parenthesis. In LL(k) parsers, precedence rules are typically expressed by creating a rule for every precedence level, plus an innermost rule for parenthesis, identifiers and literals, which I like to call Atom
. So if you want these four precedence levels:
- Primary:
a.b
,f(x)
- Prefix:
-a
- multiply/divide:
a*b
,a/b
- add/subtract:
a+b
,a-b
Then you will need five rules that typically have the following form (plus a bunch of {actions}
that I left out):
// (before this you'll need to define some aliases for '.', '+', '-', etc.)
rule Atom() @[ TT.Id | TT.Num | '(' Expr ')' ];
rule PrimaryExpr() @[ Atom [ '.' Atom | '(' Expr ')' ]* ];
rule PrefixExpr() @[ '-' Atom | Atom ];
rule MulExpr() @[ PrefixExpr [ ('*'|'/') PrefixExpr ]* ];
rule Expr() @[ MulExpr [ ('+'|'-') MulExpr ]* ];
If you write the grammar in a single rule like this:
rule Expr() @
[ Expr ('*'|'/') Expr
| Expr ('+'|'-') Expr
| '-' Expr
| TT.Id | TT.Num | '(' Expr ')'
];
The grammar is not only ambiguous, but also left-recursive (an Expr
can start with an Expr
), which LLLPG is completely unable to handle. If you use two rules, like this:
rule Atom() @[ TT.Id | TT.Num | '(' Expr ')' ];
rule Expr() @[
( Atom | '-' Atom )
[ ('*'|'/') Atom
| ('+'|'-') Atom
]*
];
The grammar is not ambiguous, but will parse expressions like a cheap calculator (so that 2+3*4 = 20
) instead of a scientific calculator (2+3*4 = 14
). Also, - - x
cannot be parsed since -
is to be followed by Atom
(if you write '-' Expr
, the grammar is ambiguous and the parser will have strange behavior, parsing 2 * -3 + 4
like 2 * -(3 + 4)
because LLLPG parses greedily by default).
In traditional LL(k) parsers, you must define a separate rule for every precedence level, and in EC# there are 22 levels. It is possible to collapse many levels into a single rule, though, and in the next article I will describe how.
Tip: when writing your first parser, write it without any important {actions}
, i.e. don’t create a syntax tree at first. Just focus on making a recognizer, like the examples above, that simply scans through the input without interpreting it.
Managing ambiguity, part 1: token
rules
By declaring a token using token
instead of rule
, you’re asking LLLPG to simplify its analysis while avoiding warnings about a certain type of ambiguity.
A lexer separates a text document into a sequence of tokens, so it could be written like this:
public struct Tok { ... }
LLLPG(lexer)
{
public rule List<Tok> Start @[
{List<Token> ts;} ts+=Token* EOF {return ts;}
];
token Tok Token() {
// BTW: Instead of writing a @[...] block with {...} actions inside,
// LLLPG lets you write a {...} block with @[...] blocks inside. But
// currently the @[...] blocks must be at the top level of the method,
// not nested inside anything else such as a try {...} region.
Token t;
@[ t=Spaces | t=Id | t=Int | t=Op ];
return t;
}
token Tok Spaces @[ (' '|'\t')+ {return new Token(...);} ];
token Tok Id @[ ('a'..'z'|'A'..'Z') ('a'..'z'|'A'..'Z'|'0'..'9')+
{return new Token(...);} ];
token Tok Int @[ ('0'..'9')+ {return new Token(...);} ];
token Tok Op @[ op:=('+'|'-'|'*'|'/'|'=')+
{return new Token(...);} ];
}
The thing is, any lexer grammar that follows this pattern is ambiguous! Because of the loop in Start
, an identifier like “ab3
” could theoretically be parsed in four different ways: as a single Id “ab3
”, as two Ids “a
” “b3
”, as two tokens “ab
” “3
”, or as three tokens “a
” “b
” “3
”.
So if all the rules are written using rule
rather than token
, LLLPG will report three ambiguities, one each for Id
or Num
and Spaces
:
- Warning: (12,23): Loyc.LLParserGenerator.Macros.run_LLLPG: Alternatives (1, exit) are ambiguous for input such as « *» ([\t ], [$\t *+-/-9=A-Za-z])
- Warning: (13,43): Loyc.LLParserGenerator.Macros.run_LLLPG: Alternatives (1, exit) are ambiguous for input such as «0*» ([0-9A-Za-z], [$\t *+-/-9=A-Za-z])
- Warning: (15,23): Loyc.LLParserGenerator.Macros.run_LLLPG: Alternatives (1, exit) are ambiguous for input such as «0*» ([0-9], [$\t *+-/-9=A-Za-z])
LLLPG detects the ambiguity while looking at the follow set of each rule, which it does while analyzing the loops. Since Token
appears in a loop, Id
can theoretically be followed by another Id
, or by Num
, so the location where an Id
or Num
or Spaces
loop should stop is ambiguous.
I created token
mode to avoid warnings like this and the potentially complex analysis that produces them. token
replaces the follow set of a rule with _*
(i.e. anything), and then suppresses the inevitable ambiguity warning (because a decision between _* and anything else always ambiguous.) The mode is called token
since it is useful in the context of a lexer, but occasionally it is useful in parsers too.
By convention, when I write a lexer, I mark the top-level token rules with token
, and I use rule
for the sub-rules that are called by token
s.
Managing ambiguity, part 2: LLLPG’s missing feature
In certain ambiguous cases, notably those in which some alternatives are prefixes of others, some parser generators (including ANTLR) have the ability to select the longest match automatically. LLLPG does not have this ability, and you’ll notice this problem when writing rules for operators:
token CompareOp() @[ '>' | '<' | '=' | ">=" | "<=" ];
The output is:
- Warning: (4,23): Loyc.LLParserGenerator.Macros.run_LLLPG: Alternatives (1, 2) are ambiguous for input such as «>=» ([>], [=])
- Warning: (4,23): Loyc.LLParserGenerator.Macros.run_LLLPG: Alternatives (1, 3) are ambiguous for input such as «<=» ([<], [=])
-
Warning: (4,23): Loyc.LLParserGenerator.Macros.run_LLLPG: Branches 2, 3 are unreachable.
void CompareOp() { MatchRange('<', '>'); }
Please excuse the strange numbering scheme in the error message: LLLPG actually interprets '>' | '<' | '=' | ">=" | "<="
as ('>' | '<' | '=') | ">=" | "<="
, with the three single characters unified into a set, and it turns out that ('>' | '<' | '=')
is equivalent to '<'..'>'
, which explains where MatchRange('<', '>')
comes from.
What’s going on here? Well, if the input is ‘<=
’, ‘<
’ matches just as well as ‘<=
’, and because it is listed first, LLLPG gives it higher priority. So ‘<=
’ is unreachable because ‘<
’ takes priority, and ‘>=
’ is unreachable because ‘>
’ takes priority. This is easily fixed by always listing longer operators first:
token CompareOp() @[ ">=" | "<=" | '>' | '<' | '=' ];
Keywords are even more tricky. Let’s say you have the keywords fn
, for
, if
and while
:
rule IdStartChar @[ 'a'..'z'|'A'..'Z'|'_' ];
rule Id @[ IdStartChar (IdStartChar|'0'..'9')* ];
[LL(6)] // Longest keyword plus one
token IdOrKeyword @[ "fn" | "for" | "if" | "while" | Id ];
Here I’ve given Id
lower priority than the keywords, which will usually work correctly. However, it won’t work correctly for an Id
prefixed by a keyword, such as form
, which of course will parse as for
followed by ‘m
’ as a separate identifier. There is a solution, which I’ll show you in the next article, but LLLPG does not solve the problem automatically.
Managing ambiguity, part 3: the slash operator
The slash operator suppresses the ambiguity warning between two or more alternatives. The warnings you saw for
token CompareOp() @[ ">=" | "<=" | '>' | '<' | '=' ];
can be suppressed by switching ‘|
’s to ‘/
’s:
token CompareOp() @[ ">=" / "<=" / '>' / '<' | '=' ];
’/
’ is transitive, so in this example, the ambiguity between ">="
and '>'
is suppressed and likewise between "<="
and '<'
. Note that the following will not suppress the warnings:
token CompareOp() @[ ">=" / "<=" | '>' / '<' | '=' ];
(Footnote: in fact this would have suppressed the warnings in earlier versions of LLLPG, but the logic has changed. I will describe only the new rules.)
Earlier I said that LLLPG “doesn’t care much about parenthesis”, so that, for instance,
rule Foo @[ ["AB" | "A" | "CD" | "C"]* ];
is equivalent to
rule Foo @[ [("AB" | "A") | ("CD" | "C")]* ];
That’s true, but as of version 1.1.0 it will now pay attention to the relationship between the slash operator and parenthesis. In
token CompareOp() @[ ">=" / "<=" | '>' / '<' | '=' ];
LLLPG is instructed to suppress warnings between ">=" / "<="
and '>' / '<'
, but that’s all. ‘/
’ has higher precedence than ‘|
’, so this is equivalent to
token CompareOp() @[ (">=" / "<=") | ('>' / '<') | '=' ];
And the ‘|
’ operator causes warnings between the two groups (">=" / "<="
and '>' / '<'
) to be permitted. If you write
token CompareOp() @[ ">=" | "<=" / '>' | '<' | '=' ];
no warnings are suppressed either, but if you now add parenthesis:
token CompareOp() @[ (">=" | "<=") / ('>' | '<') | '=' ];
then the warnings are suppressed, because there is now a slash separating ">="
/ '>'
and "<="
/ '<'
.
You may remember from Part 2 that slash is the alt-separator in PEGs. In LLLPG, the slash operator works similarly, but quite the same (in fact, /
always yields the same code as |
.) Here’s an example where a PEG would parse differently from LLLPG:
rule abc @[ ('a' / 'a' 'b') 'c' ];
In a PEG (correct me if I’m wrong), the first branch always takes priority and the second branch is unreachable. If the input is ac
, a PEG will not backtrack and try the 'a' 'b'
branch because the first branch was matched successfully. An LL(k) parser generator, however, performs prediction on the first k
characters, even if those characters are beyond the list of alternatives under consideration, and that means the 'c'
influences code generation, producing the following code (by default):
void abc()
{
int la1;
la1 = LA(1);
if (la1 == 'c')
Match('a');
else {
Match('a');
Match('b');
}
Match('c');
}
Managing ambiguity, part 4: greedy and nongreedy
LLLPG supports ‘greedy
’ and ‘nongreedy
’ loops and optional items. The ‘greedy
’
and ‘nongreedy
’ modes refer to the action you prefer to take in case of
ambiguity between an exit branch and another branch. Greedy is the default: it
means that if the input matches both a non-exit branch and an exit branch, the
non-exit branch should be taken. A typical greedy example is this rule for an
“if” statement:
private rule IfStmt @[
"if" "(" Expr ")" Stmt greedy("else" Stmt)?
];
private rule Stmt @[
IfStmt | OtherStmt | Expr ";" | ...
];
In this case, it is possible that the “if” statement is nested inside another “if” statement. Given that the input could be something like
if (expr) if (expr)
Stmt();
else
Stmt();
It is, in general, ambiguous whether to consume TT.Else Stmt
or to exit,
because the else clause could be paired with the first “if” or the second one.
The “greedy” modifier, which must always be paired with a loop or option
operator (* + ?
) means “in case of ambiguity with the exit branch, do not exit
and do not print a warning.” Since greedy behavior is the default, the greedy
modifier’s only purpose is to suppress the warning.
Now, you might logically think that changing ‘greedy
’ to ‘nongreedy
’ would
cause the ‘else
’ to match with the outer ‘if
’ statement rather than the inner
one. Unfortunately, that’s not what happens! It does not work because the code
generated for IfStmt
is not aware of the run-time call stack leading up to it:
it does not know whether it is nested inside another IfStmt
or not. LLLPG only
knows that it could be nested inside another ‘if
’ statement; the technical
jargon for this is that the follow set of the IfStmt
rule includes
TT.Else Stmt
.
What actually happens is that nongreedy(TT.Else Stmt)?
will never match
TT.Else
, and LLLPG will give you a warning that “branch 1 is unreachable”. Not
knowing the actual context in which IfStmt
was called, LLLPG is programmed to
assume that all possible follow sets of IfStmt
apply simultaneously, even
though in reality IfStmt
is called in one specific context. The statically
computed follow set of IfStmt
, which is based on all possible contexts where
IfStmt
might appear, includes TT.Else Stmt
, and nongreedy
uses this
information to decide, unconditionally, to let the exit branch win. To put it
another way, LLLPG behaves as if IfStmt
is always called from inside another
IfStmt
, when in reality it merely might be. It would be fairly difficult for
LLLPG to behave any other way; how is the IfStmt()
method supposed to know
call stack of other rules that called it?
By the way, I have the impression that the formal way of describing this
limitation of LLLPG’s behavior is to say that LLLPG supports only “strong”
LL(k) grammars, not “general” LL(k)
grammars (this is true even when you use FullLLk(true)
).
So at the end of a rule, LLLPG makes decisions based on all possible contexts
of that rule, rather than the actual context. Consequently, nongreedy
is not
as useful as it could be. However, nongreedy
still has its uses. Good
examples include strings and comments:
token TQString @[ "'''" (nongreedy(_))* "'''" ];
token MLComment @[ "/*" (nongreedy(MLComment / _))* "*/" ];
This introduces the single underscore _
, which matches any single terminal
(not including EOF).
The first example defines the syntax of triple-quoted strings '''like this
one'''
. The contents of the string are any sequence of characters except
"'''"
, which ends the string. The nongreedy
modifier is important; without
it, the loop (_)*
will simply consume all characters until end of file, and
then produce errors because the expected "'''"
was not found at EOF.
The second example for /* multi-line comments */
is similar, except that
(just for fun) I decided to support nested multi-line comments by calling the
MLComment
rule recursively.
There’s actually a bug in TQString
, assuming that LLLPG is left in its
default configuration. Moreover, LLLPG will not print a warning about it. Got
any idea what the bug is? I’m about to spoil the answer, so if you want to
give it some thought, do so now before you start glancing at the lower half of
this paragraph. Well, if you actually tested this code you might notice that a
string like '''one''two'''
will be parsed incorrectly, because two quotes,
not three, will cause the loop to exit. The reason is that the default maximum
lookahead is 2, so two quotes are enough to make LLLPG decide to exit the loop
(and then the third Match('\'')
in the generated code will fail). To fix
this, simply add a [k(3)]
attribute to the rule. No warning was printed
because half the purpose of nongreedy
is to suppress warnings; after all,
mixing (_)*
with anything else is inherently ambiguous and will frequently
cause a warning that you must suppress.
Earlier I ran into an unfortunate situation in which neither greedy
nor
nongreedy
was appropriate. I was writing a Visual Studio “classifier” for
syntax-highlighting of LES, and I decided to use a line-based design where
lexing would always start at the beginning of a line. Therefore, I needed to
keep track of which lines started inside multi-line comments and triple-quoted
strings. Now, if a line starts inside a comment or string, I invoke a special
rule that is designed to parse the rest of the comment or string, or stop at
the end of the line. Since LES supports nested multi-line comments, I wrote
the following rule:
// (LES code, so "nested::int" instead of "int nested")
public token MLCommentLine(ref nested::int)::bool @[
(nongreedy
( &{nested>0} "*/" {nested--;}
/ "/*" {nested++;}
/ ~('\r'|'\n')
))*
(Newline {return false;} | "*/" {return true;})
];
This rule takes the current comment nesting level as an argument (0 = comment is not nested) and updates the nesting level if it changes during the current line of code. The loop has three arms:
- For input of “*/” when comments are nested, reduce the nesting level
- For input of “/*”, increase the nesting level
- For input of anything else (not including a newline), consume one character.
I chose ‘nongreedy
’ because otherwise the third branch ~('\r'|'\n')
will
match the first character of “*/”, so the loop would never exit. But this
didn’t work; LLLPG gave the warning “branch 1 is unreachable”. Why is it
unreachable? I have to admit, I couldn’t figure it out at first. If you feel
like you’re stumped by LLLPG warnings sometimes, you’re not alone, they
sometimes confuse me too. In this case I was confused because I thought the
predicate &{nested>0}
would choose whether to stay in the loop or exit. But
in fact nongreedy
gives the exit branch a higher priority than the first
branch, so regardless of whether &{nested>0}
, LLLPG would always choose the
exit branch when the input is ‘*/
’.
At that point I realized that what I wanted was a loop that was neither greedy
nor nongreedy, in which the priority of the exit branch is somewhere in the
middle. I wanted to be able to write something like this, where ‘exit
’ is
higher priority than ~('\r'|'\n')
but lower priority than &{nested>0}
"*/"
:
public token MLCommentLine(ref nested::int)::bool @[
( &{nested>0} "*/" {nested--;}
/ "/*" {nested++;}
/ exit
/ ~('\r'|'\n')
)*
(Newline {return false;} | "*/" {return true;})
];
Unfortunately, LLLPG does not support this notation. Maybe in a future version. Here’s what I did instead:
public token MLCommentLine(ref nested::int)::bool @[
(greedy
( &{nested>0} "*/" {nested--;}
/ "/*" {nested++;}
/ ~('\r'|'\n'|'*')
/ '*' (&!'/')
))*
(Newline {return false;} | "*/" {return true;})
];
Here, I switched back to a greedy loop and added '*'
as its own branch with an
extra check to make sure '*'
is not followed by '/'
. If the test &!'/'
succeeds, the new fourth branch matches the '*'
character (but not the
character afterward); otherwise the loop exits. I could have also written it
like this, with only three branches:
public token MLCommentLine(ref nested::int)::bool @[
(greedy
( &{nested>0} "*/" {nested--;}
/ "/*" {nested++;}
/ (&!"*/") ~('\r'|'\n')
))*
(Newline {return false;} | "*/" {return true;})
];
However, this version is slower, because LLLPG will actually run the &!"*/"
test on every character within the comment.
Here’s one more example using nongreedy:
// Parsing a comma-separated value file (.csv)
public rule CSVFile @[ Line* ];
rule Line @[ Field greedy(',' Field)* (Newline | EOF) ];
rule Newline @[ ('\r' '\n'?) | '\n' ];
rule Field @[ nongreedy(_)*
| '"' ('"' '"' | nongreedy(~('\n'|'\r'))* '"' ];
This grammar describes a file filled with fields separated by commas (plus I
introduced the EOF symbol, so that no Newline is required at the end of the
last line). Notice that Field
has the loop nongreedy(_)*
. How does LLLPG
know to when to break out of the loop? Because it computes the “follow set” or
“return address” of each rule. In this case, ‘Field’ can be followed by
','|'\n'|'\r'|EOF
, so the loop will break as soon as one of these characters
is encountered. This is different than the IfStmt
example above in an
important respect: Field
always has the same follow set. Even though Field
is called from two different places, the follow set is the same in both
locations: ','|'\n'|'\r'|EOF
. So nongreedy
works reliably in this example
because it makes no difference which context Field
was called from.
Reference: APIs called by LLLPG
Here’s the list of methods that LLLPG expects to exist. The MatchRange
/MatchExceptRange
methods are only used in lexers though, and EOF
is only used in parsers (lexers refer to EOF as -1):
// Note: the set type is expected to contain a Contains(MatchType) method.
static HashSet<MatchType> NewSet(params MatchType[] items);
static HashSet<MatchType> NewSetOfRanges(params MatchType[] ranges);
LaType LA0 { get; }
LaType LA(int i);
static const LaType EOF;
void Error(int lookaheadIndex, string message);
// Normal matching methods
void Skip();
Token MatchAny();
Token Match(MatchType a);
Token Match(MatchType a, MatchType b);
Token Match(MatchType a, MatchType b, MatchType c);
Token Match(MatchType a, MatchType b, MatchType c, MatchType d);
Token Match(HashSet<MatchType> set);
Token MatchRange(int aLo, int aHi);
Token MatchRange(int aLo, int aHi, int bLo, int bHi);
Token MatchExcept();
Token MatchExcept(MatchType a);
Token MatchExcept(MatchType a, MatchType b);
Token MatchExcept(MatchType a, MatchType b, MatchType c);
Token MatchExcept(MatchType a, MatchType b, MatchType c, MatchType d);
Token MatchExcept(HashSet<MatchType> set);
Token MatchExceptRange(int aLo, int aHi);
Token MatchExceptRange(int aLo, int aHi, int bLo, int bHi);
// Used to verify and-predicates in the matching stage
void Check(bool expectation, string expectedDescr);
// For backtracking (used by generated Try_Xyz() methods)
struct SavePosition : IDisposable
{
public SavePosition(Lexer lexer, int lookaheadAmt);
public void Dispose();
}
// For recognizers (used by generated Scan_Xyz() methods)
bool TryMatch(MatchType a);
bool TryMatch(MatchType a, MatchType b);
bool TryMatch(MatchType a, MatchType b, MatchType c);
bool TryMatch(MatchType a, MatchType b, MatchType c, MatchType d);
bool TryMatch(HashSet<MatchType> set);
bool TryMatchRange(int aLo, int aHi);
bool TryMatchRange(int aLo, int aHi, int bLo, int bHi);
bool TryMatchExcept();
bool TryMatchExcept(MatchType a);
bool TryMatchExcept(MatchType a, MatchType b);
bool TryMatchExcept(MatchType a, MatchType b, MatchType c);
bool TryMatchExcept(MatchType a, MatchType b, MatchType c, MatchType d);
bool TryMatchExcept(HashSet<MatchType> set);
bool TryMatchExceptRange(int aLo, int aHi);
bool TryMatchExceptRange(int aLo, int aHi, int bLo, int bHi);
The following data types are parameters that you can change:
LaType
: the data type of LA0 and LA(i). This is alwaysint
in lexers, but in parsers you can use thelaType(...)
option (documented in the previous article) to change this type.MatchType
: the data type of arguments toMatch
,MatchExcept
,TryMatch
andTryMatchExcept
. In lexers,MatchType
is alwaysint
. In parsers, by default, LLLPG generates code as thoughMatchType
is same asLaType
, butBaseParser
usesint
instead for performance reasons. Consequently, when usingBaseParser
you need to use thematchType(int)
option to changeMatchType
toint
.HashSet<MatchType>
is the declared data type of large sets. By default this isHashSet<int>
but you can change it using thesetType(...)
option.Token
is the return value of theMatch
methods. LLLPG does not care and does not need to know what this type is. In lexers, these methods should return the character that was matched, and in parsers they should return the token that was matched (if the match fails, BaseLexer and BaseParser still return the character or token, whatever it was.)
And now, here’s a brief description of the APIs, with examples.
NewSet, NewSetOfRanges
static HashSet<MatchType> NewSet(params MatchType[] items);
static HashSet<MatchType> NewSetOfRanges(params MatchType[] ranges);
These are used for large sets, when it would be inappropriate to generate an expression or Match
call.
// Example:
LLLPG(lexer) {
rule Vowel @[ 'a'|'e'|'i'|'o'|'u'|'A'|'E'|'I'|'O'|'U' ];
rule MaybeHexDigit @[ ['0'..'9'|'a'..'f'|'A'..'F']? ];
};
// Generated code:
static readonly HashSet<int> Vowel_set0 = NewSet(
'A', 'E', 'I', 'O', 'U', 'a', 'e', 'i', 'o', 'u');
void Vowel()
{
Match(Vowel_set0);
}
static readonly HashSet<int> MaybeHexDigit_set0 =
NewSetOfRanges('0', '9', 'A', 'F', 'a', 'f');
void MaybeHexDigit()
{
int la0;
la0 = LA0;
if (MaybeHexDigit_set0.Contains(la0))
Skip();
}
LA0, LA(i)
LaType LA0 { get; }
LaType LA(int i);
LLLPG assumes that there is a state variable somewhere that tracks the “current input position”; the current position is usually called InputPosition
but LLLPG never refers to it directly. LA0
returns the character or token at the current position, and LA(i)
returns the character or token at InputPosition + i
.
Obviously, a single function LA(i)
would have been enough, but LA(0)
is used much more often than LA(i)
so I decided to define an extra API which gives implementations an opportunity to optimize access to LA0
. But in case LA0
and LA(i)
are nontrivial, LLLPG also caches the value of LA0
or LA(i)
in a local variable.
// Example:
LLLPG(parser) {
token OptionalIndefiniteArticle @[ ('a' 'n' / 'a')? ];
};
// Generated code:
void OptionalIndefiniteArticle()
{
int la0, la1;
la0 = LA0;
if (la0 == 'a') {
la1 = LA(1);
if (la1 == 'n') {
Skip();
Skip();
} else
Skip();
}
}
EOF (parsers only)
static const LaType EOF;
Occasionally LLLPG needs to check for EOF. For example, the default follow set of a rule is EOF, and when using NoDefaultArm
, LLLPG may check whether LA0==EOF to see if an error occurred.
[NoDefaultArm] LLLPG(parser) {
rule AllBs @[ 'B'* ];
};
void AllBs()
{
int la0;
for (;;) {
la0 = LA0;
if (la0 == 'B')
Skip();
else if (la0 == EOF)
break;
else
Error(0, "In rule 'MaybeB', expected one of: ('B'|EOF)");
}
}
In lexers, LLLPG uses -1
instead of EOF
.
Error(i, msg)
void Error(int lookaheadIndex, string message);
This method is called by the default error branch with an auto-generated message, as shown in the example above. lookaheadIndex
is the offset (LA(lookaheadIndex)
) where the unexpected character or token was encountered (usually 0). Currently, the error message cannot be customized.
Skip(), MatchAny()
void Skip();
Token MatchAny();
Both of these methods advance the current position by one character or token. Skip()
is called when the return value will not be used, while MatchAny()
is called if the return value is saved.
// Example
LLLPG(lexer) {
rule WhateverB @[ (_|EOF) [b='B']? ];
}
// Generated code
void WhateverB()
{
int la0;
Skip();
la0 = LA0;
if (la0 == 'B')
b = MatchAny();
}
Match
Token Match(MatchType a);
Token Match(MatchType a, MatchType b);
Token Match(MatchType a, MatchType b, MatchType c);
Token Match(MatchType a, MatchType b, MatchType c, MatchType d);
Token Match(HashSet<MatchType> set);
Ensures that LA0
matches the argument(s) given to Match
, taking any appropriate action (printing an error message or throwing an exception) if LA0
does not match the argument(s). Then LA0
is “consumed”, meaning that the input position is increased by one. LLLPG does not care about the return type, but the return value is used in expressions like zero:='0'
(see example).
// Example
LLLPG(lexer) {
rule FiveEvenDigits @[
zero:='0' ('0'|'2') ('0'|'2'|'4') ('0'|'2'|'4'|'6') ('0'|'2'|'4'|'6'|'8')
];
}
// Generated code
static readonly HashSet<int> FiveEvenDigits_set0 = NewSet('0', '2', '4', '6', '8');
void FiveEvenDigits()
{
var zero = Match('0');
Match('0', '2');
Match('0', '2', '4');
Match('0', '2', '4', '6');
Match(FiveEvenDigits_set0);
}
MatchRange (lexers only)
Token MatchRange(int aLo, int aHi);
Token MatchRange(int aLo, int aHi, int bLo, int bHi);
Matches LA0
against a range of characters, then increases the input position by one.
// Example
LLLPG(lexer) {
rule LetterDigit @[ ('a'..'z'|'A'..'Z') '0'..'9' ];
}
// Generated code
void LetterDigit()
{
MatchRange('A', 'Z', 'a', 'z');
MatchRange('0', '9');
}
MatchExcept
Token MatchExcept();
Token MatchExcept(MatchType a);
Token MatchExcept(MatchType a, MatchType b);
Token MatchExcept(MatchType a, MatchType b, MatchType c);
Token MatchExcept(MatchType a, MatchType b, MatchType c, MatchType d);
Token MatchExcept(HashSet<MatchType> set);
Ensures that LA0
does not match the argument(s) given to MatchExcept
, taking any appropriate action (printing an error message or throwing an exception) if LA0
matches the argument(s). Then LA0
is “consumed”, meaning that the input position is increased by one.
In addition, all overloads except the last one must test that LA0
is not EOF
. This rule makes MatchExcept()
(with no arguments) different from MatchAny()
which does allow EOF
.
When a set is passed to MatchExcept
, that set will explicitly contain EOF when EOF is not allowed.
// Example (remember that _ does NOT match EOF)
LLLPG(parser) {
rule MatchExcept @[ _ ~A ~(A|B) ~(A|B|C) ~(A|B|C|D)
~(A|B|C|D|E) (~E | EOF) ];
}
// Generated code
static readonly HashSet<int> NotA_set0 = NewSet(A, B, C, D, E, EOF);
static readonly HashSet<int> NotA_set1 = NewSet(E);
void MatchExcept()
{
MatchExcept();
MatchExcept(A);
MatchExcept(A, B);
MatchExcept(A, B, C);
MatchExcept(A, B, C, D);
MatchExcept(NotA_set0);
MatchExcept(NotA_set1);
}
MatchExceptRange (lexers only)
Token MatchExceptRange(int aLo, int aHi);
Token MatchExceptRange(int aLo, int aHi, int bLo, int bHi);
Verifies that LA0
is not within the specified range(s) of characters, then increases the input position by one.
// Example
LLLPG(lexer) {
rule NotInRanges @[ ~('0'..'9') ~('a'..'z'|'A'..'Z') ];
}
// Generated code
void NotInRanges()
{
MatchExceptRange('0', '9');
MatchExceptRange('A', 'Z', 'a', 'z');
}
Check
void Check(bool expectation, string expectedDescr);
As explained in the section §”Error handling mechanisms in LLLPG” (part 3), this is called to check and-predicate conditions during matching if they were not verified during prediction.
// Example
LLLPG(lexer) {
token DosEquis @[ &!{condition} 'X' 'X' ];
}
// Generated code
void DosEquis()
{
Check(!condition, "!(condition)");
Match('X');
Match('X');
}
SavePosition
struct SavePosition : IDisposable
{
public SavePosition(Lexer lexer, int lookaheadAmt);
public void Dispose();
}
This is used for backtracking. SavePosition
must save the current input position in its constructor, then restore it in Dispose()
.
// Example
LLLPG(lexer) {
token JustOneCapital @[ 'A'..'Z' &!('A'..'Z') ];
}
// Generated code
void JustOneCapital()
{
MatchRange('A', 'Z');
Check(!Try_JustOneCapital_Test0(0), "!([A-Z])");
}
private bool Try_JustOneCapital_Test0(int lookaheadAmt)
{
using (new SavePosition(this, lookaheadAmt))
return JustOneCapital_Test0();
}
private bool JustOneCapital_Test0()
{
if (!TryMatchRange('A', 'Z'))
return false;
return true;
}
TryMatch
bool TryMatch(MatchType a);
bool TryMatch(MatchType a, MatchType b);
bool TryMatch(MatchType a, MatchType b, MatchType c);
bool TryMatch(MatchType a, MatchType b, MatchType c, MatchType d);
bool TryMatch(HashSet<MatchType> set);
Tests whether LA0
matches the argument(s) given to TryMatch
. Returns true if LA0
is a match and false if not. The input position is increased by one.
// Example
LLLPG(lexer) {
[recognizer { bool ScanFiveEvenDigits(); }]
rule FiveEvenDigits @[
zero:='0' ('0'|'2') ('0'|'2'|'4') ('0'|'2'|'4'|'6') ('0'|'2'|'4'|'6'|'8')
];
}
// Generated code
static readonly HashSet<int> FiveEvenDigits_set0 = NewSet('0', '2', '4', '6', '8');
void FiveEvenDigits()
{
var zero = Match('0');
Match('0', '2');
Match('0', '2', '4');
Match('0', '2', '4', '6');
Match(FiveEvenDigits_set0);
}
bool Try_ScanFiveEvenDigits(int lookaheadAmt)
{
using (new SavePosition(this, lookaheadAmt))
return ScanFiveEvenDigits();
}
bool ScanFiveEvenDigits()
{
if (!TryMatch('0'))
return false;
if (!TryMatch('0', '2'))
return false;
if (!TryMatch('0', '2', '4'))
return false;
if (!TryMatch('0', '2', '4', '6'))
return false;
if (!TryMatch(FiveEvenDigits_set0))
return false;
return true;
}
TryMatchRange (lexers only)
bool TryMatchRange(int aLo, int aHi);
bool TryMatchRange(int aLo, int aHi, int bLo, int bHi);
Tests whether LA0
matches one or two ranges of characters. Returns true if LA0
is a match and false if not. The input position is increased by one.
// Example
LLLPG(lexer) {
[recognizer { bool ScanLetterDigit(); }]
rule LetterDigit @[ ('a'..'z'|'A'..'Z') '0'..'9' ];
}
// Generated code
void LetterDigit()
{
MatchRange('A', 'Z', 'a', 'z');
MatchRange('0', '9');
}
bool Try_ScanLetterDigit(int lookaheadAmt)
{
using (new SavePosition(this, lookaheadAmt))
return ScanLetterDigit();
}
bool ScanLetterDigit()
{
if (!TryMatchRange('A', 'Z', 'a', 'z'))
return false;
if (!TryMatchRange('0', '9'))
return false;
return true;
}
TryMatchExcept
bool TryMatchExcept();
bool TryMatchExcept(MatchType a);
bool TryMatchExcept(MatchType a, MatchType b);
bool TryMatchExcept(MatchType a, MatchType b, MatchType c);
bool TryMatchExcept(MatchType a, MatchType b, MatchType c, MatchType d);
bool TryMatchExcept(HashSet<MatchType> set);
Tests whether LA0
matches the argument(s) given to TryMatch
. Returns false if LA0
is a match and true if not. The input position is increased by one.
In addition, all overloads except the last one must test that LA0
is not EOF
. This rule makes TryMatchExcept()
(with no arguments) different from Skip()
which does allow EOF
.
Note: as I write this, it occurs to me that these APIs are redundant. LLLPG could have called TryMatch(...)
instead and inverted the return value. Should this API be removed in a future version?
// Example (remember that _ does NOT match EOF)
LLLPG(parser) {
[recognizer { bool TryMatchExcept(); }]
rule MatchExcept @[ _ ~A ~(A|B) ~(A|B|C) ~(A|B|C|D)
~(A|B|C|D|E) (~E | EOF) ];
}
// Generated code
static readonly HashSet<int> MatchExcept_set0 = NewSet(A, B, C, D, E, EOF);
static readonly HashSet<int> MatchExcept_set1 = NewSet(E);
void MatchExcept()
{ /* Omitted for brevity */ }
bool Try_TryMatchExcept(int lookaheadAmt)
{
using (new SavePosition(this, lookaheadAmt))
return TryMatchExcept();
}
bool TryMatchExcept()
{
if (!TryMatchExcept())
return false;
if (!TryMatchExcept(A))
return false;
if (!TryMatchExcept(A, B))
return false;
if (!TryMatchExcept(A, B, C))
return false;
if (!TryMatchExcept(A, B, C, D))
return false;
if (!TryMatchExcept(MatchExcept_set0))
return false;
if (!TryMatchExcept(MatchExcept_set1))
return false;
return true;
}
TryMatchExceptRange (lexers only)
bool TryMatchExceptRange(int aLo, int aHi);
bool TryMatchExceptRange(int aLo, int aHi, int bLo, int bHi);
Tests whether LA0
matches one or two ranges of characters. Returns false if LA0
is a match and true if not. The input position is increased by one.
I’ll skip the example this time: I think by now you get the idea.
Reference: things you must do when overriding BaseLexer
and BaseParser
For lexing
The typical base class for lexing is BaseLexer
, but you can specialize it as BaseLexer<UString>
, which should (in theory) give higher perfomance if your input is always a string.
In either case, you must call AfterNewline()
whenever you encounter a newline ('\n' | '\r' '\n'?)
so that the LineNumber
property is increased by one. BaseLexer
also contains its own Newline
rule, which you can incorporate into your lexer with
// 'extern' suppresses code generation, so the code is inherited
// from BaseLexer, and `'\r' '\n'? | '\n'` tells LLLPG what it does.
extern token Newline @{ '\r' '\n'? | '\n' };
BaseLexer
additionally records the locations of all line breaks in its SourceFile
property (which is protected
) so you can call SourceFile.IndexToLine(i).Line
to get the line number of any character that has been tokenized so far; this only works properly if you call AfterNewline
consistently.
When LLLPG was first released, you had to override the abstract
error handler:
protected abstract void Error(int lookaheadIndex, string message);
But now a default error handler is provided that throws LogException
, and the normal way to change how errors are reported is not to override Error()
, but instead to set the ErrorSink
property, e.g. this causes errors to be printed to the terminal:
base.ErrorSink = Loyc.MessageSink.Console;
For parsing
When LLLPG was first released, you were expected to use BaseParser
as your base class, which was tedious because you had to override all these methods:
protected abstract Int32 EofInt();
protected abstract Int32 LA0Int { get; }
protected abstract Token LT(int i);
protected abstract void Error(int li, string message);
protected abstract string ToString(Int32 tokenType);
Only one of the above APIs are required by LLLPG itself; the others help BaseParser
implement the other APIs. In addition to the above, you had to implement the following APIs that are required by LLLPG and not provided by BaseParser
:
// (typical implementation shown)
const TokenType EOF = TokenType.EOF;
TokenType LA0 { get { return LT0.Type(); } }
TokenType LA(int offset) { return LT(offset).Type(); }
Because using BaseParser
was cumbersome, BaseParserForList<Token,MatchType>
was introduced (and its specialized form BaseParserForList<Token,MatchType,List>
). BaseParserForList<Token,MatchType>
manages the list of tokens itself - any list that implements IList<Token>
is acceptable, and the derived class constructor must pass a list of tokens to the base class, along with a token that represents EOF, and an ISourceFile
(which you can get from the SourceFile
property of BaseLexer
):
protected BaseParserForList(IList<Token> list, Token eofToken, ISourceFile file, int startIndex=0);
BaseParserForList
only requires you to implement a single abstract
method, to convert MatchType
to a string. MatchType
is usually int
in practise, so your implementation might look like this (if TokenType
is the name of your token type enum):
protected override string ToString(int tokenType)
{
return ((TokenType)tokenType).ToString();
}
All the base classes have an InputPosition
property. BaseLexer
caches the current character in LA0
when InputPosition
changes, while BaseParser
caches the current token in LT0
when InputPosition
changes.
End of Part 4
With part four, I’ve almost finished writing the documentation of LLLPG. So that just leaves…
- Advanced techniques: tree parsing, keyword parsing, collapsing many precedence levels into a single rule, and other tricks used by the EC# parser.
- Things you can do with LeMP: other source code manipulators besides LLLPG.
- A call for volunteers to help me build Enhanced C#.
Stay tuned.