Enhanced C#
Language of your choice: library documentation
Nested classes | Public fields | Public Member Functions | List of all members
Loyc.LLParserGenerator.LLParserGenerator Class Reference

Encapsulates LLLPG, the Loyc LL Parser Generator, which generates LL(k) recursive-descent parsers. More...


Source files:

Remarks

Encapsulates LLLPG, the Loyc LL Parser Generator, which generates LL(k) recursive-descent parsers.

LLLPG is an LL(k) parser generator under the umbrella of the Loyc initiative (http://loyc.net). Home page: http://ecsharp.net/lllpg or http://lllpg.com

LLLPG generates recursive-descent parsers for LL(k) grammars. It is designed for parsing computer languages, not natural languages. It also it supports "syntactic predicates" which are zero-width syntactic assertions, and "semantic predicates" which are arbitrary expressions.

The LLParserGenerator class is the core engine. It generates parsers in the form of a Loyc tree, which can be printed out as C# code. Look at the documentation of the Run() method for an overview of how the LLLPG engine works.

Note: the input to LLLPG is usually provided in the form of LES/EC# source code. In that case, there is no need to use this class directly. The source code of Program.Main shows how to invoke LLLPG as a macro via the LeMP.Compiler.

For more information about how to use LLLPG, read http://www.codeproject.com/Articles/664785/A-New-Parser-Generator-for-Csharp

Nested classes

class  ComputeNext
 Gathers a list of all one-token transitions starting from a single position. Also gathers any and-predicates that must be traversed before completing a transition. More...
 
class  GenerateCodeVisitor
 Directs code generation using the visitor pattern to visit the predicates in a rule. The process starts with Generate(Rule). More...
 
class  GetCanonical
 Computes the "canonical" interpretation of a position for prediction purposes, so that ConsolidateDuplicatePositions can detect duplicates reliably. Call Do() to use. More...
 
class  GrammarPos
 Represents a location in a grammar: a predicate and a "return stack" which is a so-called persistent singly-linked list. This type is used within Transition. More...
 
class  KthSet
 Holds information about the first set or kth set of a single arm of an Alts. More...
 
class  PredictionAnalysisVisitor
 Performs prediction analysis using the visitor pattern to visit the predicates in a rule. The process starts with Analyze(Rule). More...
 
class  Transition
 Represents a position in a grammar (GrammarPos) plus the set of characters that leads to that position from the previous position. This is a single case in a KthSet. More...
 

Public fields

int DefaultK = 2
 Specifies the default maximum lookahead for rules that do not specify a lookahead value. More...
 
bool NoDefaultArm = false
 Normally, the last arm in a list of alternatives is chosen as the default. For example, in ("Foo" | "Bar"), the second branch is taken unless the input begins with 'F'. However, if this flag is true, there is no default arm on Alts unless one is specified explicitly, so a special error branch is generated when none of the alternatives apply. This increases code size and decreases speed, but the generated parser may give better error messages. More...
 

Public Member Functions

 LLParserGenerator (IPGCodeGenHelper csg, IMessageSink sink=null)
 
bool FullLLk = true
 Whether to use full LL(k) or "partly approximate" lookahead. More...
 
int Verbosity = 0
 Gets or sets the verbosity level. Verbose output can help you debug grammars that don't produce the expected code. More...
 
bool AddComments = true
 Whether to insert Specifies the default maximum lookahead for rules that do not specify a lookahead value. More...
 
bool AddCsLineDirectives = false
 Whether to add #line directives for C# before and after user actions. More...
 
bool PrematchByDefault = false
 If true, rules that are not marked private, public, or internal are assumed not to be called from outside the grammar so the prematch optimization can be used. More...
 
IMessageSink?? Sink [get, set]
 Called when an error or warning occurs while parsing a grammar or while generating code for a parser. Also called to print "verbose" messages. More...
 
IPGCodeGenHelper?? CodeGenHelper [get, set]
 
static Severity Warning = Severity.Warning
 
static Severity Error = Severity.Error
 
static Severity Verbose = Severity.Verbose
 
ISourceFile _sourceFile
 
WList< LNode_classBody
 
IPGCodeGenHelper _helper = new IntStreamCodeGenHelper()
 
ComputeNext _computeNext = new ComputeNext()
 
GetCanonical _getCanonical = new GetCanonical()
 
void AddRules (params Rule[] rules)
 
void AddRules (IEnumerable< Rule > rules)
 
void AddRule (Rule rule)
 
LNode Run (ISourceFile sourceFile)
 Generates a braced block of code {...} for the grammar described by the rules that were previously added to this object with AddRule or AddRules(Rule[]). More...
 
static Dictionary< Symbol, RuleAddRulesToDict (IEnumerable< Rule > rules, Dictionary< Symbol, Rule > dict=null)
 
KthSet[] ComputeFirstSets (Alts alts)
 
KthSet[] ComputeNextSets (List< KthSet > previous, Alts currentAlts)
 
KthSet ComputeNextSet (KthSet previous, bool addEOF)
 
void MakeCanonical (KthSet next)
 

Member Function Documentation

◆ Run()

LNode Loyc.LLParserGenerator.LLParserGenerator.Run ( ISourceFile  sourceFile)
inline

Generates a braced block of code {...} for the grammar described by the rules that were previously added to this object with AddRule or AddRules(Rule[]).

Parameters
sourceFile
Returns
The generated parser class.

[This may be outdated, TODO: review it]

By far the greatest difficulty in this process is generating prediction code when the grammar branches: (x | y | z). Since this class creates LL(k) parsers without memoization or implicit backtracking, it relies on prediction trees to correctly decide in advance which branch to follow.

The following kinds of grammar elements require prediction:

  • a | b (which is equivalent to a / b): prediction chooses between a and b
  • a?: prediction chooses between a and whatever follows a?
  • a*: prediction chooses between a and whatever follows a*
  • (a | b)*: prediction chooses between three alternatives (a, b, and exiting the loop).
  • (a | b)?: prediction chooses between three alternatives (a, b, and skipping both).
  • a+: exactly equivalent to a a*

All of these are based on an Alts object.

Let's look at a simple example of the prediction code generated for a rule called "Foo":

// rule a 'a' | 'A' };
// rule b 'b' | 'B' };
// public rule Foo a | b };
public void Foo()
{
var la0 = LA0;
if (la0 == 'a' || la0 == 'A')
a();
else
b();
}

By default, to make prediction more efficient, the last alternative is assumed to match if the others don't. So when a doesn't match, b is called even though it has not been verified to match yet. This behavior can be changed by setting NoDefaultArm=true.

Alternatively, you can select the default using the 'default' keyword, which controls the Alts.DefaultArm property, e.g.

// public rule Foo default a | b };
public void Foo()
{
int la0;
la0 = LA(0);
if (la0 == 'b' || la0 == 'B')
b();
else
a();
}

In simple cases like this one that only require LL(1) prediction, prediction and matching are merged into a single if-else chain. In more complicated cases, goto statements may be used to avoid code duplication (ANTLR uses pairs of if-else or switch statements instead, but I chose to use gotos because the generated code will be faster.) The if-else statements are the "prediction" part of the code, while the calls to a() and b() are the "matching" part.

Here's another example:

// public rule Foo (a | b? 'c')* };
public void Foo()
{
int la0;
for (;;) {
la0 = LA(0);
if (la0 == 'a' || la0 == 'A')
a();
else if (la0 == 'b' || la0 == 'B' || la0 == 'c') {
la0 = LA(0);
if (la0 == 'b' || la0 == 'B')
b();
Match('c');
} else
break;
}
}

A kleene star (*) always produces a "for(;;)" loop, while an optional item may produce a "do ... while(false)" pseudo-loop in some circumstances (but this case is too simple to require it). Here there are two separate prediction phases: one for the outer loop (a | b? 'c')*, and one for b?.

In this example, the loop appears at the end of the rule. In some such cases, the "follow set" of the rule becomes relevant. In order for the parser to decide whether to exit the loop or not, it may need to know what can follow the loop. For instance, if ('a' 'b')* is followed by 'a'..'z' 'c', it is not possible to tell whether to stay in the loop or exit just by looking at the first input character. If LA(0) is 'a', it is necessary to look at the second character LA(1); only if the second character is 'b' is it possible to conclude that 'a' 'b' should be matched.

Therefore, before generating a parser one of the steps is to build the follow set of each rule, by looking for places where a rule appears inside other rules. A rule is not aware of its current caller, so it gathers information from all call sites and merges it together. When a rule is marked "public", it is considered to be a starting rule, which causes the follow set to include $ (which means "end of input").

The fact that LLLPG is aware of follow sets and the differences between alternatives, and the fact that its generated parsers do not normally backtrack, makes LLLPG's LL(k) parsing tecnique fundamentally different from another popular parsing technique, PEG. The documentation of LLParserGenerator explains further.

Here's an example that needs more than one character of lookahead:

// public rule Foo 'a'..'z'+ | 'x' '0'..'9' '0'..'9' };
public void Foo()
{
int la0, la1;
do {
la0 = LA(0);
if (la0 == 'x') {
la1 = LA(1);
if (la1 >= '0' &amp;&amp; '9' >= la1) {
Match();
Match();
MatchRange('0', '9');
} else
goto match1;
} else
goto match1;
break;
match1:
{
Match();
for (;;) {
la0 = LA(0);
if (la0 >= 'a' &amp;&amp; 'z' >= la0)
Match();
else
break;
}
}
} while (false);
}

Here, the prediction and matching phases are merged for the second alternative, but separate for the first alternative (because it is chosen in two different places in the prediction logic). Notice that the matching for alt 2 starts with Match() twice, with no arguments, but is followed by MatchRange('a', 'z'). This demonstrates communication from prediction to matching: the matching phase can tell that LA(0) is confirmed to be 'x', and LA(1) is confirmed to be '0'..'9', so an unconditional match suffices. However, nothing is known about LA(2) so its value must be checked, which is what MatchRange() is supposed to do.

In some cases, LA(0) is irrelevant. Consider this example:

// public rule Foo '(' 'a'..'z'* ')' | '(' '0'..'9'+ ')' };
public void Foo()
{
int la0, la1;
la1 = LA(1);
if (la1 >= 'a' &amp;&amp; 'z' >= la1) {
Match('(');
for (;;) {
la0 = LA(0);
if (la0 >= 'a' &amp;&amp; 'z' >= la0)
Match();
else
break;
}
Match(')');
} else {
Match('(');
MatchRange('0', '9');
for (;;) {
la0 = LA(0);
if (la0 >= '0' &amp;&amp; '9' >= la0)
Match();
else
break;
}
Match(')');
}
}

Here, the first character of both alternatives is always '(', so looking at LA(0) doesn't help choose which branch to take, and prediction skips ahead to LA(1).

And-predicates

An and-predicate specifies an extra condition on the input that must be checked. Here is a simple example:

(&amp;{flag} '0'..'9' | 'a'..'z')

This example says that '0'..'9' is only allowed if the expression flag evaluates to true, otherwise 'a'..'z' is required. LLPG, however, gives and-predicates lower priority, and always inverts the order of the testing: it checks for '0'..'9' first, then checks flag afterward. I chose to make LLPG work this way because in general, and- predicates can be much more expensive to check than character sets; if one of the alternatives rarely runs, it would be wasteful to check an expensive and-predicate before checking if the input character could possibly match. Therefore, the generated code looks like this:

la0 = LA(0);
if (la0 >= '0' &amp;&amp; la0 <= '9') {
Check(flag);
Match();
} else
MatchRange('a', 'z');

If you really need to make the and-predicate run first for some reason, I dunno. I got nothin'. Complain to me every month until I implement something, maybe.

A generated parser performs prediction in two interleaved parts: character-set tests, and and-predicate tests. In this example,

('0'..'9'+ | &amp;{hexAllowed} '0' 'x' ('0'..'9'|'a'..'f')+)

The code will look like this:

do {
la0 = LA(0);
if (la0 == '0') {
if (hexAllowed) {
la1 = LA(1);
if (la1 == 'x') {
Match();
Match();
MatchRange('0', '9', 'a', 'f');
...
} else
goto match1;
} else
goto match1;
} else
goto match1;
break;
match1:;
{
MatchRange('0', '9');
...
}
} while (false);

Here you can see the interleaving: first the parser checks LA(0), then it checks the and-predicate, then it checks LA(1).

LLPG (let's call it 1.0) does not support any analysis of the contents of an and-predicate. Thus, without loss of generality, these examples use semantic predicates &{...} instead of syntactic predicates &(...); LLPG can't "see inside them" either way.

Even without analyzing the contents of an and-predicate, they can still make prediction extremely complicated. Consider this example:

(.&amp;{a} (&amp;{b} {B();} | &amp;{c})
&amp;{d} [&amp;{e} ('e'|'E')]?
(&amp;{f} ('f'|'t') | 'F')
| &amp;{c} (&amp;{f} ('e'|'t') | 'f') 'g'
| '!' )

In this example, the first branch requires 'a' and 'd' to be true, there's a pair of zero-width alternatives that require 'b' or 'c' to be true, {B()} must be executed if 'b' is true, 'e' must be true if LA(0) is ('e'|'E'), 'f' must be true if LA(0) is 'f' and no condition is required for 'F'. The second branch also allows 'e' or 'f', provided that 'c' is true, but requires 'f' if LA(0) is 'e'.

LLLPG appears to handle this case correctly.

References Loyc.LLParserGenerator.IPGCodeGenHelper.Begin(), Loyc.LLParserGenerator.IPGCodeGenHelper.Done(), Loyc.Localize.Localized(), and Loyc.LLParserGenerator.LLParserGenerator.Verbosity.

Member Data Documentation

◆ AddComments

bool Loyc.LLParserGenerator.LLParserGenerator.AddComments = true

Whether to insert Specifies the default maximum lookahead for rules that do not specify a lookahead value.

◆ AddCsLineDirectives

bool Loyc.LLParserGenerator.LLParserGenerator.AddCsLineDirectives = false

Whether to add #line directives for C# before and after user actions.

Superceded. It is preferred to use the #lines macro instead.

◆ DefaultK

int Loyc.LLParserGenerator.LLParserGenerator.DefaultK = 2

Specifies the default maximum lookahead for rules that do not specify a lookahead value.

◆ FullLLk

bool Loyc.LLParserGenerator.LLParserGenerator.FullLLk = true

Whether to use full LL(k) or "partly approximate" lookahead.

LLLPG's original disambiguation mode was similar to the "linear approximate" lookahead present in the ANTLR v2 parser generator. However, ANTLR's linear approximate lookahead failed to predict the following case correctly:

Foo ('a' 'b' | 'c' 'd') ';'
| 'a' 'd' ';' };

LLLPG has no problem with this case. However, LLLPG's "somewhat approximate" lookahead still has problems with certain cases involving nested alternatives. Here's a case that it can't handle:

Foo ('a' 'b' | 'b' 'a') ';'
| ('a' 'a' | 'b' 'b') ';' };

Basically here's what goes wrong: LLLPG detects that both alternatives can start with 'a' or 'b'. The way it normally builds a prediction tree is by creating a test for the common set between two alternatives:

la0 = LA(0);
if (la0 == 'a' || la0 == 'b') { ... alt 1 or alt 2 ... }

Then, inside that "if" statement it adds a test for LA(1). Sadly, LLLPG discovers that if (la1 == 'a' || la1 == 'b'), both alternatives still apply. Thus, it can't tell the difference between the two and gives up, picking the first alternative unconditionally and printing a warning that "Branch 2 is unreachable".

To fix this, LLLPG must figure out that it should split the LA(0) test into two separate "if" clauses. I've figured out how to do this, and after using it for some time I'm fairly confident that it works correctly, so I have enabled it by default as of v2.7.0.0. Full LL(k) creates subtly different results than the original prediction method, and overall runtime may be significantly faster or slower.

◆ NoDefaultArm

bool Loyc.LLParserGenerator.LLParserGenerator.NoDefaultArm = false

Normally, the last arm in a list of alternatives is chosen as the default. For example, in ("Foo" | "Bar"), the second branch is taken unless the input begins with 'F'. However, if this flag is true, there is no default arm on Alts unless one is specified explicitly, so a special error branch is generated when none of the alternatives apply. This increases code size and decreases speed, but the generated parser may give better error messages.

When this flag is false, an error branch is still generated on a particular loop if requested with Alts.ErrorBranch.

◆ PrematchByDefault

bool Loyc.LLParserGenerator.LLParserGenerator.PrematchByDefault = false

If true, rules that are not marked private, public, or internal are assumed not to be called from outside the grammar so the prematch optimization can be used.

◆ Verbosity

int Loyc.LLParserGenerator.LLParserGenerator.Verbosity = 0

Gets or sets the verbosity level. Verbose output can help you debug grammars that don't produce the expected code.

Level 1 verbosity prints simplified prediction trees in each rule, and the follow sets of each rule. Level 2 verbosity prints prediction trees before they are simplified, and before they have been extended to handle unspecified cases (e.g. if your rule says 'a' 'b' | 'c' 'd', the unspecified cases are all other possible inputs.) Level 3 verbosity prints level 1 and 2 information.

Referenced by Loyc.LLParserGenerator.LLParserGenerator.Run().

Property Documentation

◆ Sink

IMessageSink?? Loyc.LLParserGenerator.LLParserGenerator.Sink
getset

Called when an error or warning occurs while parsing a grammar or while generating code for a parser. Also called to print "verbose" messages.

The parameters are (1) a Node that represents the location of the error, or Node.Missing if the grammar was created programmatically without any source code backing it; (2) a predicate related to the error, or null if the error is a syntax error; (3) "Warning" for a warning, "Error" for an error, or "Verbose"; and (4) the text of the error message.