2. Simple examples
30 May 2016

Here’s a simple “scanner” in LLLPG - it’s a scanner, in the sense that it scans the input and detects errors but doesn’t produce any output:
LLLPG (lexer)
{
public rule Integer @{ Digit+ };
public rule Digit @{ '0'..'9' };
};
Here I used the original property-like notation for the rules. In LLLPG 1.7.5, an ANTLR-style notation was added, which looks like this:
LLLPG (lexer)
@{
public Digit : '0'..'9';
public Integer : Digit+;
};
Most of this manual uses the first style. Either way, the output looks like this:
public void Digit()
{
MatchRange('0', '9');
}
public void Integer()
{
int la0;
Digit();
for (;;) {
la0 = LA0;
if (la0 >= '0' && la0 <= '9')
Digit();
else
break;
}
}
That’s it! So here are some relevant facts to learn at this point:

- “LA” means “lookahead”. LLLPG expects a property called LA0 to exist that returns the current input character; it also expects an LA(int k) method for other lookahead values.
- The grammar must be wrapped in an LLLPG block. Use “LLLPG (lexer)” for a lexer and “LLLPG (parser)” for a parser. The difference between the two is the treatment of terminals (characters or tokens):
  - Lexer mode understands only integer and character input, but is optimized for this input. It does not accept named constants, only literal numbers and characters, because it can’t tell which number a name might refer to (Edit: there is now an alias statement for naming constants). This mode assumes -1 means EOF (end-of-file). Note that lookahead variables have type int, not char, because char cannot hold -1, the representation of EOF. Luckily, C# doesn’t really mind (char converts to int implicitly, although not the other way around).
  - Parser mode does not understand numbers, only symbols. Theoretically, the parser mode could be used for lexers too; the main problem is that it does not understand the range operator .., so you’d have to write monstrosities like '0'|'1'|'2'|'3'|'4'|'5'|'6'|'7'|'8'|'9' instead of '0'..'9'. Yuck. Parser mode expects you to define a C# symbol called EOF that represents end-of-file. In parser mode, a symbol called Foo is assumed to be a terminal if there is no rule called Foo.
- Each rule gets converted into a method with the same name. Attributes on the rule, such as public or unsafe (why are you using unsafe in a parser, smarty pants?), are transferred to the output. Rules support a few attributes, such as [FullLLk], that are understood by LLLPG itself and stripped out of the output. The private attribute is slightly special; it enables an optimization called prematch analysis that non-private rules don’t get.
- The lexer mode expects a method called MatchRange() to exist (and both modes expect a series of Match() methods for matching particular characters or tokens). MatchRange must test whether the input character matches the specified range, and emit an error message if not. On mismatch, you can throw an exception, print a message, or whatever suits you. On success, MatchRange() should return the consumed character so that you can store it in a variable if you want. A runtime library is provided (Loyc.Syntax.dll) that includes all the required methods, including MatchRange. (A minimal sketch of this helper contract appears after this list.)
- The + operator means “one or more of these”. Digit+ is exactly equivalent to Digit Digit*; the * means “zero or more of these”, and the * operator is translated into a for-loop, as you can see in the output (as you probably know, for (;;) means “loop indefinitely”; it’s equivalent to while (true).)
- This example also demonstrates the main characteristic of LL(k) parsers: prediction. The if (la0 >= '0' && la0 <= '9') statement is performing a task called “prediction”, which means it is deciding which branch to take (Digit? or exit the loop?). It must reach across rules to do this: each rule requires an analysis of every other rule it calls, in addition to analysis inside the rule itself. In this case, Integer must be intimately familiar with the contents of Digit. Which is kind of romantic, when you think about it. Or not.
- The body of the rule is enclosed in @{...}; (or the entire grammar, when using ANTLR-style syntax). Why not just plain braces or something else? Because LLLPG is embedded inside another programming language, and it cannot change the syntax of the host language. The construct “public rule Foo @{...};” is actually parsed by EC# as a property declaration, except with @{...} instead of the usual {...}. The @{...} is something called a token literal, which is a list of tokens (actually a tree, which matches pairs of ( ) { } [ ]). The EC# parser gathers up all the tokens and stores them for later. After the entire source file is parsed, the macro processor gives LLLPG a chance to receive the token tree and transform it into something else. LLLPG runs its own independent parser to process the token tree. Finally, it replaces the LLLPG block with normal C# code that it generated. I’ll explain this process in more detail later.
- To keep this example simple and brief I didn’t bother with any “using” statements, and I didn’t wrap this code in a namespace or a class. LLLPG (or more precisely, the LeMP preprocessor) doesn’t care; the output reflects the input, so the output will likewise not have any “using” statements and won’t be wrapped in a class or a namespace either. Garbage in, garbage out. If you want the output to be wrapped in a class declaration, you have to wrap the input in a class declaration.
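To make that helper contract concrete, here is a minimal hand-written sketch of the members an LLLPG lexer relies on. This is only an illustration of what was just described - the class name, the field names, and the exception-throwing error handling are my own choices, and in practice you would normally use BaseLexer or LexerSource from Loyc.Syntax.dll rather than write this yourself:

using System;

// Hypothetical helper base class sketching the members an LLLPG lexer expects.
class MyLexerBase
{
    protected string _input;       // the text being scanned (hypothetical field)
    protected int InputPosition;   // index of the next character to read

    protected MyLexerBase(string input) { _input = input; }

    // LA0: the current input character, or -1 (EOF) at the end of the input.
    protected int LA0 => InputPosition < _input.Length ? _input[InputPosition] : -1;

    // LA(k): the character k positions ahead of LA0, or -1 past the end.
    protected int LA(int k)
    {
        int i = InputPosition + k;
        return i < _input.Length ? _input[i] : -1;
    }

    // Skip(): consume the current character without verifying it.
    protected void Skip() { InputPosition++; }

    // Match(c): verify that the current character is c, then consume and return it.
    protected int Match(char c)
    {
        int la = LA0;
        if (la != c)
            throw new FormatException("Expected '" + c + "' at index " + InputPosition);
        InputPosition++;
        return la;
    }

    // MatchRange(lo, hi): verify that the current character is in the range lo..hi,
    // then consume and return it so the caller can store it in a variable.
    protected int MatchRange(char lo, char hi)
    {
        int la = LA0;
        if (la < lo || la > hi)
            throw new FormatException("Expected '" + lo + "'..'" + hi + "' at index " + InputPosition);
        InputPosition++;
        return la;
    }
}

The generated Integer and Digit methods above only need LA0 and MatchRange; the other members come up in the examples that follow.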
A simple lexer
My next example is almost useful. Now, somehow you need to provide the helper methods like MatchRange to LLLPG, and you can do that either with a special base class (BaseLexer in the Loyc.Syntax.dll runtime library) or a helper object (LexerSource, which is available as a standalone version, or in Loyc.Syntax.dll). The following example uses the latter:
using System;
using Loyc.Syntax.Lexing;
enum TT
{
Integer, Identifier, Operator
}
class MyLexer
{
LexerSource _src;
public MyLexer(string input) { _src = (LexerSource)input; }
public Token NextToken()
{
// The token's text is the span of input consumed by the Token() rule.
int startPosition = _src.InputPosition;
TT tokenType = Token();
int length = _src.InputPosition - startPosition;
string text = _src.CharSource.Substring(startPosition, length);
return new Token((int)tokenType, startPosition, length, text);
}
LLLPG (lexer(inputSource: _src, inputClass: LexerSource)) @{
Token returns [TT result]
: Op {$result = TT.Operator;}
| Id {$result = TT.Identifier;}
| Int {$result = TT.Integer;}
;
private token Letter : ('a'..'z'|'A'..'Z');
private token Id : (Letter|'_')
(Letter|'_'|'0'..'9')*;
private token Int: '0'..'9'+;
private token Op: '+'|'-'|'*'|'/';
};
}
This example uses inputSource: _src, inputClass: LexerSource to tell LLLPG where the helper methods are (in the _src object and the LexerSource class).
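Just to show how the pieces fit together, here is a hypothetical way to drive this class. Note that the grammar above doesn’t skip whitespace or detect end-of-input, so this sketch feeds it an input containing exactly three tokens and no spaces:

var lexer = new MyLexer("x1+23");
Token t1 = lexer.NextToken();   // an Identifier token covering "x1"
Token t2 = lexer.NextToken();   // an Operator token covering "+"
Token t3 = lexer.NextToken();   // an Integer token covering "23"

A real lexer would also need a rule for whitespace and a way to signal end-of-input.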
Unlike ANTLR, which generates text output and treats actions in braces {...}
like plain text, LeMP (which comes with LLLPG) fully parses its input, and generates output in the form of a Loyc tree, not text. A separate library is in charge of formatting that Loyc tree as C# code (I welcome volunteers to write output libraries for other languages such as C++ and Java. You won’t just be helping LLLPG itself, but the entire Loyc project! Let me know if you’re interested.)
The output code includes the plain C# code such as the MyLexer constructor and the NextToken method, as well as the output of LLLPG. LLLPG’s output, which becomes part of the MyLexer class, looks like this:
TT Token()
{
int la0;
TT result = default(TT);
// Line 25: ( Op | Id | Int )
la0 = _src.LA0;
switch (la0) {
case '*':
case '+':
case '-':
case '/':
{
Op();
#line 25 "Untitled.ecs"
result = TT.Operator;
#line default
}
break;
default:
if (la0 >= 'A' && la0 <= 'Z' || la0 == '_' || la0 >= 'a' && la0 <= 'z') {
Id();
#line 26 "Untitled.ecs"
result = TT.Identifier;
#line default
} else {
Int();
#line 27 "Untitled.ecs"
result = TT.Integer;
#line default
}
break;
}
return result;
}
void Letter()
{
_src.Skip();
}
void Id()
{
int la0;
// Line 31: (Letter | [_])
la0 = _src.LA0;
if (la0 >= 'A' && la0 <= 'Z' || la0 >= 'a' && la0 <= 'z')
Letter();
else
_src.Match('_');
// Line 32: ( Letter | [_] | [0-9] )*
for (;;) {
la0 = _src.LA0;
if (la0 >= 'A' && la0 <= 'Z' || la0 >= 'a' && la0 <= 'z')
Letter();
else if (la0 == '_')
_src.Skip();
else if (la0 >= '0' && la0 <= '9')
_src.Skip();
else
break;
}
}
void Int()
{
int la0;
_src.MatchRange('0', '9');
// Line 33: ([0-9])*
for (;;) {
la0 = _src.LA0;
if (la0 >= '0' && la0 <= '9')
_src.Skip();
else
break;
}
}
void Op()
{
_src.Skip();
}
This example demonstrates some new things:
- The $result = ... statements in braces are called actions. Although actions are parsed as C# code (not plain text), it’s important to understand that LLLPG does not understand them. For example, you could write a return statement in an action, but LLLPG would not understand that this causes the rule to exit, and would not take that into account during its analysis of the grammar. In this case we could use a return statement safely, but in more complicated situations you might introduce a bug in your parser by doing unexpected control flow. So, be smart about it.
- Unlike most parser generators, LLLPG doesn’t create output for you. That’s why a NextToken() method exists outside the grammar to create a Token structure from the text of the token. Token is included in the runtime library, but you could easily write your own. Now, this example is overly simplistic. In most real-life parsers you’ll want to parse the token, e.g. given “1234” you’ll want to construct the integer 1234, not merely a string. Typically this would be done using custom code inside the Int rule itself (a sketch of this appears after this list).
- In the output, notice that the Letter() and Op() methods simply call Skip(), which means “advance to the next input character”. That’s because LLLPG has analyzed the grammar and detected that in all the places where Letter() or Op() is called, the caller has already verified the input! So Letter() and Op() don’t have to check the input; it’s already guaranteed to be correct. This is an optimization called prematch analysis. Note: For this optimization to work, the rule must be explicitly marked private; otherwise, LLLPG assumes that the rule could be called by code outside the grammar.
- Similarly, there are statements like if (la0 == '_') Skip(); rather than if (la0 == '_') Match('_');. The user-supplied Match(x) method must check whether LA0 matches x, but Skip() is faster since it skips the check.
- All the tokens are marked with the word token. The meaning of token will be explained later; it is recommended to put it on all tokens in a lexer, although it often has no effect (in a sense, token is the opposite of the fragment marker in ANTLR… sort of).
- LLLPG uses a switch statement if it suspects the code could be more efficient that way. Here it used switch() to match Op(). However, it tries to balance code size with speed. It does not use switch cases for Id() because it would need 53 “case” labels to match it (26 uppercase + 26 lowercase + '_'), which seems excessive.
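Here is the sketch promised in the second point above: a version of the Int rule that computes the integer’s value while scanning. This is an untested, hypothetical fragment - it assumes returns [...] can be combined with token in the ANTLR-style syntax, that $result starts out as default(int) (zero) just as result did in Token(), and that c:= binds each matched digit (an int) to a local variable, as := does elsewhere in this manual:

private token Int returns [int result] :
    (c:='0'..'9' { $result = $result * 10 + (c - '0'); })+
    ;

In a real parser you might also want to guard against overflow, or use long or double for large numbers.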
By the way, this example uses the ANTLR-style syntax mode again. However, most of this manual uses the original syntax because most of the text was written before the ANTLR mode existed.
Think twice: Do you really need a parser generator?
One of the most common introductory examples for any parser generator is an expression parser or calculator, maybe something along these lines:
const int id = 0, num = 1; // other token types: '(', ')', '-', '+', etc.
LLLPG (parser(TerminalType: Token))
@{
Atom returns [double result]
: id { $result = Vars[(string) $id.Value]; }
| num { $result = (double) $num.Value; }
| '-' result=Atom { $result = -$result; }
| '(' result=Expr ')'
| error { $result = 0;
Error(0, "Expected identifer, number, or (stuff)"); }
;
MulExpr returns [double result] :
result:Atom
(op:=('*'|'/') rhs:=Atom { $result = Do(result, op, rhs); })*
;
AddExpr returns [double result] :
result:MulExpr
(op:=('+'|'-') rhs:=MulExpr { $result = Do(result, op, rhs); })*
{ return result; }
;
Expr returns [double result]
: t:=id '=' result:Expr { Vars[t.Value.ToString()] = $result; }
| result:AddExpr
;
};
double Do(double left, Token op, double right)
{
switch (op.Type) {
case '+': return left + right;
case '-': return left - right;
case '*': return left * right;
case '/': return left / right;
}
return double.NaN;
}
But if expression parsing is all you need, you really don’t need a parser generator. For example, you can use the LES parser in LoycCore, which is great for parsing simple expressions. Or you could use a Pratt Parser like this one. If you only need to parse simple text fields like phone numbers, you can use regular expressions. And even if you need an entire programming language, you don’t necessarily need to create your own; again, in many cases the LES parser is perfectly sufficient.
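For instance, checking a simple fixed-format field such as a North American phone number is a one-liner with System.Text.RegularExpressions (a minimal illustration; the pattern here is just an example, not a complete phone-number validator):

using System.Text.RegularExpressions;

bool ok = Regex.IsMatch("604-555-1234", @"^\d{3}-\d{3}-\d{4}$");  // true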
So before you go writing a parser, especially if it’s for something important rather than “for fun”, seriously consider whether an existing parser would be good enough.
Next up
This concludes the introductory material. From here, you can choose where to go next:
- Download & install instructions (for LeMP, which includes LLLPG)
- View samples in the LLLPG-Samples repository
- The making-of: Learn the history of LLLPG and how it is related to EC#, LeMP and LES
- Parsing theory: learn about parsing terminology, especially if you’ve never written a parser. This contains some LLLPG-specific information, so you might want to skim it even if you already know what you’re doing.
- Grammar features: for example, operators like &, =>, ~ and / that you can put in a rule.
- About the Loyc Libraries
- How to write a parser: everything from boilerplate to syntax trees
For a complete list of articles, please visit the home page.