Configuring & invoking LLLPG
30 May 2016
Command-line options
LLLPG can be invoked either with the custom tool for Visual Studio, or on the command line (or in a pre-build step) by running LLLPG.exe filename. When using the custom tool, command-line options can be written in the “Default Namespace” field in Visual Studio, although not all of them are supported.
The following command-line options are reported by LLLPG --help, though they are rarely necessary.
--forcelang: Specifies that --inlang overrides the input file extension.
Without this option, known file extensions override --inlang.
--help: show this screen
--inlang=name: Set input language: --inlang=ecs for Enhanced C#, --inlang=les for LES
--macros=filename.dll: load macros from given assembly
--max-expand=N: stop expanding macros after N nested or iterated expansions.
--noparallel: Process all files in sequence
--nostdmacros: Don't scan LeMP.StdMacros.dll or pre-import LeMP and LeMP.Prelude
--outext=name: Set output extension and optional suffix:
.ecs (Enhanced C#), .cs (C#), .les (LES)
This can include a suffix before the extension, e.g. --outext=.output.cs
If --outlang is not used, output language is chosen by file extension.
--outlang=name: Set output language independently of file extension
--parallel: Process all files in parallel (this is the default)
--set:key=literal: Associate a value with a key (use #get(key) to read it back)
--snippet:key=code: Associate code with a key (use #get(key) to read it back)
--timeout=N: Aborts the processing thread(s) after this many seconds (0=never)
--verbose: Print extra status messages (e.g. discovered Types, list output files).
Note: in VS, the [Verbosity(N)] grammar attribute doesn’t work without the --verbose option.
Invoking LLLPG in source code
In your *.ecs or *.les input file, the syntax for invoking LLLPG is to use one of these statements:
[general options]
LLLPG(lexer(code generation options)) { /* grammar */ };
LLLPG(lexer) { /* grammar */ };
[general options]
LLLPG(parser(code generation options)) { /* grammar */ };
LLLPG { /* parser mode is the default */ };
Note: LES currently requires the semicolon while EC# does not. LES files also permit LLLPG lexer {...} and LLLPG parser {...} without parentheses, which (due to the syntax rules of LES) is exactly equivalent to LLLPG(lexer) {...} or LLLPG(parser) {...}.
The braces can be omitted, leaving only a semicolon. In that case the remainder of the current block is treated as the grammar.
The rules of your grammar go inside the braces, but normally you can also put ordinary code inside the braces, such as fields, methods, and child classes.
To use the ANTLR-style syntax mode, put an @ before the opening brace, e.g.
LLLPG(lexer) @{ /* grammar */ };
In this case the braces are required, and normal code (fields, methods, etc.) is not allowed except inside an additional set of braces, e.g.
LLLPG(lexer) @{
    /* grammar */
    {/* normal code */}
    /* grammar */
};
Code generation options
The following options are available for both lexer and parser:
inputSource: v and inputClass: T: needed by “static” lexers/parsers and parsers in structs. See the section ‘Using LLLPG with an “external” API’ below.
terminalType: T: data type of terminals. This is used by the colon operator: x:Terminal becomes x = Match(Terminal) in the output and declares a variable x of this type to store the terminal.
setType: T: data type for large sets. When you write a set with more than four elements, such as 'a'|'e'|'i'|'o'|'u'|'y', LLLPG generates a set object and uses set.Contains(la0) for prediction and Match(set) for matching. For example, instead of Match('a', 'e', 'i', 'o', 'u', 'y') it generates a set with a statement like static HashSet<int> RuleName_set0 = NewSet('a', 'e', 'i', 'o', 'u', 'y'); and then calls Match(RuleName_set0). The default is HashSet<int>.
listInitializer: e: sets the data type of lists declared automatically when you use the +: operator. An initializer like Type x = expr causes Type to be used as the list type and expr as the initialization expression. The Type can have a type parameter T that is replaced with the appropriate item type. The default is listInitializer: List<T> = new List<T>().
noCheckByDefault: bool: if this option is true, calls to Check() are eliminated when using semantic or syntactic predicates.
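To picture what the setType option controls, here is a minimal plain C# sketch of set-based prediction. It is not LLLPG's generated output: the NewSet helper is a hypothetical stand-in for the one named above, and the class, field, and method names are invented for illustration.

```csharp
using System;
using System.Collections.Generic;

static class SetPredictionSketch
{
    // Hypothetical stand-in for the NewSet helper mentioned in the text.
    static HashSet<int> NewSet(params int[] items) => new HashSet<int>(items);

    // One set object per large terminal set, similar in spirit to RuleName_set0.
    static readonly HashSet<int> Vowel_set0 = NewSet('a', 'e', 'i', 'o', 'u', 'y');

    // Prediction step: test the lookahead character against the set.
    public static bool IsVowelAhead(string input, int pos)
    {
        int la0 = pos < input.Length ? input[pos] : -1; // -1 stands for EOF
        return Vowel_set0.Contains(la0);
    }

    static void Main()
    {
        Console.WriteLine(IsVowelAhead("end", 0));
        Console.WriteLine(IsVowelAhead("xyz", 0));
        Console.WriteLine(IsVowelAhead("", 0));
    }
}
```

The set holds int values rather than char so that the EOF marker (-1) can be tested with the same Contains call.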
The following options are available only for parser:
laType: T: data type of la0, la1, etc. Typically this is the name of an enum that you are using to represent token types (default: int). For lexers, laType is always int (not char, because -1 is used for EOF).
matchCast: T: causes a cast to be added to all token types passed to Match. For example, if you use the matchCast: int option, it will change calls like Match('+', '-') into Match((int) '+', (int) '-'). matchCast is a synonym for matchType.
allowSwitch: bool: whether to allow switch statements (default: true). In C#, switch cases must be constants, so certain laType data types like Symbol are incompatible with switch. Therefore, this option can be used to prevent switch statements from being generated. Requires a boolean literal true or false (@true or @false in LES).
castLa: bool: whether to cast the result of LA0 and LA(i) to laType (the default is true).
The above options apply to the lexer or parser helper object, which controls code generation and defines how terminals are interpreted:
lexer mode requires numeric terminals, and allows numeric ranges like 1..31 or 'a'..'z'.
parser mode permits any literal or complex identifier, but does not support numeric ranges.
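As a rough illustration of how the code generation options are written, here is a sketch of a parser-mode grammar in EC#. It must be run through LLLPG before it can be compiled. The names TT, MyParser, Sum, and Operand are hypothetical, and the Match/LA0 API that the generated code calls is assumed to come from a base class or helper object that is not shown.

```csharp
// EC# input, not plain C#: process it with LLLPG first.
// TT, MyParser and the rule names are invented for this sketch.
enum TT { EOF, Number, Plus, Minus }

partial class MyParser
{
    // laType: la0, la1, ... are declared as TT.
    // matchCast: terminals passed to Match are cast to int.
    LLLPG (parser(laType: TT, matchCast: int))
    {
        public rule Sum @{ Operand ((TT.Plus|TT.Minus) Operand)* };
        private rule Operand @{ TT.Number };
    }
}
```

The option list uses the same name: value form as the lexer(inputSource: ..., inputClass: ...) example later on this page.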
General options
In addition to the lexer and parser options above, you can add one or more of the following attributes before the LLLPG statement:
[FullLLk(true)] or [FullLLk(false)]: enables or disables complete prediction analysis; for more information, see the appendix FullLLk versus “approximate” LL(k).
[Verbosity(int)]: prints extra messages to help debug a grammar. An integer literal is required and specifies how much detail to print: 1 for basic information, 2 for extra information, 3 for excessive information. Details printed include first sets, follow sets, and prediction trees. Note: this attribute does not work without the --verbose option.
[NoDefaultArm(true)]: adds a call to Error(...) at all branching points for which you did not provide a default or error arm (see §“Error handling mechanisms” below).
[LL(int)] (synonyms: [k(int)] and [DefaultK(int)]): specifies the default maximum number of lookahead characters or tokens in this grammar.
[AddComments(false)]: by default, a comment line is printed in the output file in front of the code generated for every Alts (branching point: | / * ?). [AddComments(false)] removes these comments.
[AddCsLineDirectives(true)]: adds #line directives to the output, in an effort to let errors in actions in the C# file point back to the EC# file. This feature doesn’t work so well, since only line numbers are translated (not column numbers), and it only works inside rules (not inside other code in your .ecs or .les file). This option is largely superseded by the #lines; macro, which can be added to the top of any .ecs file to add #line directives throughout it.
[PrematchByDefault]: if a rule is only called by other rules (not called from the outside) then “prematch analysis” can sometimes replace Match() calls with Skip() calls to improve performance. By default, this is only done for rules that are marked private, but PrematchByDefault extends this optimization to rules that have no access modifier (not public, nor private, nor protected, nor internal).
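A short sketch of how several general options might be combined in front of one LLLPG statement. This is EC# input that must be processed by LLLPG, the rule names are made up for illustration, and writing each attribute in its own bracket pair is an assumption about EC# attribute syntax.

```csharp
// EC# input, not plain C#: process it with LLLPG first.
// The rule names below are invented for this sketch.
[FullLLk(true)]       // complete prediction analysis
[Verbosity(1)]        // basic diagnostics (needs --verbose to take effect)
[AddComments(false)]  // no comment line before each branching point
LLLPG (lexer)
{
    public rule Number @{ Digits ('.' Digits)? };
    private rule Digits @{ ('0'..'9')+ };
}
```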
Setting lookahead
Pure LL(k) parsers look up to k terminals ahead to make a branching decision, and once a decision is made they stick to it; they don’t “backtrack” or try something else. So if k is too low, LLLPG will generate code that makes incorrect decisions.
LLLPG’s default k value is 2, which is enough in the majority of situations, as long as your grammar is designed to be LL(k). To increase k to X, simply add a [DefaultK(X)] attribute to the grammar (i.e. the LLLPG statement), or add a [k(X)] attribute to a single rule ([LL(X)] is a synonym). Here’s an example that represents "double-quoted" and """triple-quoted""" strings, where k=2 is not enough:
private token DQString @{
    '"' ('\\' _ | ~('"'|'\\'|'\r'|'\n'))* '"'?
};
[k(4)]
private token TQString @{
    '"' '"' '"' nongreedy(Newline / _)* '"' '"' '"'
    | "'''" nongreedy(Newline / _)* "'''"
};
[k(4)]
private token Token @{
    ( {_type = TT.Spaces;} Spaces
    ...
    | {_type = TT.String;} TQString
    | {_type = TT.String;} DQString
    ...
    )
};
Here I’ve used “_” inside both kinds of strings, meaning “match any character”, but this implies that the string can go on and on forever. To fix that, I add nongreedy meaning “exit the loop when it makes sense to do so” ([greedy and nongreedy are explained more in my blog][16].)
With only two characters of lookahead, LLLPG cannot tell whether """this""" is an empty DQString ("") or a triple-quoted TQString. Since TQString is listed first, LLLPG will always choose TQString when a Token starts with "", but of course this may be the wrong decision. You’ll also get a warning like this one:
warning : Loyc.LLParserGenerator.Macros.run_LLLPG:
Alternatives (4, 5) are ambiguous for input such as «""» (["], ["])
[k(3)] is sufficient in this case, but it’s okay if you use a number that is a little higher than necessary, so I’ve used [k(4)] here.
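The effect of k on this decision can be imitated in a few lines of plain C#. This is only a sketch of the idea, not the code LLLPG generates: Predict is a hypothetical function that looks at no more than k characters, tries TQString first (as the grammar does), and ignores the ''' form for brevity.

```csharp
using System;

static class LookaheadSketch
{
    // Chooses an alternative using at most k characters of lookahead (k >= 1).
    // TQString is tried first, mirroring its position in the Token rule.
    public static string Predict(string input, int k)
    {
        int n = Math.Min(k, 3); // TQString needs three quotes; we can only see k
        bool looksLikeTQ = input.Length >= n
            && input.Substring(0, n) == new string('"', n);
        if (looksLikeTQ) return "TQString";
        if (input.Length > 0 && input[0] == '"') return "DQString";
        return "other";
    }

    static void Main()
    {
        string emptyDQ = "\"\" + x"; // an empty DQString followed by other text
        Console.WriteLine(Predict(emptyDQ, 2));              // TQString (wrong)
        Console.WriteLine(Predict(emptyDQ, 3));              // DQString
        Console.WriteLine(Predict("\"\"\"abc\"\"\"", 3));    // TQString
    }
}
```

With k=2 the first two quotes are all the predictor can see, so the higher-priority TQString wins even for an empty double-quoted string. With k=3 the third character settles the question.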
Using LLLPG with an “external” API
You can use the inputSource and inputClass options to designate an object to which LLLPG should send all its API calls. inputClass should be the data type of the object that inputSource refers to. For example, if you specify inputSource(src), LLLPG will translate a grammar fragment like '+'|'-' into code like src.Match('+','-'). Without the inputSource option, this would have just been Match('+','-').
Loyc.Syntax.dll (included with LLLPG 1.3) provides two external API classes, LexerSource and LexerSource<C>, which are derived from BaseLexer and provide the LLLPG lexer API.
When using these options, a lexer will look something like this:
using Loyc;
using Loyc.Syntax.Lexing;
public class MyLexer {
    public MyLexer(string input, string fileName = "") {
        src = new LexerSource((UString)input, fileName);
    }
    LexerSource src;
    LLLPG (lexer(inputSource: src, inputClass: LexerSource)) {
        public rule Token() @{ Id | Spaces | Newline };
        private rule Id @{ IdStartChar (IdStartChar|'0'..'9'|'\'')* };
        private rule IdStartChar @{ 'a'..'z'|'A'..'Z'|'_' };
        private rule Spaces @{ (' '|'\t')+ };
        private rule Newline @{ ('\n' | '\r' '\n'?)
            {src.AfterNewline();} // increments LineNumber
        };
    }
}
LexerSource accepts any implementation of [ICharSource](http://ecsharp.net/doc/code/interfaceLoyc_1_1Collections_1_1ICharSource.html); ICharSource represents a source of characters with a Slice(...) method, which is used to speed up access to individual characters. If your input is simply a string S, convert the string to LexerSource using new LexerSource((UString)S); the shortcut (LexerSource)S is also provided. UString is a wrapper around string that implements the ICharSource interface (the U in UString means “unicode”; see the [documentation of UString](http://ecsharp.net/doc/code/structLoyc_1_1UString.html) for details).
See also
Grammar features
To learn about semantic and syntactic predicates (also known as zero-width assertions), => gates, the set inversion operator ~, or the underscore _ which matches any character or token, please see LLLPG Grammar Features.
Error handling
To learn about LLLPG’s error handling mechanisms, please see the article about Error Handling.