This page has moved to ecsharp.net.

LLLPG Part 1: A new parser generator for C#

7 Oct 2013 (updated 6 Mar 2016). Originally published on CodeProject.

This page is obsolete

The LLLPG manual has been reorganized. These old articles may be deleted in the future.

Introduction

LLLPG (Loyc LL(k) Parser Generator) is a recursive-decent parser generator for C#, with a feature set slightly better than ANTLR version 2.

In this article I assume you already know what parsers and lexers are; if not, click the links. If you haven’t written parsers before, article #2 will fill in your knowledge.

LLLPG is a system that I decided to create after trying to use ANTLR3’s C# module, and running into C#-specific bugs that I couldn’t overcome. Besides, I wasn’t happy with the ANTLR-generated code; I thought I could generate simpler and more efficient code. “How hard could it be to make an LL(k) parser generator?” I wondered. The answer: pretty damn hard, actually.

It took me a bit longer to make LLLPG than I intended (what, 5 years?), but… better late than never, right? While ANTLR has advanced some ways in that time period, it is still Java-centric, and I think the advantages of LLLPG still make it worth considering even if all the major C#-specific bugs in ANTLR have been fixed (I don’t know if they have or not, but the C# version still lags behind the Java version).

Typically, you will use the LLLPG Visual Studio Custom Tool (a.k.a. Single-File Generator):

LLLPG in Visual Studio

LLLPG is not a dedicated tool the way ANTLR is. Instead, LLLPG is designed to be embedded inside another programming language. While you may use LLLPG similarly to other parser generators, it’s really just a “macro” inside a programming language I’m making called Enhanced C# — one of a hundred macros that you might be using, and perhaps in the future you’ll write a macro or two yourself.

As of 2016, Enhanced C# is incomplete; only two components of it are ready (the parser, and the macro runner which is called LeMP). Hopefully though, you’ll find it fairly user-friendly and fun.

Other design elements of LLLPG include:

“Blah, blah, blah! Show me this thing already!”

Okay, let’s get started! Here’s a really simple example (EC#):

LLLPG (lexer) {
    public rule Digit @{ '0'..'9' };
    public rule Integer @{ Digit+ };
};

And the output is:

public void Digit()
{
    MatchRange('0', '9');
}
public void Integer()
{
    int la0;
    Digit();
    for (;;) {
      la0 = LA0;
      if (la0 >= '0' && la0 <= '9')
        Digit();
      else
        break;
    }
}

That’s it! So here’s some relevant facts to learn at this point:

Another example

My next example is almost useful: a simple lexer.

using System;

enum TT { 
	Integer, Identifier, Operator
}
LLLPG (lexer) {
    private token Letter @{ ('a'..'z'|'A'..'Z') };
    private token Id  @{ (Letter|'_')
                         (Letter|'_'|'0'..'9')* };
    private token Int @{ '0'..'9'+ };
    private token Op  @{ '+'|'-'|'*'|'/' };
    public token TT Token @
      {  Op  {return TT.Operator;}
      |  Id  {return TT.Identifier;}
      |  Int {return TT.Integer;}
      };
};

In this example, using System and the enum are completely standard C# code and it will be printed out unchanged (well, it’ll be reformatted. Whitespaces and comments are not preserved.) The rest is a mixture of Enhanced C# and LLLPG code.

Unlike ANTLR, which generates text output and treats actions in braces {...} like plain text, LLLPG fully parses its input, and generates output in the form of a Loyc tree, not text. An independent library is in charge of formatting that Loyc tree as C# code (I welcome volunteers to write output libraries for other languages such as C++ and Java. You won’t just be helping LLLPG itself, but the entire Loyc project! Let me know if you’re interested.)

Here’s the generated code:

using System;
enum TT
{
    Integer, Identifier, Operator
}
void Letter()
{
    Skip();
}
void Id()
{
    int la0;
    la0 = LA0;
    if (la0 >= 'A' && la0 <= 'Z' || la0 >= 'a' && la0 <= 'z')
      Letter();
    else
      Match('_');
    for (;;) {
      la0 = LA0;
      if (la0 >= 'A' && la0 <= 'Z' || la0 >= 'a' && la0 <= 'z')
        Letter();
      else if (la0 == '_')
        Skip();
      else if (la0 >= '0' && la0 <= '9')
        Skip();
      else
        break;
    }
}
void Int()
{
    int la0;
    MatchRange('0', '9');
    for (;;) {
      la0 = LA0;
      if (la0 >= '0' && la0 <= '9')
        Skip();
      else
        break;
    }
}
void Op()
{
    Skip();
}
public TT Token()
{
    int la0;
    la0 = LA0;
    switch (la0) {
    case '*':
    case '+':
    case '-':
    case '/':
      {
        Op();
        return TT.Operator;
      }
      break;
    default:
      if (la0 >= 'A' && la0 <= 'Z' || la0 == '_' ||
                la0 >= 'a' && la0 <= 'z') {
        Id();
        return TT.Identifier;
      } else {
        Int();
        return TT.Integer;
      }
      break;
    }
}

This example demonstrates some new things:

ANTLR-like syntax mode

As of LLLPG 1.7.4, an ANTLR-like syntax mode has been added that makes it easier to port grammars between ANTLR and LLLPG. This mode is used by writing code like

class MyParser : BaseParserForList {
    [/* general options */]
    LLLPG (/* code generation options */) @{
        /* ANTLR-style rules */
        myRule [ArgType argument] returns [ReturnType result] : grammar ;
    };
}

In this mode, the LLLPG block can only contain rules (not ordinary C# methods and fields).

Although the outline of each rule uses an ANTLR-like syntax, there are still substantial differences:

Also note:

If you’re only interested in the parser generator, please skip this section, because right now I’d like to discuss the fancy technology that LLLPG is built on. In fact, you can skip most of the rest of the article and go straight to part 2.

The LeMP processing model

If you’re only interested in the parser generator, please skip this section, because right now I’d like to discuss the fancy technology that LLLPG is built on. In fact, you can skip most of the rest of the article and go straight to part 2.

So here’s the deal. I designed a language called Enhanced C#. It’s supposed to be about 99.9% backward compatible with C#, and the parser is about 95% complete (LINQ support is missing, but C# 6 syntax is available.) There is no EC# compiler yet, but there is a parser and a printer; so you use the parser + LeMP + printer and feed the output to the plain C# compiler. With a few lines of code, you can parse a block of EC# code and print it out again:

using (LNode.PushPrinter(Ecs.EcsNodePrinter.Printer))
using (ParsingService.PushCurrent(Ecs.Parser.EcsLanguageService.Value))
{
    string input = @"{ your_code_here(Any_EC#_code_will_work); }";
    LNode code = ParsingService.Current.ParseSingle(input, MessageSink.Console);
    string output = code.ToString();
}

Since EC# is a superset of C#, LLLPG is able to produce C# code by using the EC# printer, as long as it only uses the C# subset of the language.

Earlier you saw a screen shot of LINQPad in which the input to LLLPG was LES code. LES is not a programming language, it is just a syntax and nothing else. One of the core ideas of the Loyc project is to “modularize” programming languages into a series of re-usable components. So instead of writing one big compiler for a language, a compiler is built by mixing and matching components. One of those components is the Loyc tree (the LNode class in Loyc.Syntax.dll). Another component is the LES parser (which is a text representation for Loyc trees). A third component is the EC# parser, a fourth component is the EC# printer, and a fifth component is LeMP, the macro processor.

Macros

Macros are a fundamental feature of LISP that I am porting over to the wider world of non-LISP languages.

A macro (in the LISP sense, not in the C/C++ sense) is simply a method that takes a syntax tree as input, and produces another syntax tree as output. Here’s an example of a macro:

[LexicalMacro("Name::Type",
  "Defines a variable or field in the current scope.", "#::")]
public static LNode ColonColon(LNode node, IMacroContext sink)
{
    var a = node.Args;
    if (a.Count == 2) {
        LNode r = node.With(CodeSymbols.Var, a[1], a[0]);
        r.BaseStyle = NodeStyle.Statement;
        return r;
    }
    return null;
}

This macro is part of the “LES Prelude class” for LeMP. Its job is to take a LES “variable declaration” such as x::Foo and change it into a different tree that the C# language service understands: #var(Foo, x), which represents Foo x;.

The input, x::Foo, is represented as a call to #:: with two arguments, x and Foo. ColonColon() is designed to transform the call to “#::”. It checks that there are two arguments, and swaps them while changing the call target from #:: to S.Var, which is an alias for #var. The C# node printer considers #var(type, name) to represent a variable declaration, and prints it with the more familiar syntax “type name”.

The point is, LLLPG is defined as a “macro” that takes your LLLPG (lexer) { ... }; or LLLPG (parser) { ... }; statement as input, and returns another syntax tree that represents C# source code. As a macro, it can live in harmony with other macros like the ColonColon macro.

Bootstrapping LLLPG

In order to allow LLLPG to support EC#, I needed a EC# parser. But how would I create a parser for EC#? Obviously, I wanted to use LLLPG to write the parser, but without any parser there was no easy way to submit a grammar to LLLPG! After writing the LLLPG core engine and the EC# printer, here’s what I did to create the EC# parser:

  1. I used C# operator overloading and helper methods as a rudimentary way to write LLLPG parsers in plain C# (example test suite).
  2. Writing parsers this way is very clumsy, so I decided that I couldn’t write the entire EC# parser this way. Instead, I designed a new language that is syntactically much simpler than EC#, called LES. This language would serve not only as the original input language for LLLPG, but as a general syntax tree interchange format—a way to represent syntax trees of any programming language: “xml for code”.
  3. I wrote a lexer and parser for LES by calling LLLPG programmatically in plain C# (operator overloading etc.)
  4. I wrote the MacroProcessor (which I later named “LeMP”, short for “Lexical Macro Processor”) and a wrapper class called Compiler that provides the command-line interface. MacroProcessor’s job is to scan through a syntax tree looking for calls to “macros”, which are source code transformers (more on that below). It calls those transformers recursively until there are no macro calls left in the code. Finally, Compiler prints the result as text.
  5. I built a small “macro language” on top of LES which combines LeMP (the macro processor) with a set of small macros that makes LES look a lot like C#. The macros are designed to convert LES to C# (you can write C# syntax trees directly in LES, but they are a bit ugly.)
  6. I wrote some additional macros that allow you to invoke LLLPG from within LES.
  7. I hacked the LES parser to also be able to parse LLLPG code like @{ a* | b } in a derived class (a shameful abuse of “reusable code”).
  8. I wrote a lexer and parser for LES in LES itself.
  9. I published this article in (Oct 2013).
  10. I wrote the lexer and parser of EC# in LES, in the process uncovering a bunch of new bugs in LLLPG (which I fixed)
  11. I added one extra macro to support LLLPG in EC#.
  12. Finally, to test the EC# parser, I rewrote the grammar of LLLPG in EC#.
  13. I updated this article (Feb 2014).

At long last the bootstrapping is complete, so you can write LLLPG parsers in EC#!

LLLPG’s input languages: EC# & LES

Enhanced C# (EC#)

As I mentioned, Enhanced C# is a language based on C# whose compiler doesn’t exist yet (I’m looking for volunteers to help.) The parser does exist, though, so I can talk about some of the new syntax that EC# supports. Actually there is quite a bit of new syntax in EC#; let me just tell you about the syntax that is relevant to LLLPG.

Token literals

Loyc.Syntax.Lexing.TokenTree eightTokens = @{
    This token tree has eight children
    (specifically, six identifiers and two parentheses.
     The tokens inside the parentheses are children of the opening '('.)
};

LLLPG is a “domain-specific language” or DSL, which means it’s a special-purpose language (for creating parsers).

Token trees are a technique for allowing DSLs (Domain-Specific Languages) without any need for “dynamic syntax” in the host language. Unlike some “extensible” languages, EC# and LES do not have “extensible” parsers: you cannot add new syntax to them. However, EC# and LES do have token literals, which are collections of tokens. After a source file has been parsed, a macro can run its own parser on a token tree that is embedded in the larger syntax tree. That’s what LLLPG does.

EC# allows token trees in any location where an expression is allowed. It also allows you to use a token tree instead of a method body, or instead of a property body. So when the EC# parser encounters statements like these:

rule Bar @{ "Bar" };
rule int Foo(int x) @{ 'F' 'o' 'o' {return x;} };

The parser actually sees these as property or method declarations. LLLPG’s ECSharpRule macro then transforms them into a different form, shown here:

#rule(void, Foo, (), "Bar");
#rule(int, Foo, (#var(int, x),), ('F', 'o', 'o', { return x; }));

(A separate macro recognizes the LES-specific syntax for rules and produces this common form.)

The main LLLPG macro is in charge of turning this into the final output:

void Bar()
{
  Match('B');
  Match('a');
  Match('r');
}
int Foo(int x)
{
  Match('F');
  Match('o');
  Match('o');
  return x;
}

Note: you may also see token literals with square brackets @[...]. This means the same thing as @{...}; there are two syntaxes for a historical reason, as explained in the next article.

Block-call statements

get { return _foo; }
set { _foo = value; }

In C#, you may have noticed that “get” and “set” are not keywords. Not unless they are inside a property declaration, that is. In C#, they “become” keywords based on their location.

EC# generalizes this concept. In EC#, get and set are never keywords no matter where they appear. Instead, get {...} and set {...} are merely examples of a new kind of statement, the block-call statement, which has two forms:

identifier { statements; }                  identifier (argument, list) { statements; }

These statements are considered exactly equivalent to method calls of the following form:

identifier({ statements; });                 identifier(argument, list, { statements; });

So the LLLPG block:

LLLPG (parser(laType(TokenType), matchType(int))) {
   rule Foo @{ ... };
}

Is a block-call statement, equivalent to

LLLPG (parser(laType(TokenType), matchType(int)), {
   rule Foo @{...};
});

Blocks as expressions

string fraction = "0.155";
float percent = 100 * { if (!float.TryParse(fraction, out float tmp)) tmp = -1; tmp };

In EC#, {braced blocks} can be used as expressions, which explains what a method call like foo(x, { y(z) }) means. When a block is used inside an expression, like this, the final statement in the block becomes the return value of the block as a whole, when there is no semicolon on the end of the final statement. In this case, tmp is the result of the block, so percent will be set to 15.5. Or rather, it will be set to 15.5 after the EC# compiler is written. Until then, this is merely an interesting but useless syntax tree.

In the case of a statement like

LLLPG (parser(laType(TokenType), matchType(int)), { rule Foo @{...}; });

Everything in parenthesis is passed to a macro belonging to LLLPG, which (to make a long story short) transforms it into C# code.

That’s enough information to understand how LLLPG works. Hopefully now you understand the concept of LLLPG as a DSL embedded in EC#.

Loyc Expression Syntax (LES)

When LLLPG was first released, you had to write parsers in LES since the EC# parser hadn’t been written, so a large section of this article was devoted to LES - a language that is very comparable to Enhanced C#, but dramatically easier to parse.

If you’d like to see how you can write “C# code” in LES, I’ve separated that information into a separate page.

All LES-based parsers now need to have the following first line:

import macros LeMP.Prelude.Les;

Originally this was not required because the LES “prelude” macros were imported automatically. However, the LES prelude could potentially interfere with normal C# code, so it is no longer imported automatically (the macro compiler doesn’t know anything about the input language, so it is unaware of whether it should import the macros or not).

Wrapping up

Obviously, I’ve just scratched the surface here, but this is almost enough for an introduction. As a next step, I suggest trying out the demo lexer and parsers included with LLLPG.

To download it or read more articles about LLLPG, return to the home page. Here’s a list of topics that are covered in future articles:

So that’s it! Hope you like my parser generator, folks, and I’ll be happy to answer your questions. But you probably don’t have questions. Because I pretty much covered everything.

History

See version history for a more detailed history.