2. Simple examples

30 May 2016

Here’s a simple “scanner” in LLLPG - it’s a scanner, in the sense that it scans the input and detects errors but doesn’t produce any output:

  LLLPG (lexer)
  {
      public rule Integer @{ Digit+ };
      public rule Digit @{ '0'..'9' };
  };

Here I used the original property-like notation for the rules. In LLLPG 1.7.5, an ANTLR-style notation was added, which looks like this:

  LLLPG (lexer)
  @{
      public Digit : '0'..'9';
      public Integer : Digit+;
  };

Most of this manual uses the first style. Either way, the output looks like this:

  public void Digit()
  {
      MatchRange('0', '9');
  }
  public void Integer()
  {
      int la0;
      Digit();
      for (;;) {
        la0 = LA0;
        if (la0 >= '0' && la0 <= '9')
          Digit();
        else
          break;
      }
  }

That’s it! So here’s some relevant facts to learn at this point:

A simple lexer

My next example is almost useful. Now, somehow you need to provide the helper methods like MatchRange to LLLPG, and you can do that either with a special base class (BaseLexer in the Loyc.Syntax.dll runtime library) or a helper object (LexerSource, which is available as a standalone version, or in Loyc.Syntax.dll). The following example uses the latter:

using System;
using Loyc.Syntax.Lexing;

enum TT
{ 
  Integer, Identifier, Operator
}
class MyLexer
{
  LexerSource _src;
  public MyLexer(string input) { _src = (LexerSource)input; }
  
  public Token NextToken()
  {
    int startPosition = InputPosition;
    TT tokenType = Token();
    int length = InputPosition - _startIndex;
    string text = _src.CharSource.Substring(_startIndex, length);
    return new Token((int)tokenType, startPosition, length, text);
  }

  LLLPG (lexer(inputSource: _src, inputClass: LexerSource)) @{
    Token returns [TT result]
      :  Op  {$result = TT.Operator;}
      |  Id  {$result = TT.Identifier;}
      |  Int {$result = TT.Integer;}
      ;
      
    private token Letter : ('a'..'z'|'A'..'Z');
    private token Id :     (Letter|'_')
                           (Letter|'_'|'0'..'9')*;
    private token Int:     '0'..'9'+;
    private token Op:      '+'|'-'|'*'|'/';
  };
}

This example uses inputSource: _src, inputClass: LexerSource to tell LLLPG where the helper methods are (in the _src object and the LexerSource class).

Unlike ANTLR, which generates text output and treats actions in braces {...} like plain text, LeMP (which comes with LLLPG) fully parses its input, and generates output in the form of a Loyc tree, not text. A separate library is in charge of formatting that Loyc tree as C# code (I welcome volunteers to write output libraries for other languages such as C++ and Java. You won’t just be helping LLLPG itself, but the entire Loyc project! Let me know if you’re interested.)

The output code includes the plain C# code such as the MyLexer constructor and the NextToken method, as well as the output of LLLPG. LLLPG’s output, which becomes part of te MyLexer class, looks like this:

TT Token()
{
    int la0;
    TT result = default(TT);
    // Line 25: ( Op | Id | Int )
    la0 = _src.LA0;
    switch (la0) {
    case '*':
    case '+':
    case '-':
    case '/':
        {
            Op();
            #line 25 "Untitled.ecs"
            result = TT.Operator;
            #line default
        }
        break;
    default:
        if (la0 >= 'A' && la0 <= 'Z' || la0 == '_' || la0 >= 'a' && la0 <= 'z') {
            Id();
            #line 26 "Untitled.ecs"
            result = TT.Identifier;
            #line default
        } else {
            Int();
            #line 27 "Untitled.ecs"
            result = TT.Integer;
            #line default
        }
        break;
    }
    return result;
}
void Letter()
{
    _src.Skip();
}
void Id()
{
    int la0;
    // Line 31: (Letter | [_])
    la0 = _src.LA0;
    if (la0 >= 'A' && la0 <= 'Z' || la0 >= 'a' && la0 <= 'z')
        Letter();
    else
        _src.Match('_');
    // Line 32: ( Letter | [_] | [0-9] )*
    for (;;) {
        la0 = _src.LA0;
        if (la0 >= 'A' && la0 <= 'Z' || la0 >= 'a' && la0 <= 'z')
            Letter();
        else if (la0 == '_')
            _src.Skip();
        else if (la0 >= '0' && la0 <= '9')
            _src.Skip();
        else
            break;
    }
}
void Int()
{
    int la0;
    _src.MatchRange('0', '9');
    // Line 33: ([0-9])*
    for (;;) {
        la0 = _src.LA0;
        if (la0 >= '0' && la0 <= '9')
            _src.Skip();
        else
            break;
    }
}
void Op()
{
    _src.Skip();
}

This example demonstrates some new things:

By the way, this example uses the ANTLR-style syntax mode again. However, most of this manual uses the original syntax because most of the text was written before the ANTLR mode existed.

Think twice: Do you really need a parser generator?

One of the most common introductory examples for any parser generator is an expression parser or calculator, maybe something along these lines:

const int id = 0, num = 1; // other token types: '(', ')', '-', '+', etc.

LLLPG (parser(TerminalType: Token))
@{
   Atom returns [double result]
        : id              { $result = Vars[(string) $id.Value]; }
        | num             { $result = (double) $num.Value; }
        | '-' result=Atom { $result = -$result; }
        | '(' result=Expr ')'
        | error           { $result = 0;
          Error(0, "Expected identifer, number, or (stuff)"); }
        ;
    MulExpr returns [double result] :
        result:Atom
        (op:=('*'|'/') rhs:=Atom { $result = Do(result, op, rhs); })*
        ;
    AddExpr returns [double result] :
        result:MulExpr
        (op:=('+'|'-') rhs:=MulExpr { $result = Do(result, op, rhs); })*
        { return result; }
        ;
    Expr returns [double result]
        : t:=id '=' result:Expr { Vars[t.Value.ToString()] = $result; }
        | result:AddExpr
        ;
};

double Do(double left, Token op, double right)
{
    switch (op.Type) {
        case '+': return left + right;
        case '-': return left - right;
        case '*': return left * right;
        case '/': return left / right;
    }
    return double.NaN;
}

But if expression parsing is all you need, you really don’t need a parser generator. For example, you can use the LES parser in LoycCore, which is great for parsing simple expressions. Or you could use a Pratt Parser like this one. If you only need to parse simple text fields like phone numbers, you can use regular expressions. And even if you need an entire programming language, you don’t necessarily need to create your own; again, in many cases the LES parser is perfectly sufficient.

So before you go writing a parser, especially if it’s for something important rather than “for fun”, seriously consider whether an existing parser would be good enough.

Next up

This concludes the introductory material. From here, you can choose where to go next:

For a complete list of articles, please visit the home page.