Enhanced C#
Language of your choice: library documentation
Loyc.Syntax.Lexing.IndentTokenGenerator< Token > Class Template Reference (abstract)

A preprocessor usually inserted between the lexer and parser that inserts "indent", "dedent", and "end-of-line" tokens at appropriate places in a token stream.


Inheritance diagram for Loyc.Syntax.Lexing.IndentTokenGenerator< Token >:
Loyc.Syntax.Lexing.LexerWrapper< Token >, Loyc.Syntax.Lexing.ILexer< Token >, Loyc.Syntax.IIndexToLine

Remarks

A preprocessor usually inserted between the lexer and parser that inserts "indent", "dedent", and "end-of-line" tokens at appropriate places in a token stream.

This class will not work correctly if the lexer does not implement ILexer<T>.IndentLevel properly.

This class is abstract because it doesn't know how to classify or create tokens. The derived class must implement GetTokenCategory, MakeEndOfLineToken, MakeIndentToken and MakeDedentToken. IndentTokenGenerator is a non-abstract version of this class based on Loyc.Syntax.Lexing.Token structures, with several properties that can be customized.

Creation of indent, dedent, and end-of-line tokens can be suppressed inside brackets, i.e. () [] {}. This is accomplished by recognizing brackets inside your implementation of GetTokenCategory.
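The bracket rule can be pictured with a small standalone sketch. It is written in Python rather than C#, it does not use the Loyc API, and every name in it is hypothetical: `categorize` plays the role of GetTokenCategory, and a newline only counts when the bracket depth is zero.

```python
from enum import Enum, auto

class Category(Enum):            # hypothetical stand-in for TokenCategory
    OTHER = auto()
    OPEN_BRACKET = auto()
    CLOSE_BRACKET = auto()

OPENERS = {"(", "[", "{"}
CLOSERS = {")", "]", "}"}

def categorize(tok):
    """Plays the role of GetTokenCategory for this sketch."""
    if tok in OPENERS:
        return Category.OPEN_BRACKET
    if tok in CLOSERS:
        return Category.CLOSE_BRACKET
    return Category.OTHER

def significant_newlines(tokens):
    """Yield the index of each "\n" token that lies outside all brackets."""
    depth = 0
    for i, tok in enumerate(tokens):
        if tok == "\n":
            if depth == 0:
                yield i          # only here would EOL/indent/dedent be considered
            continue
        cat = categorize(tok)
        if cat is Category.OPEN_BRACKET:
            depth += 1
        elif cat is Category.CLOSE_BRACKET:
            depth = max(0, depth - 1)   # tolerate a stray closing bracket
```

With the tokens `s = ( a \n + b ) \n`, only the final newline is reported, because the first one is inside parentheses.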

TokensToTree can be placed in the pipeline before or after this class; if it is placed afterward, anything between Indent and Dedent tokens will be made a child of the Indent token.

Note: whitespace tokens (TokenCategory.Whitespace) are passed through and otherwise unprocessed.

Note: EOL tokens are not generated for empty or comment lines, or after a generated indent token. They can be generated after an indent token that was already present in the token stream, unless that token is categorized as TokenCategory.OpenBracket.

Partial dedents and unexpected indents, as in

if Condition:
    print("Hello")
  print("Hello again")
else:
  print("Goodbye")
    print("Goodbye again")

will cause an error message to be written to the ILexer<Token>.ErrorSink of the original lexer.
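How such problems can be detected is sketched below. This is a simplified Python illustration, not the Loyc implementation: it works on raw lines instead of tokens, treats a trailing colon as the only indent trigger, counts spaces only, and the names `check_indents` and `SAMPLE` are made up for the example.

```python
SAMPLE = [
    'if Condition:',
    '    print("Hello")',
    '  print("Hello again")',      # partial dedent: 2 is not an enclosing width
    'else:',
    '  print("Goodbye")',
    '    print("Goodbye again")',  # indent without a preceding trigger
]

def check_indents(lines):
    """Return (line_number, message) pairs for indentation problems."""
    stack = [0]              # indentation widths of enclosing blocks
    errors = []
    expect_indent = False    # True right after a line that ends with ":"
    for n, line in enumerate(lines, start=1):
        text = line.strip()
        if not text:
            continue         # blank lines never affect indentation
        width = len(line) - len(line.lstrip(" "))
        if width > stack[-1]:
            if not expect_indent:
                errors.append((n, "unexpected indent"))
            stack.append(width)          # later lines may keep this width
        elif width < stack[-1]:
            while width < stack[-1]:
                stack.pop()
            if width != stack[-1]:
                errors.append((n, "partial dedent"))
                stack.append(width)      # recover by accepting the new width
        expect_indent = text.endswith(":")
    return errors
```

For `SAMPLE`, the function reports a partial dedent on line 3 and an unexpected indent on line 6. After reporting, the sketch accepts the new width so that following lines at the same width are not reported again.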

Please see IndentTokenGenerator for additional remarks and examples.

Suppose you use an IndentToken and DedentToken that are equal to the token types you've chosen for { braces } (e.g. TokenKind.LBrace and TokenKind.RBrace), the only indent trigger is a colon (:), and you set EolToken to the token type you're using for semicolons. Then the token stream from input such as

def Sqrt(value):
  if value == 0: return 0
  g = 0; bshft = Log2Floor(value) >> 1;
  b = 1 << bshft
  do:
    temp = (g + g + b) << bshft
    if value >= temp: g += b
      value -= temp
    b >>= 1
  while (bshft-- > 0)
  return g

will be converted to a token stream equivalent to

def Sqrt(value): {
  if value == 0: { return 0;
  } g = 0; bshft = Log2Floor(value) >> 1;
  b = 1 << bshft;
  do: {
    temp = (g + g + b) << bshft;
    if value >= temp: { g += b;
      value -= temp;
    } b >>= 1;
  } while (bshft-- > 0);
  return g;
}

That is, a semicolon is added to lines that don't already have one, open braces are inserted right after colons, and semicolons are not added right after opening braces.

If multiple indents occur on a single line, as in

if x: if y:
  Foo(x, y)

The output will be like this:

if x: { if y: {
  Foo(x, y);
}}
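The conversions shown above can be imitated with a short standalone sketch. It is Python, not C#, and it is not how IndentTokenGenerator is implemented: the regex lexer, the function name `add_braces`, and the line-based processing are all assumptions made for illustration. It also ignores brackets, strings, and comments, which the real class handles through GetTokenCategory.

```python
import re

# Words, a few multi-character operators, or any single punctuation character.
TOKEN_RE = re.compile(r"\w+|<<|>>=?|--|[-+]=|[<>=!]=|[^\w\s]")

def add_braces(source):
    """Rewrite colon/indentation blocks as explicit "{", "}" and ";" tokens.

    Returns one token list per non-blank input line, plus a final list of
    closing braces if blocks are still open at the end of the input.
    """
    result = []
    open_blocks = []        # indent of the line on which each "{" was opened
    for line in source.splitlines():
        tokens = TOKEN_RE.findall(line)
        if not tokens:
            continue        # blank lines produce no tokens at all
        indent = len(line) - len(line.lstrip(" "))
        out = []
        # Dedent: close each block whose opening line is indented at least this much.
        while open_blocks and indent <= open_blocks[-1]:
            open_blocks.pop()
            out.append("}")
        for tok in tokens:
            out.append(tok)
            if tok == ":":  # the only indent trigger in this sketch
                out.append("{")
                open_blocks.append(indent)
        # End of line: add ";" unless the line already ends with ";"
        # or with a "{" that was just generated.
        if out[-1] not in (";", "{"):
            out.append(";")
        result.append(out)
    if open_blocks:
        result.append(["}"] * len(open_blocks))
    return result

for line in add_braces("if x: if y:\n  Foo(x, y)\n"):
    print(" ".join(line))
# if x : { if y : {
# Foo ( x , y ) ;
# } }
```

Recording the indent of the line that opened each block, rather than the indent of the block body, is what lets a one-line block such as `if value == 0: return 0` be closed at the start of the next line.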

Configuration for Python

Newlines generally represent the end of a statement, while colons mark places where a "child" block is expected. Inside parentheses, square brackets, or braces, newlines are ignored:

s = ("this is a pretty long string that I'd like "
  + " to continue writing on the next line")

And, inside brackets, indentation is ignored, so this is allowed:

if foo:
    s = ("this is a pretty long string that I'd like "
  + " to continue writing on the next line")
    print(s)

Note that if you don't use brackets, Python 3 doesn't try to figure out if you "really" meant to continue a statement on the next line:

# SyntaxError after '+': invalid syntax
s = "this is a pretty long string that I'd like " +
    " to continue writing on the next line"

Thus OpenBrackets and CloseBrackets should be ( [ { and ) ] }, respectively. IndentToken and DedentToken should be synthetic Indent and Dedent tokens, since curly braces have a different meaning (they define a dictionary).

In Python, it appears you can't write two "block" statements on one line, as in this example:

if True: if True: print() # SyntaxError: invalid syntax

You're also not allowed to indent the next line if the block statement on the current line is followed by another statement:

if True: print('a')
    print('b') # IndentationError: unexpected indent

But you can switch style in different branches:

if True:
    print("t")
else: print("f")
try: print("t")
except:
    print("e")

Also, although you can normally separate statements with semicolons:

print("hell", end=""); print("o")

You are not allowed to write this:

print("?"); if True: # SyntaxError: invalid syntax
    print("t")

Considering these three facts, I would say that the colon should be classified as an EOL indent trigger (EolIndentTriggers), and the parser should

  1. recognize non-block statements separately from block statements,
  2. expect a colon to be followed by either an indented block or a non-block statement, but
  3. recognize a non-block "statement" as a list of statements separated by semicolons, with an optional semicolon at the end.
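The three points above can be sketched as a tiny recursive-descent parser. This is a Python illustration under loud assumptions: tokens are plain strings, "INDENT", "DEDENT" and "EOL" are hypothetical names for the generated tokens, every other non-structural string stands for one whole simple statement or header item, and an indented block ends with DEDENT without a following EOL.

```python
STRUCTURAL = {":", ";", "INDENT", "DEDENT", "EOL"}
BLOCK_KEYWORDS = {"if", "else", "while", "try", "except"}   # illustrative subset

class ParseError(Exception):
    pass

def peek(toks, i):
    return toks[i] if i < len(toks) else None

def parse_simple_list(toks, i):
    """Point 3: simple statements separated by ";", optional trailing ";", then EOL."""
    items = []
    while True:
        tok = peek(toks, i)
        if tok is None or tok in STRUCTURAL or tok in BLOCK_KEYWORDS:
            raise ParseError(f"expected a simple statement at token {i}")
        items.append(tok)
        i += 1
        if peek(toks, i) != ";":
            break
        i += 1                          # consume ";"
        if peek(toks, i) == "EOL":      # a trailing ";" is allowed
            break
    if peek(toks, i) != "EOL":
        raise ParseError(f"expected end of line at token {i}")
    return ("simple", items), i + 1

def parse_block(toks, i):
    """Point 2: keyword, header, ":", then an indented block or a simple list."""
    keyword, i = toks[i], i + 1
    header = []
    while peek(toks, i) not in (":", None):
        header.append(toks[i])
        i += 1
    if peek(toks, i) is None:
        raise ParseError("expected ':'")
    i += 1                              # consume ":"
    if peek(toks, i) == "INDENT":
        body, i = parse_statements(toks, i + 1, stop="DEDENT")
        if not body or peek(toks, i) != "DEDENT":
            raise ParseError("malformed indented block")
        i += 1                          # consume DEDENT
    else:
        stmt, i = parse_simple_list(toks, i)
        body = [stmt]
    return ("block", keyword, header, body), i

def parse_statements(toks, i=0, stop=None):
    """Point 1: dispatch on whether a statement starts with a block keyword."""
    stmts = []
    while peek(toks, i) not in (None, stop):
        parse = parse_block if toks[i] in BLOCK_KEYWORDS else parse_simple_list
        stmt, i = parse(toks, i)
        stmts.append(stmt)
    return stmts, i
```

Because a simple-statement list refuses block keywords, both `if True: if True: ...` and `print("?"); if True:` are rejected, which matches the Python behavior described above.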

Now, Python doesn't allow an empty block; a block that does nothing must contain a pass statement. For example, this is an error:

if cond: # "do nothing"
return # IndentationError: expected an indented block

I'm inclined to treat this as a special case to be detected in the parser. And although you can write a semicolon on a line by itself, you can't write any of these lines:

if cond: ; # SyntaxError: invalid syntax
print(); ; print() # SyntaxError: invalid syntax
; ; # SyntaxError: invalid syntax

My interpretation is that a semicolon by itself is treated as a block statement (i.e. illegal in a non-block statement context). Since a semicolon is not treated the same way as a newline, the EolToken should be a special token, not a semicolon.

See also
IndentTokenGenerator<Token>

Properties

int BracketDepth [get]
 
int CurrentIndent [get]
 
IListSource< int > OuterIndents [get]
 
int[] AllIndentTriggers [get, set]
 
int[] EolIndentTriggers [get, set]
 
Token EolToken [get, set]
 Gets or sets the prototype token for end-statement (a.k.a. end-of-line) markers, cast to an integer as required by Token. Use null to avoid generating such markers.

Token IndentToken [get, set]
 Gets or sets the prototype token for indentation markers.

Token DedentToken [get, set]
 Gets or sets the prototype token for unindentation markers.
 
- Properties inherited from Loyc.Syntax.Lexing.LexerWrapper< Token >
ILexer< Token > Lexer [get, set]

ISourceFile SourceFile [get]

virtual IMessageSink ErrorSink [get, set]

int IndentLevel [get]

UString IndentString [get]

int LineNumber [get]

int InputPosition [get]

string FileName [get]

- Properties inherited from Loyc.Syntax.Lexing.ILexer< Token >
ISourceFile SourceFile [get]
 The file being lexed.

IMessageSink ErrorSink [get, set]
 Event handler for errors.

int IndentLevel [get]
 Indentation level of the current line. This is updated after scanning the leading whitespace on a new line, and may be reset to zero when NextToken() returns a newline.

UString IndentString [get]
 Gets a string slice that holds the spaces or tabs that were used to indent the current line.

int LineNumber [get]
 Current line number (1 for the first line).

int InputPosition [get]
 Current input position (an index into SourceFile.Text).

- Properties inherited from Loyc.Syntax.IIndexToLine
string FileName [get]
 Gets the file name used in results returned by IndexToLine(int).
 

Public Member Functions

 IndentTokenGenerator (ILexer< Token > lexer)
 Initializes the indent detector.
 
abstract TokenCategory GetTokenCategory (Token token)
 Gets the category of a token for the purposes of indent processing.

override void Reset ()

override Maybe< Token > NextToken ()
 Returns the next (postprocessed) token. This method should set the _current field to the returned value.

 IndentTokenGenerator (ILexer< Token > lexer, int[] allIndentTriggers, Token? eolToken, Token indentToken, Token dedentToken)
 Initializes the indent detector.

 IndentTokenGenerator (ILexer< Token > lexer, int[] allIndentTriggers, Token? eolToken)

override TokenCategory GetTokenCategory (Token token)

- Public Member Functions inherited from Loyc.Syntax.Lexing.LexerWrapper< Token >
 LexerWrapper (ILexer< Token > sourceLexer)

SourcePos IndexToLine (int index)
 Returns the position in a source file of the specified index.
 

Protected Member Functions

abstract Maybe< Token > MakeIndentToken (Token indentTrigger, ref Maybe< Token > tokenAfterward, bool newlineAfter)
 Returns a token to represent indentation, or null to suppress generating an indent-dedent pair at this point.

abstract IEnumerator< Token > MakeDedentToken (Token tokenBeforeNewline, ref Maybe< Token > tokenAfterNewline)
 Returns token(s) to represent un-indentation.

abstract Maybe< Token > MakeEndOfLineToken (Token tokenBeforeNewline, ref Maybe< Token > tokenAfterNewline, int? deltaIndent)
 Returns a token to represent the end of a line, or null to avoid generating such a token.

virtual bool IndentChangedUnexpectedly (Token tokenBeforeNewline, ref Maybe< Token > tokenAfterNewline, ref int deltaIndent)
 A method that is called when the indent level changed without a corresponding indent trigger.

virtual object IndexToMsgContext (Token token)
 Gets the context for use in error messages, which by convention is a SourceRange.

virtual void CheckForIndentStyleMismatch (UString indent1, UString indent2, Token next)

bool Contains (int[] list, int item)

override Maybe< Token > MakeIndentToken (Token indentTrigger, ref Maybe< Token > tokenAfterward, bool newlineAfter)

override IEnumerator< Token > MakeDedentToken (Token tokenBeforeDedent, ref Maybe< Token > tokenAfterDedent)

override Maybe< Token > MakeEndOfLineToken (Token tokenBeforeNewline, ref Maybe< Token > tokenAfterNewline, int? deltaIndent)
 
- Protected Member Functions inherited from Loyc.Syntax.Lexing.LexerWrapper< Token >
void WriteError (int index, string msg, params object[] args)
 

Additional Inherited Members

- Protected fields inherited from Loyc.Syntax.Lexing.LexerWrapper< Token >
Maybe< Token > _current
 

Constructor & Destructor Documentation

Loyc.Syntax.Lexing.IndentTokenGenerator< Token >.IndentTokenGenerator ( ILexer< Token > lexer )

Initializes the indent detector.

Parameters
lexer: Original lexer (either a raw lexer or an instance of another preprocessor such as TokensToTree).

References Loyc.Syntax.Lexing.Other.

Loyc.Syntax.Lexing.IndentTokenGenerator< Token >.IndentTokenGenerator ( ILexer< Token > lexer,
int[]  allIndentTriggers,
Token? eolToken,
Token  indentToken,
Token  dedentToken 
)
inline

Initializes the indent detector.

Parameters
lexer: Original lexer.
allIndentTriggers: A list of all token types that could trigger the insertion of an indentation token.
eolToken: Prototype token for end-statement markers inserted when newlines are encountered, or null to avoid generating such markers.
indentToken: Prototype token for indentation markers.
dedentToken: Prototype token for un-indent markers.

Member Function Documentation

abstract TokenCategory Loyc.Syntax.Lexing.IndentTokenGenerator< Token >.GetTokenCategory ( Token  token)
pure virtual

Gets the category of a token for the purposes of indent processing.

virtual bool Loyc.Syntax.Lexing.IndentTokenGenerator< Token >.IndentChangedUnexpectedly ( Token  tokenBeforeNewline,
ref Maybe< Token > tokenAfterNewline,
ref int  deltaIndent 
)
inline protected virtual

A method that is called when the indent level changed without a corresponding indent trigger.

Parameters
tokenBeforeNewline: Final non-whitespace token before the newline.
tokenAfterNewline: First non-whitespace token after the newline. Though it's a Maybe<T>, it always has a value, but this function can suppress its emission by setting it to NoValue.Value.
deltaIndent: Amount of unexpected indentation (positive or negative). On return, this parameter holds the amount by which to change the CurrentIndent; the default implementation leaves this value unchanged, which means that subsequent lines will be expected to be indented by the same (unexpected) amount.
Returns
true if MakeEndOfLineToken should be called as usual, or false to suppress EOL generation. EOL can only be suppressed in case of an unexpected indent (deltaIndent>0), not an unindent.

The default implementation always returns true. It normally writes an error message, but switches to a warning in case OuterIndents[OuterIndents.Count-1] == OuterIndents[OuterIndents.Count-2], which this class interprets as a single unindent.

virtual object Loyc.Syntax.Lexing.IndentTokenGenerator< Token >.IndexToMsgContext ( Token  token)
inline protected virtual

Gets the context for use in error messages, which by convention is a SourceRange.

The base class uses Lexer.InputPosition as a fallback if the token doesn't implement ISimpleToken<int>.

References Loyc.UString.Substring().

abstract IEnumerator<Token> Loyc.Syntax.Lexing.IndentTokenGenerator< Token >.MakeDedentToken ( Token  tokenBeforeNewline,
ref Maybe< Token > tokenAfterNewline 
)
protected pure virtual

Returns token(s) to represent un-indentation.

Parameters
tokenBeforeNewline: The last non-whitespace token before the dedent.
tokenAfterNewline: The first non-whitespace un-indented token after the unindent, or NoValue at the end of the file. The derived class is allowed to change this token, or delete it by changing it to NoValue.

This class considers the indented block to be "over" even if this method returns no tokens.

abstract Maybe<Token> Loyc.Syntax.Lexing.IndentTokenGenerator< Token >.MakeEndOfLineToken ( Token  tokenBeforeNewline,
ref Maybe< Token > tokenAfterNewline,
int?  deltaIndent 
)
protected pure virtual

Returns a token to represent the end of a line, or null to avoid generating such a token.

Parameters
tokenBeforeNewline: Final non-whitespace token before the newline was encountered.
tokenAfterNewline: First non-whitespace token after the newline.
deltaIndent: Change of indentation after the newline, or null if a dedent token is about to be inserted after the newline.

This function is also called at end-of-file, unless there are no tokens in the file.

abstract Maybe<Token> Loyc.Syntax.Lexing.IndentTokenGenerator< Token >.MakeIndentToken ( Token  indentTrigger,
ref Maybe< Token > tokenAfterward,
bool  newlineAfter 
)
protected pure virtual

Returns a token to represent indentation, or null to suppress generating an indent-dedent pair at this point.

Parameters
indentTrigger: The token that triggered this function call.
tokenAfterward: The token after the indent trigger, or NoValue at EOF.
newlineAfter: true if the next non-whitespace token after indentTrigger is on a different line, or if EOF comes afterward.

override Maybe<Token> Loyc.Syntax.Lexing.IndentTokenGenerator< Token >.NextToken ( )
inline virtual

Returns the next (postprocessed) token. This method should set the _current field to the returned value.

Implements Loyc.Syntax.Lexing.LexerWrapper< Token >.

References Loyc.Syntax.Lexing.Token.Value.

Property Documentation

Token DedentToken [get, set]

Gets or sets the prototype token for unindentation markers.

The StartIndex is updated for each actual token emitted.

Token EolToken [get, set]

Gets or sets the prototype token for end-statement (a.k.a. end-of-line) markers, cast to an integer as required by Token. Use null to avoid generating such markers.

Note: if the last token on a line has this same type, this class will not generate an extra newline token.

The StartIndex is updated for each actual token emitted.

Token IndentToken [get, set]

Gets or sets the prototype token for indentation markers.

The StartIndex is updated for each actual token emitted.