Enhanced C#
Language of your choice: library documentation
Loyc.Syntax.Lexing.Token Struct Reference

A common token type recommended for Loyc languages that want to use features such as token literals or the TokensToTree class.


Inheritance diagram for Loyc.Syntax.Lexing.Token: implements Loyc.Collections.IListSource< Token > and Loyc.Syntax.Lexing.IToken< int >.

Remarks

A common token type recommended for Loyc languages that want to use features such as token literals or the TokensToTree class.

For performance reasons, a Token ought to be a structure rather than a class. But if Token is a struct, we have a conundrum: how do we support tokens from different languages? (We can't use inheritance in structs.)

Luckily, tokens in most languages are very similar. A four-word structure generally suffices:

  1. TypeInt: each language can use a different set of token types represented by a different enum. All enums can be converted to an integer, so Token uses Int32 as the token type. In order to support DSLs via token literals (e.g. LLLPG is a DSL inside EC#), the TypeInt should be based on TokenKind.
  2. Value: this can be any object. For literals, this should be the actual value of the literal; for whitespace, it should be WhitespaceTag.Value; and so on. See Value for the complete list.
  3. StartIndex: location in the original source file where the token starts.
  4. Length: length of the token in the source file (32 bits).

Since 64-bit platforms are very common, the Value is 64 bits, and padding increases the structure size from 16 bytes to 24. Given this reality, it was decided to fill in the 4 bytes of padding with additional information:

  1. Style: 8 bits of style information, e.g. it can be used to mark whether integer literals use hexadecimal, binary or decimal format.
  2. TextValue range: some constructors create an "uninterpreted literal", which keeps track of two values: the text of the literal, obtainable by calling TextValue(ICharSource), and a type marker returned from TypeMarker (uninterpreted literals do not use the Value property). 16 bits of information enable the TextValue feature to work without memory allocation in many cases; see the documentation of the constructor Token(int, int, UString, NodeStyle, Symbol, int, int) for more information about the purpose and usage of this feature.
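
The padding argument above can be checked with a small sketch. The two structs below are hypothetical stand-ins, not the real Token; they only mirror the fields named in the lists. The sketch assumes .NET Core 3.0 or later, where Unsafe.SizeOf is available, and the sizes it prints depend on the runtime and platform.

```csharp
using System;
using System.Runtime.CompilerServices;

// Hypothetical stand-in: just the four "words" described above.
struct FourWordToken
{
    public int TypeInt;      // token type as an integer
    public object Value;     // heap reference: 8 bytes on a 64-bit runtime
    public int StartIndex;   // where the token starts in the source
    public int Length;       // length of the token in the source
}

// Hypothetical stand-in: the same fields plus 32 extra bits.
struct PackedToken
{
    public int TypeInt;
    public object Value;
    public int StartIndex;
    public int Length;
    public int Stuff;        // style, offsets and flags in what would otherwise be padding
}

static class LayoutDemo
{
    public static int SizeOfFourWord() => Unsafe.SizeOf<FourWordToken>();
    public static int SizeOfPacked() => Unsafe.SizeOf<PackedToken>();

    static void Main()
    {
        Console.WriteLine($"pointer size: {IntPtr.Size} bytes");
        Console.WriteLine($"four fields:  {SizeOfFourWord()} bytes");
        Console.WriteLine($"plus Stuff:   {SizeOfPacked()} bytes");
    }
}
```

On a 64-bit runtime the two sizes would be expected to match, which is the reason the extra field costs nothing there.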

To save space (and because .NET doesn't handle large structures well), tokens do not know what source file they came from and cannot convert their location to a line number. For this reason, one should keep a reference to the ISourceFile separately. You can then call SourceText(ISourceFile.Text) to get the original source text, or IIndexToLine.IndexToLine(int) to get the source line number.

A generic token also cannot convert itself to a properly-formatted string. The ToString method does allow you to provide an optional reference to ICharSource which allows the token to get its original text, and in any case you can call SetToStringStrategy to control the method by which a token converts itself to a string.

Fun fact: originally I planned to use Symbol as the common token type, because it is extensible and could nicely represent tokens in all languages; unfortunately, Symbol may reduce parsing performance because it cannot be used with the switch opcode (i.e. the switch statement in C#), so I decided to indicate token types via integers instead. Each language should have, in the namespace of that language, an extension method public static TokenType Type(this Token t) that converts the TypeInt to the enum type for that language. Optionally, the TokenType enum for your language can be based on TokenKind so that the Kind property returns a meaningful value.
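
A minimal sketch of the per-language Type() extension method described above. MiniToken is a hypothetical stand-in for Token that models only TypeInt, and the TokenType values are invented for an imaginary language; with the real library, the method would take a Loyc.Syntax.Lexing.Token instead.

```csharp
using System;

namespace MyLanguage
{
    // Hypothetical stand-in for Loyc.Syntax.Lexing.Token; only TypeInt is modeled.
    public readonly struct MiniToken
    {
        public int TypeInt { get; }
        public MiniToken(int typeInt) => TypeInt = typeInt;
    }

    // Hypothetical token types for an imaginary language.
    public enum TokenType { EOF = 0, Id = 1, Number = 2, Plus = 3 }

    public static class TokenExt
    {
        // Converts the language-neutral integer back to this language's enum.
        public static TokenType Type(this MiniToken t) => (TokenType)t.TypeInt;
    }

    public static class Demo
    {
        // An integer-backed enum can be used directly in a switch statement.
        public static string Describe(MiniToken t)
        {
            switch (t.Type())
            {
                case TokenType.Id:     return "identifier";
                case TokenType.Number: return "number";
                case TokenType.Plus:   return "operator";
                default:               return "other";
            }
        }

        public static void Main()
        {
            var tok = new MiniToken((int)TokenType.Number);
            Console.WriteLine(Describe(tok)); // prints "number"
        }
    }
}
```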

Public fields

int TypeInt => _typeInt
 Token type.
 
int StartIndex => _startIndex
 Location in the original source file where the token starts, or -1 for a synthetic token.
 
int Length => _length
 Length of the token in the source file, or 0 for a synthetic or implied token.
 
object Value => IsUninterpretedLiteral ? null : _value
 The parsed value of the token, if this structure was initialized with one of the constructors that accepts a value.
 
bool IsUninterpretedLiteral => (_stuff & 0x01000000) != 0
 
NodeStyle Style => (NodeStyle)_stuff
 8 bits of nonsemantic information about the token. The style is used to distinguish hex literals from decimal literals, or triple-quoted strings from double-quoted strings.
 
int ISimpleToken< int >. Type => TypeInt
 

Public static fields

static readonly ThreadLocalVariable< Func< Token, ICharSource, string > > ToStringStrategyTLV = new ThreadLocalVariable<Func<Token,ICharSource,string>>(Loyc.Syntax.Les.TokenExt.ToString)
 

Properties

TokenKind Kind [get]
 Token category. This value is only meaningful if the token type integers are based on TokenKinds. Token types for LES and Enhanced C# are, indeed, based on TokenKind.
 
Symbol TypeMarker [get]
 Gets the type marker stored in this token, if this token was initialized with one of the constructors that accepts a type marker.
 
TokenTree Children [get]
 Returns Value as TokenTree (null if not a TokenTree).
 
int EndIndex [get]
 Returns StartIndex + Length.
 
bool IsWhitespace [get]
 Returns true if Value == WhitespaceTag.Value.
 
static Func< Token, ICharSource, string > ToStringStrategy [get, set]
 Gets or sets the strategy used by ToString.
 
Token this[int index] [get]
 
int? Count [get]
 
IListSource< IToken< int > > IToken< int >. Children [get]
 
- Properties inherited from Loyc.Syntax.Lexing.IToken< int >
int Length [get]
 
TokenKind Kind [get]
 
IListSource< IToken< TT > > Children [get]
 

Public Member Functions

 Token (int type, int startIndex, int length, NodeStyle style=0, object value=null)
 Initializes the Token structure.
 
 Token (int type, int startIndex, int length, object value)
 
 Token (int type, int startIndex, UString sourceText, NodeStyle style, Symbol typeMarker, int substringStart, int substringEnd)
 Initializes an "uninterpreted literal" token designed to store two parts of a literal without allocating extra memory (see the Remarks for details).
 
 Token (int type, int startIndex, int length, NodeStyle style, Symbol typeMarker, UString textValue)
 Initializes an "uninterpreted literal" token (see the Remarks).
 
 Token (int type, int startIndex, UString sourceText, NodeStyle style, object valueOrTypeMarker, UString textValue)
 Initializes a kind of token designed to store two parts of a literal (see the Remarks for details).
 
bool Is (int type, object value)
 Returns true if the specified type and value match this token.
 
SourceRange Range (ISourceFile sf)
 Gets the SourceRange of a token, under the assumption that the token came from the specified source file.
 
SourceRange Range (ILexer< Token > l)
 
UString SourceText (ICharSource chars)
 Gets the original source text for a token if available, under the assumption that the specified source file correctly specifies where the token came from. If the token is synthetic, returns UString.Null.
 
UString SourceText (ILexer< Token > l)
 
UString TextValue (ICharSource source)
 Helps get the "text value" from tokens that used one of the constructors designed to support this use case, e.g. Token(int type, int startIndex, UString sourceText, NodeStyle style, Symbol typeMarker, int substringStart, int substringEnd). If one of the other constructors was used, this function returns the same value as SourceText(ICharSource).
 
UString TextValue (ILexer< Token > source)
 
override string ToString ()
 Reconstructs a string that represents the token, if possible. Does not work for whitespace and comments, because the value of these token types is stored in the original source file and for performance reasons is not copied to the token.
 
string ToString (ICharSource sourceText)
 Gets the original text of the token, if you provide a reference to the original source code text. Note: the method used to convert the token to a string can be overridden with SetToStringStrategy.
 
override bool Equals (object obj)
 
bool Equals (Token other)
 Equality depends on TypeInt and Value, but not StartIndex and Length (this is the same equality condition as LNode).
 
override int GetHashCode ()
 
Token TryGet (int index, out bool fail)
 
IEnumerator< Token > GetEnumerator ()
 
System.Collections.IEnumerator System.Collections.IEnumerable. GetEnumerator ()
 
IRange< Token > IListSource< Token >. Slice (int start, int count)
 
Slice_< Token > Slice (int start, int count)
 
IToken< int > IToken< int >. WithType (int type)
 
Token WithType (int type)
 
IToken< int > IToken< int >. WithValue (object value)
 
Token WithValue (object value)
 
Token WithRange (int startIndex, int endIndex)
 
Token WithStartIndex (int startIndex)
 
IToken< int > ICloneable< IToken< int > >. Clone ()
 
LNode ToLNode (ISourceFile file)
 
- Public Member Functions inherited from Loyc.Collections.IListSource< Token >
IRange< T > Slice (int start, int count=int.MaxValue)
 Returns a sub-range of this list.
 
- Public Member Functions inherited from Loyc.Syntax.Lexing.IToken< int >
IToken< TT > WithType (int type)
 
IToken< TT > WithValue (object value)
 

Static Public Member Functions

static int Stuff (NodeStyle style, byte substringOffset, byte substringOffsetFromEnd, bool isUninterpretedLiteral)
 
static SavedValue< Func< Token, ICharSource, string > > SetToStringStrategy (Func< Token, ICharSource, string > newValue)
 
static bool IsOpener (TokenKind tt)
 
static bool IsCloser (TokenKind tt)
 
static bool IsOpenerOrCloser (TokenKind tt)
 
static Symbol GetParenPairSymbol (TokenKind k, TokenKind k2)
 

Constructor & Destructor Documentation

◆ Token() [1/4]

Loyc.Syntax.Lexing.Token.Token ( int  type,
int  startIndex,
int  length,
NodeStyle  style = 0,
object  value = null 
)
inline

Initializes the Token structure.

Parameters
  type: Value of TypeInt
  startIndex: Value of StartIndex
  length: Value of Length
  style: Value of Style
  value: Value of Value

◆ Token() [2/4]

Loyc.Syntax.Lexing.Token.Token ( int  type,
int  startIndex,
UString  sourceText,
NodeStyle  style,
Symbol  typeMarker,
int  substringStart,
int  substringEnd 
)
inline

Initializes an "uninterpreted literal" token designed to store two parts of a literal without allocating extra memory (see the Remarks for details).

Parameters
  type: Value of TypeInt
  startIndex: Value of StartIndex
  sourceText: A substring of the token in the original source file, such that Length will be sourceText.Length and sourceText.Substring(substringStart - startIndex, substringEnd - substringStart) will be returned from TextValue(ICharSource). For correct results, the ICharSource passed to TextValue later needs to represent the same string that was used to produce this parameter.
  style: Value of Style
  typeMarker: Value of TypeMarker.
  substringStart: Index where the TextValue starts in the source code; should be equal to or greater than startIndex.
  substringEnd: Index where the TextValue ends in the source code; should be equal to or less than startIndex + sourceText.Length.

Literals in many languages can be broken into two textual parts: their type and their value. For example, in some languages you can write 123.5f, where "f" indicates that the floating-point value has a size of 32 bits. C++ strings have up to three parts, as in u"Hello"_UD: u indicates the character type (u = 16-bit unicode) while _UD indicates that the string should be interpreted in a user-defined way. In LES3, all literals have two parts: value text and a type marker. For example, 123.5f has a text "123.5" and type marker "_f"; greeting"Hello" has text "Hello" and type marker "greeting"; and a simple number like 123 has text "123" and type marker "_".

This constructor allows you to represent up to two "values" in a single token without necessarily allocating memory for them, even though a Token contains only a single heap reference. When calling this constructor, the second value, called the "TextValue", must be a substring of the token's original source text; for example, given the token "Hello", the tokenizer would use Hello as the TextValue. Rather than allocating a string "Hello" and storing it in the token, you can use this constructor to record the fact that the string Hello begins one character after the beginning of the token (substringStart = startIndex + 1) and ends one character before the end of the token (substringEnd = startIndex + sourceText.Length - 1). When using this constructor, the Token's Value property returns null; internally the value reference points to the type marker, which is returned from the TypeMarker property rather than Value.

Since a Token does not have a reference to its own source file (ISourceFile), the language parser will need to use the TextValue(ICharSource) method to retrieve the value text later.

Token is a small structure that allocates only 8 bits for the offset between the TextValue and the beginning/end of the sourceText (16 bits total). If the start offset is above 254, the TextValue is combined with the TypeMarker in a heap object of type Tuple<Symbol, UString>, but this is a hidden implementation detail.
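
The offset scheme can be sketched as follows. This is only an illustration under stated assumptions: TextValueOffsets, TryEncode and Decode are hypothetical names, plain strings stand in for UString and ICharSource, and applying the 254 limit to both offsets is an assumption (the text above only mentions the start offset).

```csharp
using System;

// Hypothetical sketch; the real Token keeps these offsets in private bits.
static class TextValueOffsets
{
    // Tries to express the TextValue range as two byte-sized offsets from the
    // token's start and end. Returns false when an offset does not fit, which
    // is the case where a heap object would be needed instead.
    public static bool TryEncode(int startIndex, int tokenLength,
                                 int substringStart, int substringEnd,
                                 out byte offsetFromStart, out byte offsetFromEnd)
    {
        int fromStart = substringStart - startIndex;
        int fromEnd = (startIndex + tokenLength) - substringEnd;
        bool fits = fromStart >= 0 && fromEnd >= 0
                    && fromStart <= 254 && fromEnd <= 254   // assumed limit for both
                    && fromStart + fromEnd <= tokenLength;
        offsetFromStart = fits ? (byte)fromStart : (byte)0;
        offsetFromEnd = fits ? (byte)fromEnd : (byte)0;
        return fits;
    }

    // Recovers the TextValue from the full source text and the two offsets.
    public static string Decode(string source, int startIndex, int tokenLength,
                                byte offsetFromStart, byte offsetFromEnd)
    {
        int start = startIndex + offsetFromStart;
        int length = tokenLength - offsetFromStart - offsetFromEnd;
        return source.Substring(start, length);
    }

    static void Main()
    {
        string source = "x = \"Hello\";";
        int startIndex = source.IndexOf('"');   // token starts at the opening quote
        int tokenLength = "\"Hello\"".Length;   // 7 characters, quotes included
        if (TryEncode(startIndex, tokenLength, startIndex + 1, startIndex + tokenLength - 1,
                      out byte a, out byte b))
            Console.WriteLine(Decode(source, startIndex, tokenLength, a, b)); // prints "Hello"
    }
}
```

Storing offsets relative to both ends keeps the numbers small even for long tokens, because only the prefix and suffix lengths need to fit in a byte.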

For strings that contain escape sequences, such as "Hello\n", you may prefer to store a parsed version of the string in the Token. There is another constructor for this purpose, which always allocates memory: Token(int, int, int, NodeStyle, Symbol, UString).

References Loyc.UString.Length, and Loyc.UString.Slice().

◆ Token() [3/4]

Loyc.Syntax.Lexing.Token.Token ( int  type,
int  startIndex,
int  length,
NodeStyle  style,
Symbol  typeMarker,
UString  textValue 
)
inline

Initializes an "uninterpreted literal" token (see the Remarks).

Parameters
  type: Value of TypeInt
  startIndex: Value of StartIndex
  length: Value of Length
  style: Value of Style.
  typeMarker: Value of TypeMarker.
  textValue: Value returned from TextValue(ICharSource).

As explained in the documentation of the other constructor, Token(int, int, UString, NodeStyle, Symbol, int, int), some literals have two parts, which we call the TypeMarker and the TextValue. Since the Token structure only contains a single heap reference, this constructor combines TypeMarker with TextValue in a heap object, but this is a hidden implementation detail; just use TypeMarker and TextValue(ICharSource) to retrieve the values.

◆ Token() [4/4]

Loyc.Syntax.Lexing.Token.Token ( int  type,
int  startIndex,
UString  sourceText,
NodeStyle  style,
object  valueOrTypeMarker,
UString  textValue 
)
inline

Initializes a kind of token designed to store two parts of a literal (see the Remarks for details).

Parameters
  type: Value of TypeInt
  startIndex: Value of StartIndex
  sourceText: A substring of the token in the original source file (something returned from ICharSource.Slice(int, int)), such that Length will be sourceText.Length and SourceText(ICharSource) will return this same string if it is correctly given the same ICharSource object.
  style: Value of Style.
  valueOrTypeMarker: Value of TypeMarker if you are creating an uninterpreted literal or Value if you are not (according to the textValue parameter).
  textValue: If this Token does NOT represent an uninterpreted literal, this parameter must be default(UString). In any case, this parameter will become the value of TextValue(ICharSource) if that method is correctly given the same ICharSource object from which sourceText was extracted.

As explained in the documentation of the other constructor, Token(int, int, UString, NodeStyle, Symbol, int, int), some literals have two parts, which we call the Value and the TextValue. This constructor is designed to be used when the TextValue is sometimes a substring of the source code and sometimes merely derived from the source code. For example, given the literal "Hello", the correct TextValue is the five characters Hello, but given the C literal "Hi!\n", you may wish to translate the escape characters in the lexer and create a Token that refers to the four decoded characters (H, i, ! and a newline) rather than the five characters Hi!\n in the original source code.

This constructor uses memory intelligently. If textValue is a substring of sourceText, or if textValue.Length is zero, it will avoid allocating memory for a reference to textValue (the optimization is described in more detail in the other constructor's documentation.)

References Loyc.UString.InternalString, Loyc.UString.IsNull, and Loyc.UString.Length.

Member Function Documentation

◆ Equals()

bool Loyc.Syntax.Lexing.Token.Equals ( Token  other)
inline

Equality depends on TypeInt and Value, but not StartIndex and Length (this is the same equality condition as LNode).

References Loyc.Syntax.Lexing.Token.TypeInt, and Loyc.Syntax.Lexing.Token.Value.
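
The equality rule can be illustrated with a stand-in type. MiniToken below is hypothetical and models only the four fields involved; comparing Value with object.Equals is an assumption about how the comparison might be done, not a statement about the library's code.

```csharp
using System;

// Hypothetical stand-in for Token, modeling only the fields relevant to equality.
readonly struct MiniToken : IEquatable<MiniToken>
{
    public int TypeInt { get; }
    public object Value { get; }
    public int StartIndex { get; }
    public int Length { get; }

    public MiniToken(int typeInt, int startIndex, int length, object value)
    {
        TypeInt = typeInt;
        StartIndex = startIndex;
        Length = length;
        Value = value;
    }

    // Compares type and value only; the position in the source is ignored.
    public bool Equals(MiniToken other) =>
        TypeInt == other.TypeInt && object.Equals(Value, other.Value);

    public override bool Equals(object obj) => obj is MiniToken t && Equals(t);

    // Hash only the fields that take part in equality.
    public override int GetHashCode() => TypeInt ^ (Value?.GetHashCode() ?? 0);
}

static class EqualityDemo
{
    // The same identifier found at two different places in a file.
    public static bool SameTokenDifferentPlace()
    {
        var first = new MiniToken(1, 0, 3, "foo");
        var second = new MiniToken(1, 40, 3, "foo");
        return first.Equals(second);
    }

    static void Main() => Console.WriteLine(SameTokenDifferentPlace()); // prints "True"
}
```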

◆ Is()

bool Loyc.Syntax.Lexing.Token.Is ( int  type,
object  value 
)

Returns true if the specified type and value match this token.

◆ Range()

SourceRange Loyc.Syntax.Lexing.Token.Range ( ISourceFile  sf)
inline

◆ SourceText()

UString Loyc.Syntax.Lexing.Token.SourceText ( ICharSource  chars)
inline

Gets the original source text for a token if available, under the assumption that the specified source file correctly specifies where the token came from. If the token is synthetic, returns UString.Null.

References Loyc.Syntax.Lexing.Token.Length, Loyc.Collections.ICharSource.Slice(), and Loyc.Syntax.Lexing.Token.StartIndex.

◆ TextValue()

UString Loyc.Syntax.Lexing.Token.TextValue ( ICharSource  source)
inline

Helps get the "text value" from tokens that used one of the constructors designed to support this use case, e.g. Token(int type, int startIndex, UString sourceText, NodeStyle style, Symbol typeMarker, int substringStart, int substringEnd). If one of the other constructors was used, this function returns the same value as SourceText(ICharSource).

Parameters
  source: Original source code or lexer from which this token was derived.

References Loyc.Syntax.Lexing.Token.Length, Loyc.Collections.ICharSource.Slice(), and Loyc.Syntax.Lexing.Token.StartIndex.

Referenced by Loyc.Syntax.LNodeFactory.Literal(), and Loyc.Syntax.LNodeFactory.UninterpretedLiteral().

◆ ToString() [1/2]

override string Loyc.Syntax.Lexing.Token.ToString ( )

Reconstructs a string that represents the token, if possible. Does not work for whitespace and comments, because the value of these token types is stored in the original source file and for performance reasons is not copied to the token.

This does not return the original source text; it uses the stringizer in ToStringStrategy, which can be overridden with language-specific behavior by calling SetToStringStrategy.

The returned string, in general, will not match the original token, since the ToStringStrategy does not have access to the original source file.
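
The idea of a replaceable, per-thread stringizer can be sketched as below. Everything here is a hypothetical stand-in: MiniToken is not the real Token, the strategy takes no ICharSource, and the IDisposable handle only approximates what the SavedValue returned by the real SetToStringStrategy is presumed to do.

```csharp
using System;
using System.Threading;

// Hypothetical stand-in for Token with a pluggable ToString strategy.
readonly struct MiniToken
{
    public int TypeInt { get; }
    public object Value { get; }
    public MiniToken(int typeInt, object value) { TypeInt = typeInt; Value = value; }

    // Per-thread strategy; the default simply prints the value.
    static readonly ThreadLocal<Func<MiniToken, string>> Strategy =
        new ThreadLocal<Func<MiniToken, string>>(() => DefaultToString);

    static string DefaultToString(MiniToken t) => t.Value?.ToString() ?? "";

    // Installs a new strategy and returns a handle that restores the old one.
    public static IDisposable SetToStringStrategy(Func<MiniToken, string> strategy)
    {
        var old = Strategy.Value;
        Strategy.Value = strategy;
        return new Restore(() => Strategy.Value = old);
    }

    public override string ToString() => Strategy.Value(this);

    sealed class Restore : IDisposable
    {
        readonly Action _undo;
        public Restore(Action undo) { _undo = undo; }
        public void Dispose() => _undo();
    }
}

static class Demo
{
    public static string Run()
    {
        var tok = new MiniToken(2, 42);
        string before = tok.ToString();
        string during;
        using (MiniToken.SetToStringStrategy(t => $"<{t.TypeInt}:{t.Value}>"))
            during = tok.ToString();          // language-specific formatting
        string after = tok.ToString();        // previous strategy restored
        return $"{before} {during} {after}";
    }

    static void Main() => Console.WriteLine(Run()); // prints "42 <2:42> 42"
}
```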

◆ ToString() [2/2]

string Loyc.Syntax.Lexing.Token.ToString ( ICharSource  sourceText)

Gets the original text of the token, if you provide a reference to the original source code text. Note: the method used to convert the token to a string can be overridden with SetToStringStrategy.

Member Data Documentation

◆ Length

int Loyc.Syntax.Lexing.Token.Length => _length

◆ StartIndex

int Loyc.Syntax.Lexing.Token.StartIndex => _startIndex

◆ Style

NodeStyle Loyc.Syntax.Lexing.Token.Style => (NodeStyle)_stuff

8 bits of nonsemantic information about the token. The style is used to distinguish hex literals from decimal literals, or triple-quoted strings from double-quoted strings.

Referenced by Loyc.Syntax.LNodeFactory.LiteralFromValueOf(), Loyc.Syntax.Lexing.TokenTree.TokenToLNode(), Loyc.Syntax.Les.TokenExt.ToString(), and Loyc.Ecs.Parser.TokenExt.ToString().
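
A rough sketch of how a style byte, two offsets and a flag might share one 32-bit integer. The low 8 bits for the style and the 0x01000000 flag follow the member definitions shown on this page; the bit positions of the two offsets, the MiniStyle enum and all method names are assumptions made for illustration.

```csharp
using System;

// Hypothetical stand-in for NodeStyle; the member values are invented.
enum MiniStyle : byte { Default = 0, HexLiteral = 5 }

static class StuffBits
{
    const int UninterpretedFlag = 0x01000000; // bit tested by IsUninterpretedLiteral

    // Packs everything into one int. Offset positions (bits 8-15 and 16-23) are assumed.
    public static int Pack(MiniStyle style, byte offset, byte offsetFromEnd, bool uninterpreted)
        => (byte)style | (offset << 8) | (offsetFromEnd << 16)
           | (uninterpreted ? UninterpretedFlag : 0);

    // The style is the low 8 bits; the cast to byte discards everything else.
    public static MiniStyle Style(int stuff) => (MiniStyle)unchecked((byte)stuff);
    public static int Offset(int stuff) => (stuff >> 8) & 0xFF;
    public static int OffsetFromEnd(int stuff) => (stuff >> 16) & 0xFF;
    public static bool IsUninterpreted(int stuff) => (stuff & UninterpretedFlag) != 0;

    static void Main()
    {
        int stuff = Pack(MiniStyle.HexLiteral, 1, 2, true);
        Console.WriteLine($"{Style(stuff)} {Offset(stuff)} {OffsetFromEnd(stuff)} {IsUninterpreted(stuff)}");
    }
}
```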

◆ TypeInt

int Loyc.Syntax.Lexing.Token.TypeInt => _typeInt

◆ Value

object Loyc.Syntax.Lexing.Token.Value => IsUninterpretedLiteral ? null : _value

The parsed value of the token, if this structure was initialized with one of the constructors that accepts a value.

Recommended ways to use this property:

  • For strings: the parsed value of the string (no quotes, escape sequences removed), i.e. a boxed char or a string. A backquoted string in EC#/LES is converted to a Symbol because it is a kind of operator.
  • For numbers: the parsed value of the number (e.g. 4 => int, 4L => long, 4.0f => float)
  • For identifiers: the parsed name of the identifier, as a Symbol (e.g. x => x, @for => for, @`1+1` => 1+1)
  • For any keyword including AttrKeyword and TypeKeyword tokens: a Symbol containing the name of the keyword, with "#" prefix
  • For punctuation and operators: the text of the punctuation as a Symbol.
  • For openers (open paren, open brace, etc.): null for normal linear parsers. If the tokens have been processed by TokensToTree, this will be a TokenTree.
  • For spaces and comments: for performance reasons, it is not recommended to extract the text of whitespace from the source file; instead, use WhitespaceTag.Value
  • When no value is needed (because the Type() is enough): null

Referenced by Loyc.Syntax.Lexing.Token.Equals(), Loyc.Syntax.LNodeFactory.LiteralFromValueOf(), Loyc.Ecs.Parser.EcsTriviaInjector.MakeTriviaAttribute(), Loyc.Syntax.StandardTriviaInjector.MakeTriviaAttribute(), Loyc.Syntax.Lexing.TokenTree.TokenToLNode(), Loyc.Syntax.Les.TokenExt.ToString(), and Loyc.Ecs.Parser.TokenExt.ToString().

Property Documentation

◆ Children

TokenTree Loyc.Syntax.Lexing.Token.Children
get

Returns Value as TokenTree (null if not a TokenTree).

Referenced by Loyc.Syntax.Lexing.TokenTree.TokenToLNode().

◆ EndIndex

int Loyc.Syntax.Lexing.Token.EndIndex
get

◆ IsWhitespace

bool Loyc.Syntax.Lexing.Token.IsWhitespace
get

Returns true if Value == WhitespaceTag.Value.

◆ Kind

TokenKind Loyc.Syntax.Lexing.Token.Kind
get

Token category. This value is only meaningful if the token type integers are based on TokenKinds. Token types for LES and Enhanced C# are, indeed, based on TokenKind.

Referenced by Loyc.Syntax.Lexing.TokenTree.TokenToLNode(), Loyc.Syntax.Les.TokenExt.ToString(), and Loyc.Ecs.Parser.TokenExt.ToString().

◆ ToStringStrategy

Func<Token, ICharSource, string> Loyc.Syntax.Lexing.Token.ToStringStrategy
static, get, set

Gets or sets the strategy used by ToString.

◆ TypeMarker

Symbol Loyc.Syntax.Lexing.Token.TypeMarker
get

Gets the type marker stored in this token, if this token was initialized with one of the constructors that accepts a type marker.

Referenced by Loyc.Syntax.LNodeFactory.Literal(), and Loyc.Syntax.LNodeFactory.UninterpretedLiteral().