A common token type recommended for Loyc languages that want to use features such as token literals or the TokensToTree class.

Source file:

/Core/Loyc.Syntax/Lexing/Token.cs

Remarks

For performance reasons, a Token ought to be a structure rather than a class. But if Token is a struct, we have a conundrum: how do we support tokens from different languages? (We can't use inheritance in structs.)

Luckily, tokens in most languages are very similar. A four-word structure generally suffices:

TypeInt: each language can use a different set of token types represented by a different enum. All enums can be converted to an integer, so Token uses Int32 as the token type. In order to support DSLs via token literals (e.g. LLLPG is a DSL inside EC#), the TypeInt should be based on TokenKind.
Value: this can be any object. For literals, this should be the actual value of the literal, for whitespace it should be WhitespaceTag.Value, etc. See Value for the complete list.
StartIndex: location in the original source file where the token starts.
Length: length of the token in the source file (32 bits).

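Since 64-bit platforms are very common, the Value reference is usually 64 bits. The three 32-bit integers plus an 8-byte reference occupy 20 bytes, which alignment padding rounds up to 24 (on a 32-bit platform the structure would be 16 bytes). Given this reality, it was decided to fill in the 4 bytes of padding with additional information: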

Style: 8 bits of style information, e.g. it can be used to mark whether integer literals use hexadecimal, binary or decimal format.
TextValue range: some constructors create an "uninterpreted literal" which is able to keep track of two values: the text of a literal, obtainable by calling TextValue(ICharSource), plus a type marker returned from TypeMarker (uninterpreted literals do not use the Value property). 16 bits of information enable the TextValue feature to work without memory allocation in many cases; see the documentation of the constructor Token(int, int, UString, NodeStyle, Symbol, int, int) for more information about the purpose and usage of this feature.
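
The way the 32 spare bits might be divided can be sketched as follows. This is a minimal, self-contained illustration and not the actual Token source. The low 8 bits for the style and the mask 0x01000000 for the uninterpreted-literal flag are taken from the member list on this page; the positions of the two 8-bit offsets (bits 8-15 and 16-23) and the names StuffSketch, Pack, OffsetFromStart and OffsetFromEnd are assumptions made for the example.

```csharp
using System;

// Hypothetical stand-in for the packed 32-bit field; not the real Token code.
public static class StuffSketch
{
    const int UninterpretedFlag = 0x01000000; // mask shown for IsUninterpretedLiteral
    const int StartOffsetShift = 8;           // assumed position of the start offset
    const int EndOffsetShift = 16;            // assumed position of the end offset

    // Packs a style (low 8 bits), two 8-bit offsets and a flag into one int.
    public static int Pack(byte style, byte offsetFromStart, byte offsetFromEnd, bool uninterpreted)
    {
        return style
             | (offsetFromStart << StartOffsetShift)
             | (offsetFromEnd << EndOffsetShift)
             | (uninterpreted ? UninterpretedFlag : 0);
    }

    // Each accessor truncates to 8 bits after shifting its field into place.
    public static byte Style(int stuff) => (byte)stuff;
    public static byte OffsetFromStart(int stuff) => (byte)(stuff >> StartOffsetShift);
    public static byte OffsetFromEnd(int stuff) => (byte)(stuff >> EndOffsetShift);
    public static bool IsUninterpreted(int stuff) => (stuff & UninterpretedFlag) != 0;

    public static void Main()
    {
        int stuff = Pack(style: 3, offsetFromStart: 1, offsetFromEnd: 1, uninterpreted: true);
        Console.WriteLine($"0x{stuff:X8}");
        Console.WriteLine($"{Style(stuff)} {OffsetFromStart(stuff)} {OffsetFromEnd(stuff)} {IsUninterpreted(stuff)}");
    }
}
```

Because every piece lives in a fixed bit range of one integer, reading the style or the flag costs a mask or a cast and never touches the heap.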

To save space (and because .NET doesn't handle large structures well), tokens do not know what source file they came from and cannot convert their location to a line number. For this reason, one should keep a reference to the ISourceFile separately. You can then call SourceText(ISourceFile.Text) to get the original source text, or IIndexToLine.IndexToLine(int) to get the source line number.

A generic token also cannot convert itself to a properly-formatted string. The ToString method does allow you to provide an optional reference to ICharSource which allows the token to get its original text, and in any case you can call SetToStringStrategy to control the method by which a token converts itself to a string.

Fun fact: originally I planned to use Symbol as the common token type, because it is extensible and could nicely represent tokens in all languages; unfortunately, Symbol may reduce parsing performance because it cannot be used with the switch opcode (i.e. the switch statement in C#), so I decided to indicate token types via integers instead. Each language should have, in the namespace of that language, an extension method public static TokenType Type(this Token t) that converts the TypeInt to the enum type for that language. Optionally, the TokenType enum for your language can be based on TokenKind so that the Kind property returns a meaningful value.
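
The per-language extension method described above could look roughly like this. The sketch is self-contained, so it declares a stand-in Token struct with only a TypeInt member; in real code the method would extend Loyc.Syntax.Lexing.Token instead. The namespace MyLang, the enum members and the Describe helper are hypothetical names chosen for illustration.

```csharp
using System;

namespace MyLang // hypothetical language namespace
{
    // Stand-in for Loyc.Syntax.Lexing.Token, reduced to the one member used here.
    public struct Token
    {
        public int TypeInt { get; }
        public Token(int typeInt) { TypeInt = typeInt; }
    }

    // Hypothetical token types; a real language might base these on TokenKind.
    public enum TokenType { EOF = 0, Id = 1, Number = 2, Plus = 3 }

    public static class TokenExt
    {
        // Converts the integer token type back to this language's enum.
        public static TokenType Type(this Token t) => (TokenType)t.TypeInt;
    }

    public static class Demo
    {
        // Integer-backed enum values can be used directly as switch cases.
        public static string Describe(Token t)
        {
            switch (t.Type())
            {
                case TokenType.Id: return "identifier";
                case TokenType.Number: return "number";
                case TokenType.Plus: return "operator";
                default: return "other";
            }
        }

        public static void Main()
        {
            Console.WriteLine(Describe(new Token((int)TokenType.Number)));
        }
    }
}
```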

Public fields
int	TypeInt => _typeInt
	Token type.

int	StartIndex => _startIndex
	Location in the original source file where the token starts, or -1 for a synthetic token.

int	Length => _length
	Length of the token in the source file, or 0 for a synthetic or implied token.

object	Value => IsUninterpretedLiteral ? null : _value
	The parsed value of the token, if this structure was initialized with one of the constructors that accepts a value.

bool	IsUninterpretedLiteral => (_stuff & 0x01000000) != 0

NodeStyle	Style => (NodeStyle)_stuff
	8 bits of nonsemantic information about the token. The style is used to distinguish hex literals from decimal literals, or triple-quoted strings from double-quoted strings.

int ISimpleToken< int >.	Type => TypeInt

Public static fields
static readonly ThreadLocalVariable< Func< Token, ICharSource, string > >	ToStringStrategyTLV = new ThreadLocalVariable<Func<Token,ICharSource,string>>(Loyc.Syntax.Les.TokenExt.ToString)

Properties
TokenKind	Kind `[get]`
	Token category. This value is only meaningful if the token type integers are based on TokenKinds. Token types for LES and Enhanced C# are, indeed, based on TokenKind.

Symbol	TypeMarker `[get]`
	Gets the type marker stored in this token, if this token was initialized with one of the constructors that accepts a type marker.

TokenTree	Children `[get]`
	Returns Value as TokenTree (null if not a TokenTree).

int	EndIndex `[get]`
	Returns StartIndex + Length.

bool	IsWhitespace `[get]`
	Returns true if Value == WhitespaceTag.Value.

static Func< Token, ICharSource, string >??	ToStringStrategy `[get, set]`
	Gets or sets the strategy used by ToString.

Token	this[int index] `[get]`

int?	Count `[get]`

IListSource< IToken< int > > IToken< int >.	Children `[get]`

Properties inherited from Loyc.Syntax.Lexing.IToken< int >
int	Length `[get]`

TokenKind	Kind `[get]`

IListSource< IToken< TT > >	Children `[get]`

Public Member Functions
	Token (int type, int startIndex, int length, NodeStyle style=0, object value=null)
	Initializes the Token structure.

	Token (int type, int startIndex, int length, object value)

	Token (int type, int startIndex, UString sourceText, NodeStyle style, Symbol typeMarker, int substringStart, int substringEnd)
	Initializes an "uninterpreted literal" token designed to store two parts of a literal without allocating extra memory (see the Remarks for details).

	Token (int type, int startIndex, int length, NodeStyle style, Symbol typeMarker, UString textValue)
	Initializes an "uninterpreted literal" token (see the Remarks).

	Token (int type, int startIndex, UString sourceText, NodeStyle style, object valueOrTypeMarker, UString textValue)
	Initializes a kind of token designed to store two parts of a literal (see the Remarks for details).

bool	Is (int type, object value)
	Returns true if the specified type and value match this token.

SourceRange	Range (ISourceFile sf)
	Gets the SourceRange of a token, under the assumption that the token came from the specified source file.

SourceRange	Range (ILexer< Token > l)

UString	SourceText (ICharSource chars)
	Gets the original source text for a token if available, under the assumption that the specified source file correctly specifies where the token came from. If the token is synthetic, returns UString.Null.

UString	SourceText (ILexer< Token > l)

UString	TextValue (ICharSource source)
	Helps get the "text value" from tokens that used one of the constructors designed to support this use case, e.g. Token(int type, int startIndex, UString sourceText, NodeStyle style, Symbol typeMarker, int substringStart, int substringEnd). If one of the other constructors was used, this function returns the same value as SourceText(ICharSource).

UString	TextValue (ILexer< Token > source)

override string	ToString ()
	Reconstructs a string that represents the token, if possible. Does not work for whitespace and comments, because the value of these token types is stored in the original source file and for performance reasons is not copied to the token.

string	ToString (ICharSource sourceText)
	Gets the original text of the token, if you provide a reference to the original source code text. Note: the method used to convert the token to a string can be overridden with SetToStringStrategy.

override bool	Equals (object obj)

bool	Equals (Token other)
	Equality depends on TypeInt and Value, but not StartIndex and Length (this is the same equality condition as LNode).

override int	GetHashCode ()

Token	TryGet (int index, out bool fail)

IEnumerator< Token >	GetEnumerator ()

System.Collections.IEnumerator System.Collections.IEnumerable.	GetEnumerator ()

IRange< Token > IListSource< Token >.	Slice (int start, int count)

Slice_< Token >	Slice (int start, int count)

IToken< int > IToken< int >.	WithType (int type)

Token	WithType (int type)

IToken< int > IToken< int >.	WithValue (object value)

Token	WithValue (object value)

Token	WithRange (int startIndex, int endIndex)

Token	WithStartIndex (int startIndex)

IToken< int > ICloneable< IToken< int > >.	Clone ()

LNode	ToLNode (ISourceFile file)

Public Member Functions inherited from Loyc.Collections.IListSource< Token >
IRange< T >	Slice (int start, int count=int.MaxValue)
	Returns a sub-range of this list.

Public Member Functions inherited from Loyc.Syntax.Lexing.IToken< int >
IToken< TT >	WithType (int type)

IToken< TT >	WithValue (object value)

Static Public Member Functions
static int	Stuff (NodeStyle style, byte substringOffset, byte substringOffsetFromEnd, bool isUninterpretedLiteral)

static SavedValue< Func< Token, ICharSource, string > >	SetToStringStrategy (Func< Token, ICharSource, string > newValue)

static bool	IsOpener (TokenKind tt)

static bool	IsCloser (TokenKind tt)

static bool	IsOpenerOrCloser (TokenKind tt)

static Symbol	GetParenPairSymbol (TokenKind k, TokenKind k2)

Constructor & Destructor Documentation

◆ Token() [1/4]

Loyc.Syntax.Lexing.Token.Token	(	int	type,
		int	startIndex,
		int	length,
		NodeStyle	style = `0`,
		object	value = `null`
	)

inline

Initializes the Token structure.

Parameters

type	Value of TypeInt
startIndex	Value of StartIndex
length	Value of Length
style	Value of Style
value	Value of Value

◆ Token() [2/4]

Loyc.Syntax.Lexing.Token.Token	(	int	type,
		int	startIndex,
		UString	sourceText,
		NodeStyle	style,
		Symbol	typeMarker,
		int	substringStart,
		int	substringEnd
	)

inline

Initializes an "uninterpreted literal" token designed to store two parts of a literal without allocating extra memory (see the Remarks for details).

Parameters

type	Value of TypeInt
startIndex	Value of StartIndex
sourceText	A substring of the token in the original source file, such that Length will be `sourceText.Length` and `sourceText.Substring(substringStart - startIndex, substringEnd - substringStart)` will be returned from TextValue(ICharSource). For correct results, the ICharSource passed to TextValue later needs to represent the same string that was used to produce this parameter.
style	Value of Style
typeMarker	Value of TypeMarker.
substringStart	Index where the TextValue starts in the source code; should be equal to or greater than startIndex.
substringEnd	Index where the TextValue ends in the source code; should be equal to or less than startIndex + sourceText.Length.

Literals in many languages can be broken into two textual parts: their type and their value. For example, in some languages you can write 123.5f, where "f" indicates that the floating-point value has a size of 32 bits. C++ strings have up to three parts, as in u"Hello"_UD: u indicates the character type (u = 16-bit Unicode) while _UD indicates that the string should be interpreted in a user-defined way. In LES3, all literals have two parts: value text and a type marker. For example, 123.5f has a text "123.5" and type marker "_f"; greeting"Hello" has text "Hello" and type marker "greeting"; and a simple number like 123 has text "123" and type marker "_".

This constructor allows you to represent up to two "values" in a single token without necessarily allocating memory for them, even though Tokens only contain a single heap reference. When calling this constructor, the second value, called the "TextValue", must be a substring of the token's original source text; for example, given the token "Hello", the tokenizer would use Hello as the TextValue. Rather than allocating a string "Hello" and storing it in the token, you can use this constructor to record the fact that the string Hello begins one character after the beginning of the token (substringStart = startIndex + 1) and ends one character before the end of the token (substringEnd = startIndex + sourceText.Length - 1). When using this constructor, the Token's Value property returns null; internally the value reference points to the type marker, which is returned from the TypeMarker property rather than Value.

Since a Token does not have a reference to its own source file (ISourceFile), the language parser will need to use the TextValue(ICharSource) method to retrieve the value text later.

Token is a small structure that allocates only 8 bits for the offset between the TextValue and the beginning/end of the sourceText (16 bits total). If the start offset is above 254, the TextValue is combined with the TypeMarker in a heap object of type Tuple<Symbol, UString>, but this is a hidden implementation detail.

For strings that contain escape sequences, such as "Hello\n", you may prefer to store a parsed version of the string in the Token. There is another constructor for this purpose, which always allocates memory: Token(int, int, int, NodeStyle, Symbol, UString).
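
The index arithmetic for the "Hello" example can be sketched with plain strings. This is a self-contained illustration that does not use the Loyc types: TextValueSketch, FitsInline and TextValue are hypothetical helpers, and the rule that both offsets must be at most 254 is an assumption extrapolated from the remark about the start offset above.

```csharp
using System;

public static class TextValueSketch
{
    // Returns true when both offsets fit the 8-bit budget described above.
    // Applying the same 254 limit to the end offset is an assumption.
    public static bool FitsInline(int startIndex, int tokenLength, int substringStart, int substringEnd)
    {
        int offsetFromStart = substringStart - startIndex;
        int offsetFromEnd = (startIndex + tokenLength) - substringEnd;
        return offsetFromStart >= 0 && offsetFromStart <= 254
            && offsetFromEnd >= 0 && offsetFromEnd <= 254;
    }

    // Recovers the text value from the full source, given absolute indexes.
    public static string TextValue(string source, int substringStart, int substringEnd)
        => source.Substring(substringStart, substringEnd - substringStart);

    public static void Main()
    {
        string source = "x = \"Hello\";";
        int startIndex = source.IndexOf('"');       // token starts at the opening quote
        int length = "\"Hello\"".Length;             // 7 characters including the quotes
        int substringStart = startIndex + 1;         // skip the opening quote
        int substringEnd = startIndex + length - 1;  // stop before the closing quote

        Console.WriteLine(TextValue(source, substringStart, substringEnd));
        Console.WriteLine(FitsInline(startIndex, length, substringStart, substringEnd));
    }
}
```

Only the two small offsets need to be stored with the token; the text itself is sliced out of the source on demand, which is why no string has to be allocated when the token is created.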

References Loyc.UString.Length, and Loyc.UString.Slice().

◆ Token() [3/4]

Loyc.Syntax.Lexing.Token.Token	(	int	type,
		int	startIndex,
		int	length,
		NodeStyle	style,
		Symbol	typeMarker,
		UString	textValue
	)

inline

Initializes an "uninterpreted literal" token (see the Remarks).

Parameters

type	Value of TypeInt
startIndex	Value of StartIndex
length	Value of Length
style	Value of Style.
typeMarker	Value of TypeMarker.
textValue	Value returned from TextValue(ICharSource).

As explained in the documentation of the other constructor (Token(int, int, UString, NodeStyle, Symbol, int, int)), some literals have two parts which we call the TypeMarker and the TextValue. Since the Token structure only contains a single heap reference, this constructor combines TypeMarker with TextValue in a heap object, but this is a hidden implementation detail; just use TypeMarker and TextValue(ICharSource) to retrieve the values.

◆ Token() [4/4]

Loyc.Syntax.Lexing.Token.Token	(	int	type,
		int	startIndex,
		UString	sourceText,
		NodeStyle	style,
		object	valueOrTypeMarker,
		UString	textValue
	)

inline

Initializes a kind of token designed to store two parts of a literal (see the Remarks for details).

Parameters

type	Value of TypeInt
startIndex	Value of StartIndex
sourceText	A substring of the token in the original source file (something returned from ICharSource.Slice(int, int)), such that Length will be `sourceText.Length` and SourceText(ICharSource) will return this same string if it is correctly given the same ICharSource object.
style	Value of Style.
valueOrTypeMarker	Value of TypeMarker if you are creating an uninterpreted literal or Value if you are not (according to the textValue parameter.)
textValue	If this Token does NOT represent an uninterpreted literal, this parameter must be default(UString). In any case, this parameter will become the value of TextValue(ICharSource) if that method is correctly given the same ICharSource object from which `sourceText` was extracted.

As explained in the documentation of the other constructor (Token(int, int, UString, NodeStyle, Symbol, int, int)), some literals have two parts which we call the Value and the TextValue. This constructor is designed to be used when the TextValue is sometimes a substring of the source code and sometimes merely derived from the source code. For example, given the literal "Hello", the correct TextValue is the five characters Hello, but given the C literal "Hi!\n", you may wish to translate the escape characters in the lexer, and create a Token that refers to the four decoded characters Hi!\n (where \n represents a newline) rather than the five characters of Hi!\n in the original source code.

This constructor uses memory intelligently. If textValue is a substring of sourceText, or if textValue.Length is zero, it will avoid allocating memory for a reference to textValue (the optimization is described in more detail in the other constructor's documentation.)

References Loyc.UString.InternalString, Loyc.UString.IsNull, and Loyc.UString.Length.

Member Function Documentation

◆ Equals()

bool Loyc.Syntax.Lexing.Token.Equals ( Token other )

inline

Equality depends on TypeInt and Value, but not StartIndex and Length (this is the same equality condition as LNode).
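
The equality rule can be illustrated with a small stand-in type. TokenSketch below is a hypothetical struct holding the four fields from the Remarks, not the real Token, and its hash function is just one simple choice that is consistent with this rule.

```csharp
using System;

// Stand-in with the four fields from the Remarks; not the real Token.
public struct TokenSketch : IEquatable<TokenSketch>
{
    public int TypeInt, StartIndex, Length;
    public object Value;

    public TokenSketch(int type, int startIndex, int length, object value)
    { TypeInt = type; StartIndex = startIndex; Length = length; Value = value; }

    // Position is ignored: only the type and the value take part.
    public bool Equals(TokenSketch other)
        => TypeInt == other.TypeInt && object.Equals(Value, other.Value);

    public override bool Equals(object obj) => obj is TokenSketch t && Equals(t);

    // The hash must agree with Equals, so it also ignores position.
    public override int GetHashCode()
        => TypeInt ^ (Value == null ? 0 : Value.GetHashCode());
}

public static class EqualityDemo
{
    public static void Main()
    {
        var a = new TokenSketch(1, 0, 3, "foo");
        var b = new TokenSketch(1, 40, 3, "foo"); // same token at another position
        Console.WriteLine(a.Equals(b));
    }
}
```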

References Loyc.Syntax.Lexing.Token.TypeInt, and Loyc.Syntax.Lexing.Token.Value.

◆ Is()

bool Loyc.Syntax.Lexing.Token.Is	(	int	type,
		object	value
	)

Returns true if the specified type and value match this token.

◆ Range()

SourceRange Loyc.Syntax.Lexing.Token.Range ( ISourceFile sf )

inline

Gets the SourceRange of a token, under the assumption that the token came from the specified source file.

References Loyc.Syntax.Lexing.Token.Length, and Loyc.Syntax.Lexing.Token.StartIndex.

Referenced by Loyc.Ecs.Parser.EcsTriviaInjector.MakeTriviaAttribute(), Loyc.Syntax.StandardTriviaInjector.MakeTriviaAttribute(), and Loyc.Syntax.LNodeFactory.UninterpretedLiteral().

◆ SourceText()

UString Loyc.Syntax.Lexing.Token.SourceText ( ICharSource chars )

inline

Gets the original source text for a token if available, under the assumption that the specified source file correctly specifies where the token came from. If the token is synthetic, returns UString.Null.

References Loyc.Syntax.Lexing.Token.Length, Loyc.Collections.ICharSource.Slice(), and Loyc.Syntax.Lexing.Token.StartIndex.

◆ TextValue()

UString Loyc.Syntax.Lexing.Token.TextValue ( ICharSource source )

inline

Helps get the "text value" from tokens that used one of the constructors designed to support this use case, e.g. Token(int type, int startIndex, UString sourceText, NodeStyle style, Symbol typeMarker, int substringStart, int substringEnd). If one of the other constructors was used, this function returns the same value as SourceText(ICharSource).

Parameters

source	Original source code or lexer from which this token was derived.

References Loyc.Syntax.Lexing.Token.Length, Loyc.Collections.ICharSource.Slice(), and Loyc.Syntax.Lexing.Token.StartIndex.

Referenced by Loyc.Syntax.LNodeFactory.Literal(), and Loyc.Syntax.LNodeFactory.UninterpretedLiteral().

◆ ToString() [1/2]

override string Loyc.Syntax.Lexing.Token.ToString ( )

Reconstructs a string that represents the token, if possible. Does not work for whitespace and comments, because the value of these token types is stored in the original source file and for performance reasons is not copied to the token.

This does not return the original source text; it uses the stringizer in ToStringStrategy, which can be overridden with language-specific behavior by calling SetToStringStrategy.

The returned string, in general, will not match the original token, since the ToStringStrategy does not have access to the original source file.

◆ ToString() [2/2]

string Loyc.Syntax.Lexing.Token.ToString ( ICharSource sourceText )

Gets the original text of the token, if you provide a reference to the original source code text. Note: the method used to convert the token to a string can be overridden with SetToStringStrategy.

Member Data Documentation

◆ Length

int Loyc.Syntax.Lexing.Token.Length => _length

Length of the token in the source file, or 0 for a synthetic or implied token.

Referenced by Loyc.Syntax.LNodeFactory.LiteralFromValueOf(), Loyc.Syntax.StandardTriviaInjector.MakeTriviaAttribute(), Loyc.Syntax.Lexing.Token.Range(), Loyc.Syntax.Lexing.Token.SourceText(), Loyc.Syntax.Lexing.Token.TextValue(), Loyc.Syntax.Lexing.TokenTree.TokenToLNode(), Loyc.Syntax.Les.TokenExt.ToString(), and Loyc.Ecs.Parser.TokenExt.ToString().

◆ StartIndex

int Loyc.Syntax.Lexing.Token.StartIndex => _startIndex

Location in the original source file where the token starts, or -1 for a synthetic token.

Referenced by Loyc.Syntax.LNodeFactory.LiteralFromValueOf(), Loyc.Syntax.StandardTriviaInjector.MakeTriviaAttribute(), Loyc.Syntax.Lexing.Token.Range(), Loyc.Syntax.Lexing.Token.SourceText(), Loyc.Syntax.Lexing.Token.TextValue(), Loyc.Syntax.Lexing.TokenTree.TokenToLNode(), Loyc.Syntax.Les.TokenExt.ToString(), and Loyc.Ecs.Parser.TokenExt.ToString().

◆ Style

NodeStyle Loyc.Syntax.Lexing.Token.Style => (NodeStyle)_stuff

8 bits of nonsemantic information about the token. The style is used to distinguish hex literals from decimal literals, or triple-quoted strings from double-quoted strings.

Referenced by Loyc.Syntax.LNodeFactory.LiteralFromValueOf(), Loyc.Syntax.Lexing.TokenTree.TokenToLNode(), Loyc.Syntax.Les.TokenExt.ToString(), and Loyc.Ecs.Parser.TokenExt.ToString().

◆ TypeInt

int Loyc.Syntax.Lexing.Token.TypeInt => _typeInt

Token type.

Referenced by Loyc.Syntax.Lexing.Token.Equals(), Loyc.Syntax.StandardTriviaInjector.MakeTriviaAttribute(), Loyc.Syntax.Les.TokenExt.ToString(), Loyc.Ecs.Parser.TokenExt.ToString(), Loyc.Syntax.Les.TokenExt.Type(), and Loyc.Ecs.Parser.TokenExt.Type().

◆ Value

object Loyc.Syntax.Lexing.Token.Value => IsUninterpretedLiteral ? null : _value

The parsed value of the token, if this structure was initialized with one of the constructors that accepts a value.

Recommended ways to use this property:

For strings: the parsed value of the string (no quotes, escape sequences removed), i.e. a boxed char or a string. A backquoted string in EC#/LES is converted to a Symbol because it is a kind of operator.
For numbers: the parsed value of the number (e.g. 4 => int, 4L => long, 4.0f => float)
For identifiers: the parsed name of the identifier, as a Symbol (e.g. x => x, @for => for, 1+1 => 1+1)
For any keyword including AttrKeyword and TypeKeyword tokens: a Symbol containing the name of the keyword, with "#" prefix
For punctuation and operators: the text of the punctuation as a Symbol.
For openers (open paren, open brace, etc.): null for normal linear parsers. If the tokens have been processed by TokensToTree, this will be a TokenTree.
For spaces and comments: for performance reasons, it is not recommended to extract the text of whitespace from the source file; instead, use WhitespaceTag.Value
When no value is needed (because the Type() is enough): null

Referenced by Loyc.Syntax.Lexing.Token.Equals(), Loyc.Syntax.LNodeFactory.LiteralFromValueOf(), Loyc.Ecs.Parser.EcsTriviaInjector.MakeTriviaAttribute(), Loyc.Syntax.StandardTriviaInjector.MakeTriviaAttribute(), Loyc.Syntax.Lexing.TokenTree.TokenToLNode(), Loyc.Syntax.Les.TokenExt.ToString(), and Loyc.Ecs.Parser.TokenExt.ToString().

Property Documentation

◆ Children

TokenTree Loyc.Syntax.Lexing.Token.Children

get

Returns Value as TokenTree (null if not a TokenTree).

Referenced by Loyc.Syntax.Lexing.TokenTree.TokenToLNode().

◆ EndIndex

int Loyc.Syntax.Lexing.Token.EndIndex

get

Returns StartIndex + Length.

Referenced by Loyc.Syntax.Lexing.TokenTree.TokenToLNode(), Loyc.Syntax.Les.TokenExt.ToString(), and Loyc.Ecs.Parser.TokenExt.ToString().

◆ IsWhitespace

bool Loyc.Syntax.Lexing.Token.IsWhitespace

get

Returns true if Value == WhitespaceTag.Value.

◆ Kind

TokenKind Loyc.Syntax.Lexing.Token.Kind

get

Token category. This value is only meaningful if the token type integers are based on TokenKinds. Token types for LES and Enhanced C# are, indeed, based on TokenKind.

Referenced by Loyc.Syntax.Lexing.TokenTree.TokenToLNode(), Loyc.Syntax.Les.TokenExt.ToString(), and Loyc.Ecs.Parser.TokenExt.ToString().

◆ ToStringStrategy

Func<Token, ICharSource, string>?? Loyc.Syntax.Lexing.Token.ToStringStrategy

static get set

Gets or sets the strategy used by ToString.

◆ TypeMarker

Symbol Loyc.Syntax.Lexing.Token.TypeMarker

get

Gets the type marker stored in this token, if this token was initialized with one of the constructors that accepts a type marker.

Referenced by Loyc.Syntax.LNodeFactory.Literal(), and Loyc.Syntax.LNodeFactory.UninterpretedLiteral().
