Enhanced C#
Language of your choice: library documentation
A common token type recommended for Loyc languages that want to use features such as token literals or the TokensToTree class.
For performance reasons, a Token ought to be a structure rather than a class. But if Token is a struct, we have a conundrum: how do we support tokens from different languages? (We can't use inheritance in structs.)
Luckily, tokens in most languages are very similar. A four-word structure generally suffices, holding the token type (TypeInt), the Value, the StartIndex and the Length. Each language can represent its token types with a different enum. All enums can be converted to an integer, so Token uses Int32 as the token type. In order to support DSLs via token literals (e.g. LLLPG is a DSL inside EC#), the TypeInt should be based on TokenKind. Since 64-bit platforms are very common, the Value is 64 bits, and padding increases the structure size from 16 bytes to 24. Given this reality, it was decided to fill in the 4 bytes of padding with additional information: the 8-bit Style, the IsUninterpretedLiteral flag, and the two 8-bit substring offsets used by TextValue (see Stuff).
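To make the "fill the padding" idea concrete, here is a minimal sketch of packing several small fields into one 32-bit integer. Only two details come from this page: the flag mask 0x01000000 and the fact that Style is read from the low 8 bits. The bit positions chosen for the two offsets, and the class name TokenStuffSketch, are assumptions made for illustration; this is not Loyc's actual Stuff implementation.

```csharp
using System;

// Hypothetical stand-in for the packed 32-bit field described above.
// Assumed layout: bits 0-7 style, bits 8-15 offset from start,
// bits 16-23 offset from end, bit 24 "uninterpreted literal" flag.
public static class TokenStuffSketch
{
    const int UninterpretedLiteralFlag = 0x01000000;

    public static int Pack(byte style, byte offsetFromStart, byte offsetFromEnd,
                           bool isUninterpretedLiteral)
    {
        return style
             | (offsetFromStart << 8)
             | (offsetFromEnd << 16)
             | (isUninterpretedLiteral ? UninterpretedLiteralFlag : 0);
    }

    // Each accessor undoes the shift and keeps only the low 8 bits.
    public static byte Style(int stuff) => (byte)stuff;
    public static byte OffsetFromStart(int stuff) => (byte)(stuff >> 8);
    public static byte OffsetFromEnd(int stuff) => (byte)(stuff >> 16);

    public static bool IsUninterpretedLiteral(int stuff) =>
        (stuff & UninterpretedLiteralFlag) != 0;
}
```

Since all four pieces fit in 25 bits, they occupy space that padding would otherwise waste, so the structure stays at 24 bytes on 64-bit platforms.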
To save space (and because .NET doesn't handle large structures well), tokens do not know what source file they came from and cannot convert their location to a line number. For this reason, one should keep a reference to the ISourceFile separately. You can then call SourceText(ISourceFile.Text) to get the original source text, or IIndexToLine.IndexToLine(int) to get the source line number.
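The idea of keeping location lookup outside the token can be sketched as follows. LineIndexSketch is a hypothetical helper written for this example, not Loyc's IIndexToLine implementation; it assumes lines end with '\n' and reports 1-based line and column numbers.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: maps a character index (such as a token's
// StartIndex) to a line and column, using the source text that the
// caller keeps separately from the tokens.
public sealed class LineIndexSketch
{
    // Index of the first character of each line, in ascending order.
    private readonly List<int> _lineStarts = new List<int> { 0 };

    public LineIndexSketch(string text)
    {
        for (int i = 0; i < text.Length; i++)
            if (text[i] == '\n')
                _lineStarts.Add(i + 1);
    }

    // Returns the 1-based line and column of the given index.
    public (int Line, int Column) IndexToLine(int index)
    {
        if (index < 0)
            throw new ArgumentOutOfRangeException(nameof(index),
                "Synthetic tokens (StartIndex -1) have no source location.");
        int i = _lineStarts.BinarySearch(index);
        if (i < 0)
            i = ~i - 1; // not an exact line start: take the preceding one
        return (i + 1, index - _lineStarts[i] + 1);
    }
}
```

A parser would build one such index per source file and pass token.StartIndex to it only when a location is actually needed, for example when reporting an error.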
A generic token also cannot convert itself to a properly-formatted string. The ToString method does allow you to provide an optional reference to ICharSource which allows the token to get its original text, and in any case you can call SetToStringStrategy to control the method by which a token converts itself to a string.
Fun fact: originally I planned to use Symbol as the common token type, because it is extensible and could nicely represent tokens in all languages; unfortunately, Symbol may reduce parsing performance because it cannot be used with the switch opcode (i.e. the switch statement in C#), so I decided to indicate token types via integers instead. Each language should have, in the namespace of that language, an extension method public static TokenType Type(this Token t) that converts the TypeInt to the enum type for that language. Optionally, the TokenType enum for your language can be based on TokenKind so that the Kind property returns a meaningful value.
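A minimal sketch of such an extension method is shown below. To keep the example self-contained, SketchToken stands in for Token (only TypeInt is modelled), and the CalcTokenType enum and its values are invented for an imaginary language; a real language would base the enum values on TokenKind as described above.

```csharp
using System;

// Stand-in for Loyc's Token, reduced to the one member this sketch needs.
public readonly struct SketchToken
{
    public int TypeInt { get; }
    public SketchToken(int typeInt) { TypeInt = typeInt; }
}

// Hypothetical token types for an imaginary language "Calc".
public enum CalcTokenType { EOF = 0, Id = 1, Number = 2, Plus = 3 }

public static class CalcTokenExt
{
    // Converts the integer token type to the language's own enum.
    public static CalcTokenType Type(this SketchToken t) =>
        (CalcTokenType)t.TypeInt;

    // Because the enum is integer-based, it can be used in a switch statement.
    public static string Describe(SketchToken t)
    {
        switch (t.Type())
        {
            case CalcTokenType.Id:     return "identifier";
            case CalcTokenType.Number: return "number";
            case CalcTokenType.Plus:   return "operator";
            default:                   return "other";
        }
    }
}
```

Placing the extension method in the language's own namespace means each parser sees the Type() overload for its own enum without any change to the shared token structure.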
Public fields
int | TypeInt => _typeInt
Token type.
int | StartIndex => _startIndex
Location in the original source file where the token starts, or -1 for a synthetic token.
int | Length => _length
Length of the token in the source file, or 0 for a synthetic or implied token.
object | Value => IsUninterpretedLiteral ? null : _value
The parsed value of the token, if this structure was initialized with one of the constructors that accepts a value.
bool | IsUninterpretedLiteral => (_stuff & 0x01000000) != 0
NodeStyle | Style => (NodeStyle)_stuff
8 bits of nonsemantic information about the token. The style is used to distinguish hex literals from decimal literals, or triple-quoted strings from double-quoted strings.
int ISimpleToken< int >. | Type => TypeInt
Public static fields
static readonly ThreadLocalVariable< Func< Token, ICharSource, string > > | ToStringStrategyTLV = new ThreadLocalVariable<Func<Token,ICharSource,string>>(Loyc.Syntax.Les.TokenExt.ToString)
Properties
TokenKind | Kind [get]
Token category. This value is only meaningful if the token type integers are based on TokenKinds. Token types for LES and Enhanced C# are, indeed, based on TokenKind.
Symbol | TypeMarker [get]
Gets the type marker stored in this token, if this token was initialized with one of the constructors that accepts a type marker.
TokenTree | Children [get]
Returns Value as TokenTree (null if not a TokenTree).
int | EndIndex [get]
Returns StartIndex + Length.
bool | IsWhitespace [get]
Returns true if Value == WhitespaceTag.Value.
static Func< Token, ICharSource, string > | ToStringStrategy [get, set]
Gets or sets the strategy used by ToString.
Token | this[int index] [get]
int? | Count [get]
IListSource< IToken< int > > IToken< int >. | Children [get]
Properties inherited from Loyc.Syntax.Lexing.IToken< int >
int | Length [get]
TokenKind | Kind [get]
IListSource< IToken< TT > > | Children [get]
Public Member Functions
Token (int type, int startIndex, int length, NodeStyle style=0, object value=null)
Initializes the Token structure.
Token (int type, int startIndex, int length, object value)
Token (int type, int startIndex, UString sourceText, NodeStyle style, Symbol typeMarker, int substringStart, int substringEnd)
Initializes an "uninterpreted literal" token designed to store two parts of a literal without allocating extra memory (see the Remarks for details).
Token (int type, int startIndex, int length, NodeStyle style, Symbol typeMarker, UString textValue)
Initializes an "uninterpreted literal" token (see the Remarks).
Token (int type, int startIndex, UString sourceText, NodeStyle style, object valueOrTypeMarker, UString textValue)
Initializes a kind of token designed to store two parts of a literal (see the Remarks for details).
bool | Is (int type, object value)
Returns true if the specified type and value match this token.
SourceRange | Range (ISourceFile sf)
Gets the SourceRange of a token, under the assumption that the token came from the specified source file.
SourceRange | Range (ILexer< Token > l)
UString | SourceText (ICharSource chars)
Gets the original source text for a token if available, under the assumption that the specified source file correctly specifies where the token came from. If the token is synthetic, returns UString.Null.
UString | SourceText (ILexer< Token > l)
UString | TextValue (ICharSource source)
Helps get the "text value" from tokens that used one of the constructors designed to support this use case, e.g. Token(int type, int startIndex, UString sourceText, NodeStyle style, Symbol typeMarker, int substringStart, int substringEnd). If one of the other constructors was used, this function returns the same value as SourceText(ICharSource).
UString | TextValue (ILexer< Token > source)
override string | ToString ()
Reconstructs a string that represents the token, if possible. Does not work for whitespace and comments, because the value of these token types is stored in the original source file and for performance reasons is not copied to the token.
string | ToString (ICharSource sourceText)
Gets the original text of the token, if you provide a reference to the original source code text. Note: the method used to convert the token to a string can be overridden with SetToStringStrategy.
override bool | Equals (object obj)
bool | Equals (Token other)
Equality depends on TypeInt and Value, but not StartIndex and Length (this is the same equality condition as LNode).
override int | GetHashCode ()
Token | TryGet (int index, out bool fail)
IEnumerator< Token > | GetEnumerator ()
System.Collections.IEnumerator System.Collections.IEnumerable. | GetEnumerator ()
IRange< Token > IListSource< Token >. | Slice (int start, int count)
Slice_< Token > | Slice (int start, int count)
IToken< int > IToken< int >. | WithType (int type)
Token | WithType (int type)
IToken< int > IToken< int >. | WithValue (object value)
Token | WithValue (object value)
Token | WithRange (int startIndex, int endIndex)
Token | WithStartIndex (int startIndex)
IToken< int > ICloneable< IToken< int > >. | Clone ()
LNode | ToLNode (ISourceFile file)
Public Member Functions inherited from Loyc.Collections.IListSource< Token >
IRange< T > | Slice (int start, int count=int.MaxValue)
Returns a sub-range of this list.
Public Member Functions inherited from Loyc.Syntax.Lexing.IToken< int >
IToken< TT > | WithType (int type)
IToken< TT > | WithValue (object value)
Static Public Member Functions
static int | Stuff (NodeStyle style, byte substringOffset, byte substringOffsetFromEnd, bool isUninterpretedLiteral)
static SavedValue< Func< Token, ICharSource, string > > | SetToStringStrategy (Func< Token, ICharSource, string > newValue)
static bool | IsOpener (TokenKind tt)
static bool | IsCloser (TokenKind tt)
static bool | IsOpenerOrCloser (TokenKind tt)
static Symbol | GetParenPairSymbol (TokenKind k, TokenKind k2)
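Loyc.Syntax.Lexing.Token.Token (int type, int startIndex, UString sourceText, NodeStyle style, Symbol typeMarker, int substringStart, int substringEnd)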
Initializes an "uninterpreted literal" token designed to store two parts of a literal without allocating extra memory (see the Remarks for details).
type | Value of TypeInt |
startIndex | Value of StartIndex |
sourceText | A substring of the token in the original source file, such that Length will be sourceText.Length and sourceText.Substring(substringStart - startIndex, substringEnd - substringStart) will be returned from TextValue(ICharSource). For correct results, the ICharSource passed to TextValue later needs to represent the same string that was used to produce this parameter. |
style | Value of Style |
typeMarker | Value of TypeMarker. |
substringStart | Index where the TextValue starts in the source code; should be equal to or greater than startIndex. |
substringEnd | Index where the TextValue ends in the source code; should be equal to or less than startIndex + sourceText.Length. |
Literals in many languages can be broken into two textual parts: their type and their value. For example, in some languages you can write 123.5f, where "f" indicates that the floating-point value has a size of 32 bits. C++ strings have up to three parts, as in u"Hello"_UD: u indicates the character type (u = 16-bit Unicode) while _UD indicates that the string should be interpreted in a user-defined way. In LES3, all literals have two parts: value text and a type marker. For example, 123.5f has a text "123.5" and type marker "_f"; greeting"Hello" has text "Hello" and type marker "greeting"; and a simple number like 123 has text "123" and type marker "_".
This constructor allows you to represent up to two "values" in a single token without necessarily allocating memory for them, even though Tokens only contain a single heap reference. When calling this constructor, the second value, called the "TextValue", must be a substring of the token's original source text; for example, given the token "Hello", the tokenizer would use Hello as the TextValue. Rather than allocating a string "Hello" and storing it in the token, you can use this constructor to record the fact that the string Hello begins one character after the beginning of the token (substringStart = startIndex + 1) and ends one character before the end of the token (substringEnd = startIndex + sourceText.Length - 1). When using this constructor, the Token's Value property returns null; internally the value reference points to the type marker, which is returned from the TypeMarker property rather than Value.
Since a Token does not have a reference to its own source file (ISourceFile), the language parser will need to use the TextValue(ICharSource) method to retrieve the value text later.
Token is a small structure that allocates only 8 bits for the offset between the TextValue and the beginning/end of the sourceText (16 bits total). If the start offset is above 254, the TextValue is combined with the TypeMarker in a heap object of type Tuple<Symbol, UString>, but this is a hidden implementation detail.
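The offset arithmetic described in these remarks can be sketched as follows. TextValueSketch and its method names are hypothetical, plain strings stand in for ICharSource and UString, and the sketch assumes that the 254 limit applies to both offsets (the text above only mentions the start offset).

```csharp
using System;

// Hypothetical illustration of storing a TextValue as two small offsets
// relative to the token's own start and end, instead of as a new string.
public static class TextValueSketch
{
    // Assumed largest offset that fits in the 8-bit field.
    const int MaxPackedOffset = 254;

    // Converts absolute source indexes into offsets relative to the token.
    // Returns false if the offsets are invalid or too large to pack.
    public static bool TryGetOffsets(int startIndex, int length,
                                     int substringStart, int substringEnd,
                                     out byte fromStart, out byte fromEnd)
    {
        int a = substringStart - startIndex;
        int b = (startIndex + length) - substringEnd;
        bool ok = a >= 0 && b >= 0 && a + b <= length
               && a <= MaxPackedOffset && b <= MaxPackedOffset;
        fromStart = ok ? (byte)a : (byte)0;
        fromEnd = ok ? (byte)b : (byte)0;
        return ok;
    }

    // Recovers the text value later, given the original source text.
    public static string TextValue(string source, int startIndex, int length,
                                   byte fromStart, byte fromEnd)
    {
        return source.Substring(startIndex + fromStart,
                                length - fromStart - fromEnd);
    }
}
```

For the source text x = "Hello"; the string token starts at index 4 and has length 7, so substringStart is 5 and substringEnd is 10; both offsets come out as 1, and slicing the source with them yields Hello without a separate string ever being stored in the token.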
For strings that contain escape sequences, such as "Hello\n", you may prefer to store a parsed version of the string in the Token. There is another constructor for this purpose, which always allocates memory: Token(int, int, int, NodeStyle, Symbol, UString).
References Loyc.UString.Length, and Loyc.UString.Slice().
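Loyc.Syntax.Lexing.Token.Token (int type, int startIndex, int length, NodeStyle style, Symbol typeMarker, UString textValue)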
Initializes an "uninterpreted literal" token (see the Remarks).
type | Value of TypeInt |
startIndex | Value of StartIndex |
length | Value of Length |
style | Value of Style. |
typeMarker | Value of TypeMarker. |
textValue | Value returned from TextValue(ICharSource). |
As explained in the documentation of the other constructor (Token(int, int, UString, NodeStyle, Symbol, int, int)), some literals have two parts which we call the TypeMarker and the TextValue. Since the Token structure only contains a single heap reference, this constructor combines TypeMarker with TextValue in a heap object, but this is a hidden implementation detail; just use TypeMarker and TextValue(ICharSource) to retrieve the values.
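Loyc.Syntax.Lexing.Token.Token (int type, int startIndex, UString sourceText, NodeStyle style, object valueOrTypeMarker, UString textValue)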
Initializes a kind of token designed to store two parts of a literal (see the Remarks for details).
type | Value of TypeInt |
startIndex | Value of StartIndex |
sourceText | A substring of the token in the original source file (something returned from ICharSource.Slice(int, int)), such that Length will be sourceText.Length and SourceText(ICharSource) will return this same string if it is correctly given the same ICharSource object. |
style | Value of Style. |
valueOrTypeMarker | Value of TypeMarker if you are creating an uninterpreted literal or Value if you are not (according to the textValue parameter.) |
textValue | If this Token does NOT represent an uninterpreted literal, this parameter must be default(UString). In any case, this parameter will become the value of TextValue(ICharSource) if that method is correctly given the same ICharSource object from which sourceText was extracted. |
As explained in the documentation of the other constructor (Token(int, int, UString, NodeStyle, Symbol, int, int)), some literals have two parts which we call the Value and the TextValue. This constructor is designed to be used when the TextValue is sometimes a substring of the source code and sometimes merely derived from the source code. For example, given the literal "Hello", the correct TextValue is the five characters Hello, but given the C literal "Hi!\n", you may wish to translate the escape characters in the lexer, and create a Token that refers to the four decoded characters (H, i, ! and a newline) rather than the five characters Hi!\n in the original source code.
This constructor uses memory intelligently. If textValue is a substring of sourceText, or if textValue.Length is zero, it will avoid allocating memory for a reference to textValue (the optimization is described in more detail in the other constructor's documentation).
References Loyc.UString.InternalString, Loyc.UString.IsNull, and Loyc.UString.Length.
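One plausible way to decide whether a text value is a substring of the source text is to compare the underlying string references and index ranges, as sketched below. SliceSketch is a hypothetical stand-in for UString (a string reference plus a start and length), and the check shown is an assumption about how such a test could work, not a description of the constructor's actual code.

```csharp
using System;

// Hypothetical stand-in for a string slice: a reference to a whole string
// plus the range of characters the slice covers.
public readonly struct SliceSketch
{
    public string InternalString { get; }
    public int Start { get; }
    public int Length { get; }

    public SliceSketch(string s, int start, int length)
    {
        InternalString = s; Start = start; Length = length;
    }

    public override string ToString() => InternalString.Substring(Start, Length);
}

public static class SubstringCheckSketch
{
    // Returns true when `inner` can be described by two offsets into `outer`,
    // so that no separate reference to `inner` needs to be stored.
    public static bool TryGetOffsets(SliceSketch outer, SliceSketch inner,
                                     out int fromStart, out int fromEnd)
    {
        if (inner.Length == 0)
        {
            // An empty text value can always be encoded: skip the whole token.
            fromStart = 0;
            fromEnd = outer.Length;
            return true;
        }
        int outerEnd = outer.Start + outer.Length;
        int innerEnd = inner.Start + inner.Length;
        bool sameString = ReferenceEquals(outer.InternalString, inner.InternalString);
        bool ok = sameString && inner.Start >= outer.Start && innerEnd <= outerEnd;
        fromStart = ok ? inner.Start - outer.Start : 0;
        fromEnd = ok ? outerEnd - innerEnd : 0;
        return ok;
    }
}
```

A decoded string such as the four characters of Hi! plus a newline lives in a different string object than the source text, so the check fails and the caller would fall back to storing a reference to it.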
bool Loyc.Syntax.Lexing.Token.Equals (Token other)
Equality depends on TypeInt and Value, but not StartIndex and Length (this is the same equality condition as LNode).
References Loyc.Syntax.Lexing.Token.TypeInt, and Loyc.Syntax.Lexing.Token.Value.
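The equality rule can be illustrated with a small stand-in structure. EqSketchToken is hypothetical and models only the four basic fields; the hash code shown is one simple choice that stays consistent with Equals, not necessarily the one Token uses.

```csharp
using System;

// Hypothetical stand-in showing equality based on type and value only.
public readonly struct EqSketchToken : IEquatable<EqSketchToken>
{
    public int TypeInt { get; }
    public object Value { get; }
    public int StartIndex { get; }
    public int Length { get; }

    public EqSketchToken(int typeInt, object value, int startIndex, int length)
    {
        TypeInt = typeInt; Value = value; StartIndex = startIndex; Length = length;
    }

    // StartIndex and Length are deliberately ignored.
    public bool Equals(EqSketchToken other) =>
        TypeInt == other.TypeInt && object.Equals(Value, other.Value);

    public override bool Equals(object obj) =>
        obj is EqSketchToken t && Equals(t);

    // Must use only the fields that Equals compares.
    public override int GetHashCode() =>
        TypeInt ^ (Value?.GetHashCode() ?? 0);
}
```

With this rule, two occurrences of the same identifier at different positions in a file compare equal, which is convenient when matching token sequences regardless of where they appear.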
bool Loyc.Syntax.Lexing.Token.Is (int type, object value)
Returns true if the specified type and value match this token.
SourceRange Loyc.Syntax.Lexing.Token.Range (ISourceFile sf)
Gets the SourceRange of a token, under the assumption that the token came from the specified source file.
References Loyc.Syntax.Lexing.Token.Length, and Loyc.Syntax.Lexing.Token.StartIndex.
Referenced by Loyc.Ecs.Parser.EcsTriviaInjector.MakeTriviaAttribute(), Loyc.Syntax.StandardTriviaInjector.MakeTriviaAttribute(), and Loyc.Syntax.LNodeFactory.UninterpretedLiteral().
UString Loyc.Syntax.Lexing.Token.SourceText (ICharSource chars)
Gets the original source text for a token if available, under the assumption that the specified source file correctly specifies where the token came from. If the token is synthetic, returns UString.Null.
References Loyc.Syntax.Lexing.Token.Length, Loyc.Collections.ICharSource.Slice(), and Loyc.Syntax.Lexing.Token.StartIndex.
UString Loyc.Syntax.Lexing.Token.TextValue (ICharSource source)
Helps get the "text value" from tokens that used one of the constructors designed to support this use case, e.g. Token(int type, int startIndex, UString sourceText, NodeStyle style, Symbol typeMarker, int substringStart, int substringEnd). If one of the other constructors was used, this function returns the same value as SourceText(ICharSource).
chars | Original source code or lexer from which this token was derived. |
References Loyc.Syntax.Lexing.Token.Length, Loyc.Collections.ICharSource.Slice(), and Loyc.Syntax.Lexing.Token.StartIndex.
Referenced by Loyc.Syntax.LNodeFactory.Literal(), and Loyc.Syntax.LNodeFactory.UninterpretedLiteral().
override string Loyc.Syntax.Lexing.Token.ToString ()
Reconstructs a string that represents the token, if possible. Does not work for whitespace and comments, because the value of these token types is stored in the original source file and for performance reasons is not copied to the token.
This does not return the original source text; it uses the stringizer in ToStringStrategy, which can be overridden with language-specific behavior by calling SetToStringStrategy.
The returned string, in general, will not match the original token, since the ToStringStrategy does not have access to the original source file.
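The pattern of a per-thread, replaceable stringizer can be sketched as follows. Everything here is a hypothetical model: an int stands in for Token, StrategySketch for the static members of Token, and the sketch assumes that the value returned when setting a strategy restores the previous one when disposed.

```csharp
using System;
using System.Threading;

// Hypothetical model of a thread-local "to string" strategy that can be
// replaced temporarily and restored afterwards.
public static class StrategySketch
{
    static readonly ThreadLocal<Func<int, string>> _strategy =
        new ThreadLocal<Func<int, string>>(() => DefaultToString);

    static string DefaultToString(int typeInt) => "token#" + typeInt;

    // Installs a new strategy for the current thread and returns a handle
    // that puts the old strategy back when disposed.
    public static IDisposable SetStrategy(Func<int, string> newValue)
    {
        var old = _strategy.Value;
        _strategy.Value = newValue;
        return new Restore(old);
    }

    // Converts a token (here just its type) using the current strategy.
    public static string Stringize(int typeInt) => _strategy.Value(typeInt);

    sealed class Restore : IDisposable
    {
        readonly Func<int, string> _old;
        public Restore(Func<int, string> old) { _old = old; }
        public void Dispose() { _strategy.Value = _old; }
    }
}
```

A language-specific parser could wrap its work in a using block around SetStrategy so that tokens print in that language's style while it runs, without affecting other threads or later code.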
string Loyc.Syntax.Lexing.Token.ToString (ICharSource sourceText)
Gets the original text of the token, if you provide a reference to the original source code text. Note: the method used to convert the token to a string can be overridden with SetToStringStrategy.
int Loyc.Syntax.Lexing.Token.Length => _length
Length of the token in the source file, or 0 for a synthetic or implied token.
Referenced by Loyc.Syntax.LNodeFactory.LiteralFromValueOf(), Loyc.Syntax.StandardTriviaInjector.MakeTriviaAttribute(), Loyc.Syntax.Lexing.Token.Range(), Loyc.Syntax.Lexing.Token.SourceText(), Loyc.Syntax.Lexing.Token.TextValue(), Loyc.Syntax.Lexing.TokenTree.TokenToLNode(), Loyc.Syntax.Les.TokenExt.ToString(), and Loyc.Ecs.Parser.TokenExt.ToString().
int Loyc.Syntax.Lexing.Token.StartIndex => _startIndex
Location in the original source file where the token starts, or -1 for a synthetic token.
Referenced by Loyc.Syntax.LNodeFactory.LiteralFromValueOf(), Loyc.Syntax.StandardTriviaInjector.MakeTriviaAttribute(), Loyc.Syntax.Lexing.Token.Range(), Loyc.Syntax.Lexing.Token.SourceText(), Loyc.Syntax.Lexing.Token.TextValue(), Loyc.Syntax.Lexing.TokenTree.TokenToLNode(), Loyc.Syntax.Les.TokenExt.ToString(), and Loyc.Ecs.Parser.TokenExt.ToString().
NodeStyle Loyc.Syntax.Lexing.Token.Style => (NodeStyle)_stuff
8 bits of nonsemantic information about the token. The style is used to distinguish hex literals from decimal literals, or triple-quoted strings from double-quoted strings.
Referenced by Loyc.Syntax.LNodeFactory.LiteralFromValueOf(), Loyc.Syntax.Lexing.TokenTree.TokenToLNode(), Loyc.Syntax.Les.TokenExt.ToString(), and Loyc.Ecs.Parser.TokenExt.ToString().
int Loyc.Syntax.Lexing.Token.TypeInt => _typeInt
object Loyc.Syntax.Lexing.Token.Value => IsUninterpretedLiteral ? null : _value
The parsed value of the token, if this structure was initialized with one of the constructors that accepts a value.
Recommended ways to use this property include storing WhitespaceTag.Value for whitespace tokens (see IsWhitespace) and a TokenTree for tokens that have children (see Children).
Referenced by Loyc.Syntax.Lexing.Token.Equals(), Loyc.Syntax.LNodeFactory.LiteralFromValueOf(), Loyc.Ecs.Parser.EcsTriviaInjector.MakeTriviaAttribute(), Loyc.Syntax.StandardTriviaInjector.MakeTriviaAttribute(), Loyc.Syntax.Lexing.TokenTree.TokenToLNode(), Loyc.Syntax.Les.TokenExt.ToString(), and Loyc.Ecs.Parser.TokenExt.ToString().
TokenTree Loyc.Syntax.Lexing.Token.Children [get]
Returns Value as TokenTree (null if not a TokenTree).
Referenced by Loyc.Syntax.Lexing.TokenTree.TokenToLNode().
int Loyc.Syntax.Lexing.Token.EndIndex [get]
Returns StartIndex + Length.
Referenced by Loyc.Syntax.Lexing.TokenTree.TokenToLNode(), Loyc.Syntax.Les.TokenExt.ToString(), and Loyc.Ecs.Parser.TokenExt.ToString().
bool Loyc.Syntax.Lexing.Token.IsWhitespace [get]
Returns true if Value == WhitespaceTag.Value.
TokenKind Loyc.Syntax.Lexing.Token.Kind [get]
Token category. This value is only meaningful if the token type integers are based on TokenKinds. Token types for LES and Enhanced C# are, indeed, based on TokenKind.
Referenced by Loyc.Syntax.Lexing.TokenTree.TokenToLNode(), Loyc.Syntax.Les.TokenExt.ToString(), and Loyc.Ecs.Parser.TokenExt.ToString().
static Func< Token, ICharSource, string > Loyc.Syntax.Lexing.Token.ToStringStrategy [get, set]
Gets or sets the strategy used by ToString.
Symbol Loyc.Syntax.Lexing.Token.TypeMarker [get]
Gets the type marker stored in this token, if this token was initialized with one of the constructors that accepts a type marker.
Referenced by Loyc.Syntax.LNodeFactory.Literal(), and Loyc.Syntax.LNodeFactory.UninterpretedLiteral().