Enhanced C#
Language of your choice: library documentation
Loyc.Syntax.StandardLiteralHandlers Class Reference

A LiteralHandlerTable that is preinitialized with all standard literal parsers and printers.


Inheritance diagram for Loyc.Syntax.StandardLiteralHandlers:
Loyc.Syntax.LiteralHandlerTable, Loyc.Syntax.ILiteralParser, Loyc.Syntax.ILiteralPrinter

Remarks

A LiteralHandlerTable that is preinitialized with all standard literal parsers and printers.

The following types are fully supported:

There are also two general type markers, _ for a number of unspecified type, and _u for an unsigned number of unspecified size. In LES, any type marker that is not on this list is legal, but will be left uninterpreted by default; LNode.Value will return a string of type UString.

The syntax corresponding to each type marker is standardized, meaning that all implementations of LES2 and LES3 that can parse these types must do so in exactly the same way.

The Decimal and RegEx types have printers (with default type markers _m and re) but no parser, because these type markers are not standardized. It's worth noting that regular expressions that are valid in one language may be invalid in another, so to avoid parsing issues, re strings are not parsed to RegEx by default.

The type marker "null" represents null, and the only valid value for the null and void type markers is an empty string (""). The only valid values for the bool type marker are "true" and "false" and case variations of these (e.g. "TRUE", "faLSE").
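The bool rule above can be illustrated with a minimal sketch. It is not the parser used by this class; the class name BoolLiteralSketch and the method TryParseBool are hypothetical, and only the accepted spellings come from the text.

```csharp
using System;

public static class BoolLiteralSketch
{
    // Hypothetical helper: returns true or false for "true"/"false" in any
    // letter case, and null for every other input.
    public static bool? TryParseBool(string text)
    {
        if (string.Equals(text, "true", StringComparison.OrdinalIgnoreCase))
            return true;
        if (string.Equals(text, "false", StringComparison.OrdinalIgnoreCase))
            return false;
        return null; // not a valid value for the bool type marker
    }
}
```

For example, BoolLiteralSketch.TryParseBool("faLSE") yields false, while "yes" and the empty string yield null.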

Character literals are special in .NET because they parse into one of two types, either Char or String, depending on whether the code point is less than 0x10000 or not. Code points of 0x10000 or greater, sometimes called "astral" characters, do not fit in the .NET Char type so String is used instead. The type marker "c" indicates that the literal is not really a String. The character parser does return an error if the input is not a single code point (i.e. two code units are fine if and only if they are a surrogate pair). However, the printer keyed to type string doesn't care if the type marker is "c" or not.
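As a rough illustration of the Char-versus-String split described above, the sketch below classifies a string using only standard .NET calls. The names CharLiteralSketch and ParseSingleCodePoint are hypothetical, and accepting a lone surrogate as a Char is an assumption that the text does not address.

```csharp
using System;

public static class CharLiteralSketch
{
    // Hypothetical helper. Returns a boxed char for a single UTF-16 code unit,
    // the original string for a surrogate pair (an "astral" code point),
    // or null when the input is not exactly one code point.
    public static object ParseSingleCodePoint(string text)
    {
        if (text.Length == 1)
            return text[0]; // assumption: a lone surrogate is accepted as a char
        if (text.Length == 2 && char.IsSurrogatePair(text[0], text[1]))
            return text;    // code point >= 0x10000 does not fit in Char
        return null;        // empty, or more than one code point
    }
}
```

With this sketch, "A" produces a Char, "\U0001F600" produces a String of two code units, and "AB" is rejected.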

Type markers begin with an underscore (_) for numeric types only. The underscore enables special syntax in LES2 and LES3. For example, in LES3, 12345z is equivalent to _z"12345", but strings like re"123" can never be printed in numeric form because their type marker does not start with an underscore.

The syntax of all integer types corresponds to the following case-insensitive regex:

/^[\-\u2212]?({Digits}|0x{HexDigits}|0b{BinDigits})$/

where {Digits} means "[_']*[0-9][0-9_']*", {HexDigits} means "[_']*[0-9a-f][0-9a-f_']*", and {BinDigits} means "[_']*[01][01_']*". While negative numbers can be indicated with the usual dash character '-', the minus character '\x2212' is also allowed. Numbers cannot contain spaces. Parsers will fail in case of overflow, but a BigInteger cannot overflow.
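The integer pattern can be tried out with the standard .NET Regex class. The sketch below only checks syntax (it does not convert the text to a number or detect overflow), and the names IntegerSyntaxSketch and LooksLikeInteger are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class IntegerSyntaxSketch
{
    // Building blocks copied from the definitions above.
    const string Digits    = @"[_']*[0-9][0-9_']*";
    const string HexDigits = @"[_']*[0-9a-f][0-9a-f_']*";
    const string BinDigits = @"[_']*[01][01_']*";

    // The integer regex with the placeholders substituted.
    static readonly Regex IntegerRegex = new Regex(
        @"^[\-\u2212]?(" + Digits + "|0x" + HexDigits + "|0b" + BinDigits + ")$",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    // True if the text has the shape of an integer literal.
    public static bool LooksLikeInteger(string text) => IntegerRegex.IsMatch(text);
}
```

For example, "1_000_000", "0b1010'1010" and "\u22120xFF" (with the minus character) match, while "12 3", "0x" and "1.5" do not.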

The syntax of a floating-point type corresponds to one of the following case-insensitive regexes for decimal, binary, hexadecimal and non-numbers respectively:

/^[\-\u2212]?({Digits}(\.[0-9_']*)?|\.{Digits})(e[+-]?{Digits})?$/
/^[\-\u2212]?0b({BinDigits}(\.[01_']*)?|\.{BinDigits})(p[+-]?{BinDigits})?$/
/^[\-\u2212]?0x({HexDigits}(\.[0-9a-f_']*)?|\.{HexDigits})(p[+-]?{HexDigits})?$/
/nan|[\-\u2212]?inf/i

These require that any number contains at least one digit, but this digit can appear after the decimal point (.) if there is one. There can be any quantity of separator characters, but no spaces. The final regex allows NaN and infinities to be parsed and printed.
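The four floating-point patterns can be checked in the same way. This is a syntax-only sketch with hypothetical names (FloatSyntaxSketch, LooksLikeFloat). One assumption is labeled in the code: the NaN/infinity pattern is anchored here so that it must match the whole string, whereas the pattern above is written without anchors.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FloatSyntaxSketch
{
    const string Digits    = @"[_']*[0-9][0-9_']*";
    const string HexDigits = @"[_']*[0-9a-f][0-9a-f_']*";
    const string BinDigits = @"[_']*[01][01_']*";
    const RegexOptions Opts = RegexOptions.IgnoreCase | RegexOptions.CultureInvariant;

    static readonly Regex[] Patterns =
    {
        // decimal
        new Regex(@"^[\-\u2212]?(" + Digits + @"(\.[0-9_']*)?|\." + Digits
                  + @")(e[+-]?" + Digits + ")?$", Opts),
        // binary
        new Regex(@"^[\-\u2212]?0b(" + BinDigits + @"(\.[01_']*)?|\." + BinDigits
                  + @")(p[+-]?" + BinDigits + ")?$", Opts),
        // hexadecimal
        new Regex(@"^[\-\u2212]?0x(" + HexDigits + @"(\.[0-9a-f_']*)?|\." + HexDigits
                  + @")(p[+-]?" + HexDigits + ")?$", Opts),
        // non-numbers (assumption: anchored to the whole string)
        new Regex(@"^(nan|[\-\u2212]?inf)$", Opts),
    };

    // True if the text matches any of the four floating-point patterns.
    public static bool LooksLikeFloat(string text)
    {
        foreach (Regex pattern in Patterns)
            if (pattern.IsMatch(text))
                return true;
        return false;
    }
}
```

Inputs such as "1.5e-3", ".5", "0b1.01", "0x1.8p1", "NaN" and "-inf" match, and so does the plain integer "12345". A lone "." or a dangling exponent such as "1e" does not.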

Given these patterns, integers are also detected as floating-point numbers. If a floating-point number is printed without a type marker, or if the type marker is _, the printer recognizes that it must add the suffix ".0" if necessary so that the number will be treated as floating point when it is parsed again later.
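The ".0" rule can be sketched as a post-processing step on text that has already been printed. This is an illustration under stated assumptions rather than the printer's actual code: the names FloatSuffixSketch and EnsureFloatSyntax are hypothetical, and only decimal text is considered.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FloatSuffixSketch
{
    // Matches text that would be read back as a decimal integer.
    static readonly Regex DecimalInteger =
        new Regex(@"^[\-\u2212]?[_']*[0-9][0-9_']*$", RegexOptions.CultureInvariant);

    // Hypothetical helper: appends ".0" when the printed text of a
    // floating-point value is indistinguishable from an integer.
    public static string EnsureFloatSyntax(string printed)
    {
        return DecimalInteger.IsMatch(printed) ? printed + ".0" : printed;
    }
}
```

Here "3" becomes "3.0" and "-4" becomes "-4.0", while "2.5", "1e20" and "nan" are returned unchanged because they cannot be mistaken for integers.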

Finally, I will point out that LES3 parsers and printers have some special behavior related to string parsing and printing in environments that use UTF-16. There are two kinds of numeric escape sequences in LES3 strings and identifiers: two-digit sequences which look like \xFF, and Unicode escape sequences with four to six digits, i.e. \u1234 or \U012345. LES3 strings and identifiers must be interpreted as byte sequences, with \x escape sequences representing raw bytes and \u escape sequences representing proper UTF-8 characters.

For UTF-16 environments like .NET, Loyc defines a reversible (lossless) transformation from these byte sequences into UTF-16. \x sequences below \x80 are treated as normal ASCII characters, while \x sequences above \x7F represent raw bytes that may or may not be valid UTF-8. When a sequence is valid UTF-8, e.g. "\xE2\x80\xA2", it is translated into the appropriate UTF-16 character (in this case "•" or "\u2022"). Otherwise, the byte becomes an invalid (unmatched surrogate) code unit in the range 0xDC80 to 0xDCFF, e.g. "\x99" becomes 0xDC99 inside a UTF-16 string.

However, this implies that the UTF-8 byte sequence "\xED\xB2\x99", which normally represents the same invalid surrogate \uDC99, cannot also be translated to 0xDC99 in UTF-16. Instead it is treated as three independent bytes which become a sequence of three UTF-16 code units, 0xDCED 0xDCB2 0xDC99. Upon translation back to UTF-8, these become "\xED\xB2\x99" as expected. Furthermore, an LES3 ASCII escape sequence like "\uDC99", which is equivalent to "\xED\xB2\x99", should actually produce three code units in a UTF-16 environment (printed as "\uDCED\uDCB2\uDC99" in C# or Enhanced C#).
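The byte-to-UTF-16 mapping can be approximated in a few dozen lines. The sketch below is not Loyc's implementation and all names in it (Les3ByteStringSketch, Decode, Encode, TryDecodeCodePoint) are hypothetical. It assumes that every UTF-8-style encoding of a surrogate (U+D800 to U+DFFF) is treated as raw bytes, which covers the U+DC99 case (bytes ED B2 99) but goes beyond what the text states. Lone surrogates outside 0xDC80 to 0xDCFF are not handled losslessly by Encode.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class Les3ByteStringSketch
{
    // Bytes -> UTF-16. Well-formed UTF-8 is decoded normally; any other byte b
    // becomes the lone low surrogate 0xDC00 + b (range 0xDC80..0xDCFF).
    public static string Decode(byte[] bytes)
    {
        var sb = new StringBuilder();
        int i = 0;
        while (i < bytes.Length)
        {
            int length = TryDecodeCodePoint(bytes, i, out int codePoint);
            if (length > 0)
            {
                sb.Append(char.ConvertFromUtf32(codePoint));
                i += length;
            }
            else
            {
                sb.Append((char)(0xDC00 | bytes[i]));
                i++;
            }
        }
        return sb.ToString();
    }

    // Returns the length of the well-formed UTF-8 sequence at `start`, or 0.
    static int TryDecodeCodePoint(byte[] b, int start, out int codePoint)
    {
        int first = b[start];
        codePoint = first;
        if (first < 0x80) return 1;
        int length, min;
        if ((first & 0xE0) == 0xC0) { length = 2; min = 0x80; codePoint = first & 0x1F; }
        else if ((first & 0xF0) == 0xE0) { length = 3; min = 0x800; codePoint = first & 0x0F; }
        else if ((first & 0xF8) == 0xF0) { length = 4; min = 0x10000; codePoint = first & 0x07; }
        else return 0; // stray continuation byte or invalid lead byte
        if (start + length > b.Length) return 0;
        for (int k = 1; k < length; k++)
        {
            if ((b[start + k] & 0xC0) != 0x80) return 0;
            codePoint = (codePoint << 6) | (b[start + k] & 0x3F);
        }
        bool surrogate = codePoint >= 0xD800 && codePoint <= 0xDFFF;
        // Overlong forms, out-of-range values and surrogates count as raw bytes.
        return (codePoint < min || codePoint > 0x10FFFF || surrogate) ? 0 : length;
    }

    // UTF-16 -> bytes: the reverse mapping.
    public static byte[] Encode(string text)
    {
        var bytes = new List<byte>();
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (char.IsHighSurrogate(c) && i + 1 < text.Length && char.IsLowSurrogate(text[i + 1]))
            {
                bytes.AddRange(Encoding.UTF8.GetBytes(text.Substring(i, 2)));
                i++; // consumed a full surrogate pair
            }
            else if (c >= 0xDC80 && c <= 0xDCFF)
                bytes.Add((byte)(c & 0xFF)); // lone surrogate -> raw byte
            else
                bytes.AddRange(Encoding.UTF8.GetBytes(c.ToString()));
        }
        return bytes.ToArray();
    }
}
```

With this sketch, the bytes E2 80 A2 decode to "\u2022", the single byte 99 decodes to "\uDC99", and the bytes ED B2 99 decode to "\uDCED\uDCB2\uDC99"; Encode turns each of these strings back into the original bytes.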

However, none of this trickery necessarily needs to be handled by the parsers and printers in this class, because the challenge appears at a different level. Namely, this trickery applies in the LES3 parser when ASCII escape sequences are converted to invalid surrogates in UTF-16, which happens before the string reaches a parser in this class. Also, some trickery may be done by the LES3 printer after a string is printed by this class. Finally, this special behavior applies only to UTF-16 environments (and UTF-8 environments like Rust that prohibit byte sequences that are not valid UTF-8). No special-case code is necessary in environments that use byte arrays for strings, because the purpose of this trickery is to allow LES3 strings to faithfully represent arbitrary byte sequences in addition to Unicode strings.

See also
ParseHelpers, PrintHelpers

Public static fields

static StandardLiteralHandlers Value => _value = _value ?? new StandardLiteralHandlers()
 

Properties

char? DigitSeparator [get, set]
 Gets or sets a character used to separate groups of digits. It must be _ or ' or null, and it is inserted every 3 digits in decimal numbers (e.g. 1_234_567), every 4 digits in hex numbers (e.g. 0x1234_5678), or every 8 digits in binary numbers (e.g. 11_10111000).
 

Public Member Functions

 StandardLiteralHandlers (char? digitSeparatorChar='_')
 
- Public Member Functions inherited from Loyc.Syntax.LiteralHandlerTable
bool AddParser (bool replaceExisting, Symbol typeMarker, Func< UString, Symbol, Either< object, LogMessage >> parser)
 Adds a parser to the Parsers collection.
 
bool AddPrinter (bool replaceExisting, Symbol type, Func< ILNode, StringBuilder, Either< Symbol, LogMessage >> printer)
 Adds a printer to the Printers collection.
 
bool AddPrinter (bool replaceExisting, Type type, Func< ILNode, StringBuilder, Either< Symbol, LogMessage >> printer)
 
bool CanParse (Symbol typeMarker)
 Returns true if there is a parser function for the given type marker. Never throws.
 
bool CanPrint (Symbol typeMarker)
 Returns true if there is a printer function for the given type marker. Never throws.
 
bool CanPrint (Type type, bool searchBases=true)
 Returns true if there is a printer function for the given type. Never throws.
 
Either< object, ILogMessage > TryParse (UString textValue, Symbol typeMarker)
 Attempts to parse a string with a given type marker.
 
Either< Symbol, ILogMessage > TryPrint (ILNode literal, StringBuilder sb)
 Searches Printers for a printer for the value and uses it to convert the value to a string. When a printer can be found both by type marker Symbol and by Type, the printer for the matching type marker is used (takes priority). The complete search order is (1) type marker (if any), (2) exact type, (3) base class and base interfaces, in that order, recursively, breadth-first.
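The search order described for TryPrint can be sketched with plain .NET reflection. This is an approximation, not the library's code: the names PrinterLookupSketch, LookupKeys and Find are hypothetical, a string stands in for the Symbol type marker, and because Type.GetInterfaces also returns inherited interfaces, the interface order may differ from the real traversal.

```csharp
using System;
using System.Collections.Generic;

public static class PrinterLookupSketch
{
    // Yields the lookup keys in priority order: (1) the type marker, if any,
    // (2) the exact type, (3) base classes and interfaces, breadth-first.
    public static IEnumerable<object> LookupKeys(string typeMarker, Type valueType)
    {
        if (typeMarker != null)
            yield return typeMarker;

        var seen = new HashSet<Type>();
        var queue = new Queue<Type>();
        queue.Enqueue(valueType);
        while (queue.Count > 0)
        {
            Type t = queue.Dequeue();
            if (!seen.Add(t))
                continue; // already visited through another path
            yield return t;
            if (t.BaseType != null)
                queue.Enqueue(t.BaseType);
            foreach (Type i in t.GetInterfaces())
                queue.Enqueue(i);
        }
    }

    // Returns the first printer registered under any of the keys, or null.
    public static TPrinter Find<TPrinter>(
        IReadOnlyDictionary<object, TPrinter> printers, string typeMarker, Type valueType)
        where TPrinter : class
    {
        foreach (object key in LookupKeys(typeMarker, valueType))
            if (printers.TryGetValue(key, out TPrinter printer))
                return printer;
        return null;
    }
}
```

With a table that has entries for "_i32", typeof(int) and typeof(object), a lookup for marker "_i32" and type int returns the marker entry, a lookup with no marker returns the int entry, and a lookup for a string value falls back to the object entry.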
 

Protected fields

int HexNegativeExponentThreshold = -8
 

Protected static fields

static Symbol __u = GSymbol.Get("_u")
 
static Symbol __i8 = GSymbol.Get("_i8")
 
static Symbol __u8 = GSymbol.Get("_u8")
 
static Symbol __i16 = GSymbol.Get("_i16")
 
static Symbol __u16 = GSymbol.Get("_u16")
 
static Symbol __i32 = GSymbol.Get("_i32")
 
static Symbol __u32 = GSymbol.Get("_u32")
 
static Symbol __i64 = GSymbol.Get("_i64")
 
static Symbol __u64 = GSymbol.Get("_u64")
 
static Symbol __z = GSymbol.Get("_z")
 
static Symbol __r32 = GSymbol.Get("_r32")
 
static Symbol __r64 = GSymbol.Get("_r64")
 
static Symbol __L = GSymbol.Get("_L")
 
static Symbol __uL = GSymbol.Get("_uL")
 
static Symbol __f = GSymbol.Get("_f")
 
static Symbol __d = GSymbol.Get("_d")
 
static Symbol _s = GSymbol.Get("s")
 
static Symbol _void = GSymbol.Get("void")
 
static Symbol _bool = GSymbol.Get("bool")
 
static Symbol _c = GSymbol.Get("c")
 
static Symbol _number = GSymbol.Get("_")
 
static Symbol _string = GSymbol.Empty
 
static Symbol _re = GSymbol.Get("re")
 

Additional Inherited Members

- Public fields inherited from Loyc.Syntax.LiteralHandlerTable
IReadOnlyDictionary< Symbol, Func< UString, Symbol, Either< object, LogMessage > > > Parsers => _parsers
 A table of parsers indexed by type marker Symbol. The AddParser method is used to add an item to this collection.
 
IReadOnlyDictionary< object, Func< ILNode, StringBuilder, Either< Symbol, LogMessage > > > Printers => _printers
 A table of printers indexed by Type or by type marker Symbol. The AddPrinter methods are used to add an item to this collection.
 

Property Documentation

◆ DigitSeparator

char? Loyc.Syntax.StandardLiteralHandlers.DigitSeparator
[get, set]

Gets or sets a character used to separate groups of digits. It must be _ or ' or null, and it is inserted every 3 digits in decimal numbers (e.g. 1_234_567), every 4 digits in hex numbers (e.g. 0x1234_5678), or every 8 digits in binary numbers (e.g. 11_10111000).
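The grouping rule can be illustrated with a small helper. It is a sketch of the described behavior, not the printer's code; the names DigitSeparatorSketch and Group are hypothetical, and the caller is assumed to pass the digits without a sign or a 0x/0b prefix.

```csharp
using System;
using System.Text;

public static class DigitSeparatorSketch
{
    // Inserts `separator` between groups of `groupSize` digits, counting
    // from the right. A null separator leaves the digits unchanged.
    public static string Group(string digits, int groupSize, char? separator)
    {
        if (separator == null || digits.Length <= groupSize)
            return digits;
        var sb = new StringBuilder();
        for (int i = 0; i < digits.Length; i++)
        {
            int remaining = digits.Length - i;
            if (i > 0 && remaining % groupSize == 0)
                sb.Append(separator.Value);
            sb.Append(digits[i]);
        }
        return sb.ToString();
    }
}
```

Group("1234567", 3, '_') gives "1_234_567", Group("12345678", 4, '_') gives "1234_5678", and Group("1110111000", 8, '_') gives "11_10111000", matching the examples in the property description.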