Stylus Studio XML Editor

F.1 Character Classes

A character class is an atom R that identifies a set of characters C(R). The set of strings L(R) denoted by a character class R contains one single-character string "c" for each character c in C(R).

Character Class
F.1    charClass   ::=   charClassEsc | charClassExpr

A character class is either a character class escape or a character class expression.

A character class expression is a character group surrounded by [ and ] characters. For all character groups G, [G] is a valid character class expression, identifying the set of characters C([G]) = C(G).

Character Class Expression
F.1    charClassExpr   ::=   '[' charGroup ']'

A character group is either a positive character group, a negative character group, or a character class subtraction.

Character Group
F.1    charGroup   ::=   posCharGroup | negCharGroup | charClassSub

A positive character group consists of one or more character ranges or character class escapes, concatenated together. A positive character group identifies the set of characters containing all of the characters in all of the sets identified by its constituent ranges or escapes.

Positive Character Group
F.1    posCharGroup   ::=   ( charRange | charClassEsc )+

For all character ranges R, all character class escapes E, and all positive character groups P:

Valid positive character groups G are:    Identifying the set of characters C(G) containing:
R                                         all characters in C(R)
E                                         all characters in C(E)
RP                                        all characters in C(R) and all characters in C(P)
EP                                        all characters in C(E) and all characters in C(P)

A negative character group is a positive character group preceded by the ^ character. For all positive character groups P, ^P is a valid negative character group, and C(^P) contains all XML characters that are not in C(P).

Negative Character Group
F.1    negCharGroup   ::=   '^' posCharGroup

A character class subtraction is a character class expression subtracted from a positive character group or negative character group, using the - character.

Character Class Subtraction
F.1    charClassSub   ::=   ( posCharGroup | negCharGroup ) '-' charClassExpr

For any positive character group or negative character group G, and any character class expression C, G-C is a valid character class subtraction, identifying the set of all characters in C(G) that are not also in C(C).
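
The three kinds of character group correspond to ordinary set operations: union, complement, and difference. The following is a minimal Python sketch of that correspondence, assuming each set C(...) is modelled as a predicate over code points. All function names are hypothetical, and the restriction of a negative group to XML characters is ignored for brevity.

```python
# Hypothetical model: a character set C(...) is a function from a code point to bool.

def code_range(start, end):
    """Helper: the inclusive range of code points between two characters."""
    lo, hi = ord(start), ord(end)
    return lambda cp: lo <= cp <= hi

def positive_group(*parts):
    """C(P): union of the sets identified by the constituent ranges or escapes."""
    return lambda cp: any(part(cp) for part in parts)

def negative_group(pos):
    """C(^P): every character that is not in C(P).

    Assumption: the caller only passes code points of XML characters.
    """
    return lambda cp: not pos(cp)

def subtraction(group, class_expr):
    """C(G-C): characters in C(G) that are not also in C(C)."""
    return lambda cp: group(cp) and not class_expr(cp)

# [a-z-[aeiou]]: the lowercase ASCII letters minus the vowels.
vowels = positive_group(*(code_range(v, v) for v in "aeiou"))
consonants = subtraction(positive_group(code_range("a", "z")), vowels)
```

With these definitions, `consonants(ord("b"))` holds while `consonants(ord("e"))` does not, which matches the reading of G-C given above.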

A character range R identifies a set of characters C(R) containing all XML characters with UCS code points in a specified range.

Character Range
F.1    charRange   ::=   seRange | XmlCharRef | XmlCharIncDash
F.1    seRange   ::=   charOrEsc '-' charOrEsc
F.1    XmlCharRef   ::=   ( '&#' [0-9]+ ';' ) | ( '&#x' [0-9a-fA-F]+ ';' )
F.1    charOrEsc   ::=   XmlChar | SingleCharEsc
F.1    XmlChar   ::=   [^\#x2D#x5B#x5D]
F.1    XmlCharIncDash   ::=   [^\#x5B#x5D]
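
The XmlCharRef production describes decimal and hexadecimal character references. As an illustration only, the sketch below recognizes both forms with Python's `re` module and returns the referenced character. The names `XML_CHAR_REF` and `decode_char_ref` are hypothetical, and Python's own regular-expression syntax is used here, not the language defined in this appendix.

```python
import re

# One alternative per branch of the XmlCharRef production.
XML_CHAR_REF = re.compile(r"&#([0-9]+);|&#x([0-9a-fA-F]+);")

def decode_char_ref(ref: str) -> str:
    """Return the character named by a decimal or hexadecimal reference."""
    match = XML_CHAR_REF.fullmatch(ref)
    if match is None:
        raise ValueError(f"not a character reference: {ref!r}")
    decimal, hexadecimal = match.groups()
    if decimal is not None:
        return chr(int(decimal))
    return chr(int(hexadecimal, 16))
```

For instance, `decode_char_ref("&#65;")` and `decode_char_ref("&#x41;")` both yield the letter A.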

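A single XML character is a character range that identifies the set of characters containing only itself. All XML characters are valid character ranges, except as follows:

1. The [, ], and \ characters are not valid character ranges.
2. The ^ character is valid at the beginning of a positive character group only if it is part of a negative character group.
3. The - character is a valid character range only at the beginning or end of a positive character group.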

A character range may also be written in the form s-e, identifying the set that contains all XML characters with UCS code points greater than or equal to the code point of s, but not greater than the code point of e.

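s-e is a valid character range iff:

1. s is a single character escape or an XML character;
2. s is not \;
3. if s is the first character in a character class expression, then s is not ^;
4. e is a single character escape or an XML character;
5. e is not \ or [; and
6. the code point of e is greater than or equal to the code point of s.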

NOTE: The code point of a single character escape is the code point of the single character in the set of characters that it identifies.
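
The s-e form can be pictured as an inclusive range of code points. The sketch below is a minimal Python illustration under that reading. The name `char_range` is hypothetical, and rejecting a reversed range (e below s) with an exception is a choice made for this sketch.

```python
def char_range(s: str, e: str) -> range:
    """Code points identified by the range s-e, both ends inclusive."""
    if len(s) != 1 or len(e) != 1:
        raise ValueError("s and e must each be a single character")
    if ord(e) < ord(s):
        # Assumption of this sketch: a reversed range is reported as an error.
        raise ValueError("the code point of e is below the code point of s")
    return range(ord(s), ord(e) + 1)

# The range a-c identifies the three characters a, b, and c.
letters = [chr(cp) for cp in char_range("a", "c")]
```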

F.1.1 Character Class Escapes

A character class escape is a short sequence of characters that identifies a predefined character class. The valid character class escapes are the single character escapes, the multi-character escapes, and the category escapes (including the block escapes).

Character Class Escape
F.1.1    charClassEsc   ::=   ( SingleCharEsc | MultiCharEsc | catEsc | complEsc )

A single character escape identifies a set containing only one character, usually because that character is difficult or impossible to write directly into a regular expression.

Single Character Escape
F.1.1    SingleCharEsc   ::=   '\' [nrt\|.?*+(){}#x2D#x5B#x5D#x5E]

The valid single character escapes are:    Identifying the set of characters C(R) containing:
\n                                         the newline character (#xA)
\r                                         the return character (#xD)
\t                                         the tab character (#x9)
\\                                         \
\|                                         |
\.                                         .
\-                                         -
\^                                         ^
\?                                         ?
\*                                         *
\+                                         +
\{                                         {
\}                                         }
\(                                         (
\)                                         )
\[                                         [
\]                                         ]
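
The table above amounts to a lookup from a two-character escape to the single character it identifies. A minimal Python sketch of that lookup follows. The names `SINGLE_CHAR_ESCAPES` and `resolve_single_char_escape` are hypothetical.

```python
# The three escapes that name a control character.
SINGLE_CHAR_ESCAPES = {"\\n": "\n", "\\r": "\r", "\\t": "\t"}

# Every remaining escape stands for the character that follows the backslash.
for ch in "\\|.-^?*+{}()[]":
    SINGLE_CHAR_ESCAPES["\\" + ch] = ch

def resolve_single_char_escape(escape: str) -> str:
    """Return the one character in the set identified by a single character escape."""
    try:
        return SINGLE_CHAR_ESCAPES[escape]
    except KeyError:
        raise ValueError(f"not a single character escape: {escape!r}") from None
```

The dictionary ends up with one entry per table row, so an unknown sequence such as a backslash followed by q is rejected.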

[UnicodeDB] specifies a number of possible values for the "General Category" property and provides mappings from code points to specific character properties. The set containing all characters that have property X can be identified with a category escape \p{X}. The complement of this set is specified with the category escape \P{X}. ([\P{X}] = [^\p{X}]).

Category Escape
F.1.1    catEsc   ::=   '\p{' charProp '}'
F.1.1    complEsc   ::=   '\P{' charProp '}'
F.1.1    charProp   ::=   IsCategory | IsBlock
NOTE: [UnicodeDB] is subject to future revision. For example, the mapping from code points to character properties might be updated. All minimally conforming processors must support the character properties defined in the version of [UnicodeDB] that is current at the time this specification became a W3C Recommendation. However, implementors are encouraged to support the character properties defined in any future version.
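
Membership in a category escape can be approximated with Python's standard `unicodedata` module, which reports the General Category of a character. The sketch below is illustrative only: the function names are hypothetical, and the Unicode version bundled with the Python interpreter may differ from the version of [UnicodeDB] that a conforming processor must support.

```python
import unicodedata

def in_category(ch: str, prop: str) -> bool:
    # Models \p{prop}: prop is a one-letter group such as "L"
    # or a two-letter value such as "Lu".
    return unicodedata.category(ch).startswith(prop)

def not_in_category(ch: str, prop: str) -> bool:
    # Models \P{prop}: the complement of the set above.
    return not in_category(ch, prop)
```

Under this sketch, the letter A belongs to both \p{L} and \p{Lu}, while the digit 1 belongs to \p{Nd} and to \P{L}.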

The following table specifies the recognized values of the "General Category" property.

Category       Property   Meaning
Letters        L          All Letters
               Lu         uppercase
               Ll         lowercase
               Lt         titlecase
               Lm         modifier
               Lo         other

Marks          M          All Marks
               Mn         nonspacing
               Mc         spacing combining
               Me         enclosing

Numbers        N          All Numbers
               Nd         decimal digit
               Nl         letter
               No         other

Punctuation    P          All Punctuation
               Pc         connector
               Pd         dash
               Ps         open
               Pe         close
               Pi         initial quote (may behave like Ps or Pe depending on usage)
               Pf         final quote (may behave like Ps or Pe depending on usage)
               Po         other

Separators     Z          All Separators
               Zs         space
               Zl         line
               Zp         paragraph

Symbols        S          All Symbols
               Sm         math
               Sc         currency
               Sk         modifier
               So         other

Other          C          All Others
               Cc         control
               Cf         format
               Co         private use
               Cn         not assigned

Categories
F.1.1    IsCategory   ::=   Letters | Marks | Numbers | Punctuation | Separators | Symbols | Others
F.1.1    Letters   ::=   'L' [ultmo]?
F.1.1    Marks   ::=   'M' [nce]?
F.1.1    Numbers   ::=   'N' [dlo]?
F.1.1    Punctuation   ::=   'P' [cdseifo]?
F.1.1    Separators   ::=   'Z' [slp]?
F.1.1    Symbols   ::=   'S' [mcko]?
F.1.1    Others   ::=   'C' [cfon]?
NOTE: The properties mentioned above exclude the Cs property. The Cs property identifies "surrogate" characters, which do not occur at the level of the "character abstraction" that XML instance documents operate on.
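
The IsCategory productions can be transcribed almost literally into a pattern that checks whether a property name is one of the recognized values. The sketch below does this with Python's `re` module. The names `IS_CATEGORY` and `is_category_name` are hypothetical, and the pattern is written in Python's regular-expression syntax.

```python
import re

# One alternative per production: Letters, Marks, Numbers, Punctuation,
# Separators, Symbols, Others.
IS_CATEGORY = re.compile(
    r"L[ultmo]?|M[nce]?|N[dlo]?|P[cdseifo]?|Z[slp]?|S[mcko]?|C[cfon]?"
)

def is_category_name(name: str) -> bool:
    """True if name is a General Category value allowed by IsCategory."""
    return IS_CATEGORY.fullmatch(name) is not None
```

Because the Others production only allows c, f, o, and n after C, the name Cs is rejected, in line with the note above.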

[UnicodeDB] groups code points into a number of blocks such as Basic Latin (i.e., ASCII), Latin-1 Supplement, Hangul Jamo, CJK Compatibility, etc. The set containing all characters that have block name X (with all white space stripped out) can be identified with a block escape \p{IsX}. The complement of this set is specified with the block escape \P{IsX}. ([\P{IsX}] = [^\p{IsX}]).

Block Escape
F.1.1    IsBlock   ::=   'Is' [a-zA-Z0-9#x2D]+

The following table specifies the recognized block names (for more information, see the "Blocks.txt" file in [UnicodeDB]).

Start Code   End Code    Block Name
#x0000       #x007F      BasicLatin
#x0080       #x00FF      Latin-1Supplement
#x0100       #x017F      LatinExtended-A
#x0180       #x024F      LatinExtended-B
#x0250       #x02AF      IPAExtensions
#x02B0       #x02FF      SpacingModifierLetters
#x0300       #x036F      CombiningDiacriticalMarks
#x0370       #x03FF      Greek
#x0400       #x04FF      Cyrillic
#x0530       #x058F      Armenian
#x0590       #x05FF      Hebrew
#x0600       #x06FF      Arabic
#x0700       #x074F      Syriac
#x0780       #x07BF      Thaana
#x0900       #x097F      Devanagari
#x0980       #x09FF      Bengali
#x0A00       #x0A7F      Gurmukhi
#x0A80       #x0AFF      Gujarati
#x0B00       #x0B7F      Oriya
#x0B80       #x0BFF      Tamil
#x0C00       #x0C7F      Telugu
#x0C80       #x0CFF      Kannada
#x0D00       #x0D7F      Malayalam
#x0D80       #x0DFF      Sinhala
#x0E00       #x0E7F      Thai
#x0E80       #x0EFF      Lao
#x0F00       #x0FFF      Tibetan
#x1000       #x109F      Myanmar
#x10A0       #x10FF      Georgian
#x1100       #x11FF      HangulJamo
#x1200       #x137F      Ethiopic
#x13A0       #x13FF      Cherokee
#x1400       #x167F      UnifiedCanadianAboriginalSyllabics
#x1680       #x169F      Ogham
#x16A0       #x16FF      Runic
#x1780       #x17FF      Khmer
#x1800       #x18AF      Mongolian
#x1E00       #x1EFF      LatinExtendedAdditional
#x1F00       #x1FFF      GreekExtended
#x2000       #x206F      GeneralPunctuation
#x2070       #x209F      SuperscriptsandSubscripts
#x20A0       #x20CF      CurrencySymbols
#x20D0       #x20FF      CombiningMarksforSymbols
#x2100       #x214F      LetterlikeSymbols
#x2150       #x218F      NumberForms
#x2190       #x21FF      Arrows
#x2200       #x22FF      MathematicalOperators
#x2300       #x23FF      MiscellaneousTechnical
#x2400       #x243F      ControlPictures
#x2440       #x245F      OpticalCharacterRecognition
#x2460       #x24FF      EnclosedAlphanumerics
#x2500       #x257F      BoxDrawing
#x2580       #x259F      BlockElements
#x25A0       #x25FF      GeometricShapes
#x2600       #x26FF      MiscellaneousSymbols
#x2700       #x27BF      Dingbats
#x2800       #x28FF      BraillePatterns
#x2E80       #x2EFF      CJKRadicalsSupplement
#x2F00       #x2FDF      KangxiRadicals
#x2FF0       #x2FFF      IdeographicDescriptionCharacters
#x3000       #x303F      CJKSymbolsandPunctuation
#x3040       #x309F      Hiragana
#x30A0       #x30FF      Katakana
#x3100       #x312F      Bopomofo
#x3130       #x318F      HangulCompatibilityJamo
#x3190       #x319F      Kanbun
#x31A0       #x31BF      BopomofoExtended
#x3200       #x32FF      EnclosedCJKLettersandMonths
#x3300       #x33FF      CJKCompatibility
#x3400       #x4DB5      CJKUnifiedIdeographsExtensionA
#x4E00       #x9FFF      CJKUnifiedIdeographs
#xA000       #xA48F      YiSyllables
#xA490       #xA4CF      YiRadicals
#xAC00       #xD7A3      HangulSyllables
#xD800       #xDB7F      HighSurrogates
#xDB80       #xDBFF      HighPrivateUseSurrogates
#xDC00       #xDFFF      LowSurrogates
#xE000       #xF8FF      PrivateUse
#xF900       #xFAFF      CJKCompatibilityIdeographs
#xFB00       #xFB4F      AlphabeticPresentationForms
#xFB50       #xFDFF      ArabicPresentationForms-A
#xFE20       #xFE2F      CombiningHalfMarks
#xFE30       #xFE4F      CJKCompatibilityForms
#xFE50       #xFE6F      SmallFormVariants
#xFE70       #xFEFE      ArabicPresentationForms-B
#xFEFF       #xFEFF      Specials
#xFF00       #xFFEF      HalfwidthandFullwidthForms
#xFFF0       #xFFFD      Specials
#x10300      #x1032F     OldItalic
#x10330      #x1034F     Gothic
#x10400      #x1044F     Deseret
#x1D000      #x1D0FF     ByzantineMusicalSymbols
#x1D100      #x1D1FF     MusicalSymbols
#x1D400      #x1D7FF     MathematicalAlphanumericSymbols
#x20000      #x2A6D6     CJKUnifiedIdeographsExtensionB
#x2F800      #x2FA1F     CJKCompatibilityIdeographsSupplement
#xE0000      #xE007F     Tags
#xF0000      #xFFFFD     PrivateUse
#x100000     #x10FFFD    PrivateUse
NOTE: [UnicodeDB] is subject to future revision. For example, the grouping of code points into blocks might be updated. All minimally conforming processors must support the blocks defined in the version of [UnicodeDB] that is current at the time this specification became a W3C Recommendation. However, implementors are encouraged to support the blocks defined in any future version of the Unicode Standard.

For example, the block escape for identifying the ASCII characters is \p{IsBasicLatin}.
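
A block escape reduces to a range check against the start and end code points listed in the table. The Python sketch below illustrates this for a handful of blocks copied from that table. The names `BLOCKS`, `normalize_block_name`, `in_block`, and `not_in_block` are hypothetical, and only four blocks are included to keep the example short.

```python
# A small excerpt of the block table: name -> (start code, end code).
BLOCKS = {
    "BasicLatin": (0x0000, 0x007F),
    "Latin-1Supplement": (0x0080, 0x00FF),
    "Greek": (0x0370, 0x03FF),
    "Cyrillic": (0x0400, 0x04FF),
}

def normalize_block_name(name: str) -> str:
    """Strip all white space, so 'Basic Latin' becomes 'BasicLatin'."""
    return "".join(name.split())

def in_block(ch: str, block_name: str) -> bool:
    # Models \p{IsX} for a block X present in BLOCKS.
    start, end = BLOCKS[normalize_block_name(block_name)]
    return start <= ord(ch) <= end

def not_in_block(ch: str, block_name: str) -> bool:
    # Models \P{IsX}: the complement of the block.
    return not in_block(ch, block_name)
```

Here `in_block("A", "Basic Latin")` holds, whereas an accented letter such as é falls in Latin-1Supplement instead.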

A multi-character escape provides a simple way to identify a commonly used set of characters:

Multi-Character Escape
F.1.1    MultiCharEsc   ::=   '.' | ('\' [sSiIcCdDwW])

Character sequence    Equivalent character class
.                     [^\n\r]
\s                    [#x20\t\n\r]
\S                    [^\s]
\i                    the set of initial name characters, those matched by [Letter] | '_' | ':'
\I                    [^\i]
\c                    the set of name characters, those matched by [NameChar]
\C                    [^\c]
\d                    \p{Nd}
\D                    [^\d]
\w                    [#x0000-#x10FFFF]-[\p{P}\p{Z}\p{C}] (all characters except the set of "punctuation", "separator" and "other" characters)
\W                    [^\w]
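
Several of the multi-character escapes can be expressed directly as membership tests. The sketch below is a minimal Python illustration for ., \s, \d, and \w, using the standard `unicodedata` module. The function names are hypothetical, \i and \c are omitted because they depend on the XML name productions, and the Unicode data shipped with Python may differ from the version of [UnicodeDB] the specification refers to.

```python
import unicodedata

def matches_dot(ch: str) -> bool:
    # . is [^\n\r]
    return ch not in "\n\r"

def matches_s(ch: str) -> bool:
    # \s is [#x20\t\n\r]
    return ch in " \t\n\r"

def matches_d(ch: str) -> bool:
    # \d is \p{Nd}
    return unicodedata.category(ch) == "Nd"

def matches_w(ch: str) -> bool:
    # \w is every character outside the P, Z, and C categories.
    return unicodedata.category(ch)[0] not in "PZC"
```

Two consequences of the definition of \w are worth noting: the plus sign is a math symbol (Sm) and therefore matches, while the underscore is connector punctuation (Pc) and does not.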
NOTE: The regular expression language defined here does not attempt to provide a general solution to "regular expressions" over UCS character sequences. In particular, it does not easily provide for matching sequences of base characters and combining marks. The language is targeted at support of "Level 1" features as defined in [unicodeRegEx]. It is hoped that future versions of this specification will provide support for "Level 2" features.