A character class is an atom R that identifies a set of characters C(R). The set of strings L(R) denoted by a character class R contains one single-character string "c" for each character c in C(R).

Character Class

F.1 charClass ::= nt-charClassEsc | nt-charClassExpr

A character class is either a character class escape or a character class expression.

A character class expression is a character group surrounded by [ and ] characters. For all character groups G, [G] is a valid character class expression, identifying the set of characters C([G]) = C(G).

Character Class Expression

F.1 charClassExpr ::= '[' nt-charGroup ']'

A character group is either a positive character group, a negative character group, or a character class subtraction.

Character Group

F.1 charGroup ::= nt-posCharGroup | nt-negCharGroup | nt-charClassSub

A positive character group consists of one or more character ranges or character class escapes, concatenated together. A positive character group identifies the set of characters containing all of the characters in all of the sets identified by its constituent ranges or escapes.

Positive Character Group

F.1 posCharGroup ::= ( nt-charRange | nt-charClassEsc )+

For all character ranges R, all character class escapes E, and all positive character groups P:

Valid positive character group G	Set of characters C(G)
R	all characters in C(R)
E	all characters in C(E)
RP	all characters in C(R) and all characters in C(P)
EP	all characters in C(E) and all characters in C(P)

A negative character group is a positive character group preceded by the ^ character. For all positive character groups P, ^P is a valid negative character group, and C(^P) contains all XML characters that are not in C(P).

Negative Character Group

F.1 negCharGroup ::= '^' nt-posCharGroup

A character class subtraction is a character class expression subtracted from a positive character group or negative character group, using the - character.

Character Class Subtraction

F.1 charClassSub ::= ( nt-posCharGroup | nt-negCharGroup ) '-' nt-charClassExpr

For any positive character group or negative character group G, and any character class expression C, G-C is a valid character class subtraction, identifying the set of all characters in C(G) that are not also in C(C).
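The set semantics of positive groups, negative groups, and subtraction can be sketched by modelling each set C(...) as a membership predicate. This is a minimal illustration, not a conforming processor: the names `is_xml_char`, `char_range`, `pos_group`, `neg_group`, and `subtract` are hypothetical, and `is_xml_char` assumes the Char production of XML 1.0 as the universe of XML characters.

```python
from typing import Callable

# Hypothetical model: a set of characters C(X) is a membership predicate.
CharSet = Callable[[str], bool]

def is_xml_char(c: str) -> bool:
    """Assumed universe: the Char production of XML 1.0."""
    cp = ord(c)
    return (cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)

def char_range(s: str, e: str) -> CharSet:
    """C(s-e): code points from s through e, inclusive."""
    return lambda c: ord(s) <= ord(c) <= ord(e)

def pos_group(*parts: CharSet) -> CharSet:
    """C(P): union of the sets identified by the constituent parts."""
    return lambda c: any(p(c) for p in parts)

def neg_group(p: CharSet) -> CharSet:
    """C(^P): all XML characters that are not in C(P)."""
    return lambda c: is_xml_char(c) and not p(c)

def subtract(g: CharSet, c_expr: CharSet) -> CharSet:
    """C(G-C): characters in C(G) that are not also in C(C)."""
    return lambda c: g(c) and not c_expr(c)

# [a-z-[aeiou]]: lowercase ASCII letters minus the vowels
vowels = pos_group(*(char_range(v, v) for v in "aeiou"))
consonants = subtract(pos_group(char_range("a", "z")), vowels)

# [^0-9]: every XML character that is not an ASCII digit
non_digit = neg_group(pos_group(char_range("0", "9")))
```

Under this model, `consonants("b")` holds while `consonants("a")` does not, and `non_digit` rejects both `"5"` and characters outside the assumed XML character universe such as U+0000.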

A character range R identifies a set of characters C(R) containing all XML characters with UCS code points in a specified range.

Character Range

F.1 charRange ::= nt-seRange | nt-XmlCharRef | nt-XmlCharIncDash
F.1 seRange ::= nt-charOrEsc '-' nt-charOrEsc
F.1 XmlCharRef ::= ( '&#' [0-9]+ ';' ) | ( '&#x' [0-9a-fA-F]+ ';' )
F.1 charOrEsc ::= nt-XmlChar | nt-SingleCharEsc
F.1 XmlChar ::= [^\#x2D#x5B#x5D]
F.1 XmlCharIncDash ::= [^\#x5B#x5D]
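As an illustration of the XmlCharRef production, the following sketch decodes a decimal or hexadecimal character reference into the character it denotes. The function name `decode_char_ref` is hypothetical, and the sketch reads the hexadecimal prefix as `&#x` with no embedded space.

```python
import re

# Transliteration of: ( '&#' [0-9]+ ';' ) | ( '&#x' [0-9a-fA-F]+ ';' )
XML_CHAR_REF = re.compile(r"&#(?:([0-9]+)|x([0-9a-fA-F]+));")

def decode_char_ref(ref: str) -> str:
    """Return the character denoted by a decimal or hexadecimal reference."""
    m = XML_CHAR_REF.fullmatch(ref)
    if m is None:
        raise ValueError(f"not a character reference: {ref!r}")
    dec, hexa = m.groups()
    # Exactly one of the two groups is set, depending on the form used.
    return chr(int(dec) if dec is not None else int(hexa, 16))
```

For example, `decode_char_ref("&#65;")` and `decode_char_ref("&#x41;")` both denote the character `A`.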

A single XML character is a character range that identifies the set of characters containing only itself. All XML characters are valid character ranges, except as follows:

The [, ], and \ characters are not valid character ranges;
The ^ character is only valid at the beginning of a positive character group if it is part of a negative character group; and
The - character is a valid character range only at the beginning or end of a positive character group.

A character range may also be written in the form s-e, identifying the set that contains all XML characters with UCS code points greater than or equal to the code point of s, but not greater than the code point of e.

s-e is a valid character range iff:

s is a single character escape, or an XML character;
s is not \;
If s is the first character in a character class expression, then s is not ^;
e is a single character escape, or an XML character;
e is not \ or [; and
The code point of e is greater than or equal to the code point of s.

NOTE:
The code point of a single character escape is the code point of the single character in the set of characters that it identifies.
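The validity conditions for s-e can be sketched as a small check. This is an illustration under stated assumptions: the names `code_point` and `is_valid_range` and the `s_is_first` flag are hypothetical, each of s and e is passed as a string holding either one character or a two-character single character escape, and the sketch does not verify that a plain character is a legal XML character.

```python
# Characters that may follow '\' in a single character escape.
ESCAPABLE = set("nrt\\|.?*+(){}-[]^")
# Escapes whose identified character differs from the escaped letter.
ESCAPED = {"n": "\n", "r": "\r", "t": "\t"}

def code_point(token: str) -> int:
    """Code point of an XML character or of a single character escape."""
    if len(token) == 2 and token[0] == "\\" and token[1] in ESCAPABLE:
        ch = token[1]
        return ord(ESCAPED.get(ch, ch))
    if len(token) == 1:
        return ord(token)
    raise ValueError(f"not a character or single character escape: {token!r}")

def is_valid_range(s: str, e: str, s_is_first: bool = False) -> bool:
    """Check the listed conditions for s-e.

    s_is_first marks s as the first character in the character class
    expression, where a bare ^ is not allowed as the start of a range.
    """
    if s == "\\" or e in ("\\", "["):
        return False
    if s_is_first and s == "^":
        return False
    try:
        return code_point(e) >= code_point(s)
    except ValueError:
        return False
```

Under these assumptions `is_valid_range("a", "z")` holds, `is_valid_range("z", "a")` fails the code point ordering, and `is_valid_range("\\t", " ")` holds because the tab character (#x9) precedes the space character (#x20).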

Character Class Escapes


A character class escape is a short sequence of characters that identifies a predefined character class. The valid character class escapes are the single character escapes, the multi-character escapes, and the category escapes (including the block escapes).

Character Class Escape

F.1.1 charClassEsc ::= ( nt-SingleCharEsc | nt-MultiCharEsc | nt-catEsc | nt-complEsc )

A single character escape identifies a set containing only one character, usually because that character is difficult or impossible to write directly into a regular expression.

Single Character Escape

F.1.1 SingleCharEsc ::= '\' [nrt\|.?*+(){}#x2D#x5B#x5D#x5E]

The valid single character escapes are:	Identifying the set of characters C(R) containing:
`\n`	the newline character (#xA)
`\r`	the return character (#xD)
`\t`	the tab character (#x9)
`\\`	\
`\|`	|
`\.`	.
`\-`	-
`\^`	^
`\?`	?
`\*`	*
`\+`	+
`\{`	{
`\}`	}
`\(`	(
`\)`	)
`\[`	[
`\]`	]
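The SingleCharEsc production can be transliterated into a Python `re` pattern to recognize single character escapes inside a regular expression string. This is a sketch only; the names `SINGLE_CHAR_ESC` and `find_single_char_escapes` are hypothetical, and the scan relies on left-to-right matching so that an escaped backslash is consumed before the character that follows it.

```python
import re

# Transliteration of: '\' [nrt\|.?*+(){}#x2D#x5B#x5D#x5E]
SINGLE_CHAR_ESC = re.compile(r"\\[nrt\\|.?*+(){}\-\[\]^]")

def find_single_char_escapes(pattern: str) -> list:
    """Return the single character escapes found in pattern, in order."""
    return SINGLE_CHAR_ESC.findall(pattern)
```

For instance, in the pattern `a\.b\n\d` the sketch reports `\.` and `\n` but not `\d`, which is a multi-character escape rather than a single character escape.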

[UnicodeDB] specifies a number of possible values for the "General Category" property and provides mappings from code points to specific character properties. The set containing all characters that have property X can be identified with a category escape \p{X}. The complement of this set is specified with the category escape \P{X}; that is, [\P{X}] = [^\p{X}].

Category Escape

F.1.1 catEsc ::= '\p{' nt-charProp '}'
F.1.1 complEsc ::= '\P{' nt-charProp '}'
F.1.1 charProp ::= nt-IsCategory | nt-IsBlock

NOTE:
[UnicodeDB] is subject to future revision. For example, the mapping from code points to character properties might be updated. All minimally conforming processors must support the character properties defined in the version of [UnicodeDB] that is current at the time this specification became a W3C Recommendation. However, implementors are encouraged to support the character properties defined in any future version.

The following table specifies the recognized values of the "General Category" property.

Category	Property	Meaning
Letters	L	All Letters
Letters	Lu	uppercase
Letters	Ll	lowercase
Letters	Lt	titlecase
Letters	Lm	modifier
Letters	Lo	other
Marks	M	All Marks
Marks	Mn	nonspacing
Marks	Mc	spacing combining
Marks	Me	enclosing
Numbers	N	All Numbers
Numbers	Nd	decimal digit
Numbers	Nl	letter
Numbers	No	other
Punctuation	P	All Punctuation
Punctuation	Pc	connector
Punctuation	Pd	dash
Punctuation	Ps	open
Punctuation	Pe	close
Punctuation	Pi	initial quote (may behave like Ps or Pe depending on usage)
Punctuation	Pf	final quote (may behave like Ps or Pe depending on usage)
Punctuation	Po	other
Separators	Z	All Separators
Separators	Zs	space
Separators	Zl	line
Separators	Zp	paragraph
Symbols	S	All Symbols
Symbols	Sm	math
Symbols	Sc	currency
Symbols	Sk	modifier
Symbols	So	other
Other	C	All Others
Other	Cc	control
Other	Cf	format
Other	Co	private use
Other	Cn	not assigned

Categories

F.1.1 IsCategory ::= nt-Letters | nt-Marks | nt-Numbers | nt-Punctuation | nt-Separators | nt-Symbols | nt-Others
F.1.1 Letters ::= 'L' [ultmo]?
F.1.1 Marks ::= 'M' [nce]?
F.1.1 Numbers ::= 'N' [dlo]?
F.1.1 Punctuation ::= 'P' [cdseifo]?
F.1.1 Separators ::= 'Z' [slp]?
F.1.1 Symbols ::= 'S' [mcko]?
F.1.1 Others ::= 'C' [cfon]?

NOTE:
The properties mentioned above exclude the Cs property. The Cs property identifies "surrogate" characters, which do not occur at the level of the "character abstraction" that XML instance documents operate on.
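A category escape can be approximated with Python's standard `unicodedata` module, as in the following sketch. The names `IS_CATEGORY` and `in_category` are hypothetical. Two assumptions apply: the Unicode database bundled with the Python interpreter is generally a later version than the one this specification references, so individual characters may be classified differently, and the one-letter property C is matched by prefix, which would also accept surrogate code points (Cs) even though they do not occur in XML documents.

```python
import re
import unicodedata

# Transliteration of the IsCategory grammar; Cs is deliberately not accepted.
IS_CATEGORY = re.compile(
    r"L[ultmo]?|M[nce]?|N[dlo]?|P[cdseifo]?|Z[slp]?|S[mcko]?|C[cfon]?"
)

def in_category(c: str, prop: str, complement: bool = False) -> bool:
    r"""Membership test for \p{prop}, or for \P{prop} when complement is True."""
    if not IS_CATEGORY.fullmatch(prop):
        raise ValueError(f"unrecognized category: {prop!r}")
    # unicodedata.category returns a two-letter code such as 'Lu';
    # a one-letter property such as 'L' matches by prefix.
    member = unicodedata.category(c).startswith(prop)
    return member != complement
```

With this sketch, `in_category("A", "Lu")` and `in_category("A", "L")` both hold, while `in_category("A", "Cs")` raises `ValueError` because Cs is not a recognized property.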

[UnicodeDB] groups code points into a number of blocks such as Basic Latin (i.e., ASCII), Latin-1 Supplement, Hangul Jamo, CJK Compatibility, etc. The set containing all characters that have block name X (with all white space stripped out) can be identified with a block escape \p{IsX}. The complement of this set is specified with the block escape \P{IsX}; that is, [\P{IsX}] = [^\p{IsX}].

Block Escape

F.1.1 IsBlock ::= 'Is' [a-zA-Z0-9#x2D]+

The following table specifies the recognized block names (for more information, see the "Blocks.txt" file in [UnicodeDB]).

Start Code	End Code	Block Name
#x0000	#x007F	BasicLatin
#x0080	#x00FF	Latin-1Supplement
#x0100	#x017F	LatinExtended-A
#x0180	#x024F	LatinExtended-B
#x0250	#x02AF	IPAExtensions
#x02B0	#x02FF	SpacingModifierLetters
#x0300	#x036F	CombiningDiacriticalMarks
#x0370	#x03FF	Greek
#x0400	#x04FF	Cyrillic
#x0530	#x058F	Armenian
#x0590	#x05FF	Hebrew
#x0600	#x06FF	Arabic
#x0700	#x074F	Syriac
#x0780	#x07BF	Thaana
#x0900	#x097F	Devanagari
#x0980	#x09FF	Bengali
#x0A00	#x0A7F	Gurmukhi
#x0A80	#x0AFF	Gujarati
#x0B00	#x0B7F	Oriya
#x0B80	#x0BFF	Tamil
#x0C00	#x0C7F	Telugu
#x0C80	#x0CFF	Kannada
#x0D00	#x0D7F	Malayalam
#x0D80	#x0DFF	Sinhala
#x0E00	#x0E7F	Thai
#x0E80	#x0EFF	Lao
#x0F00	#x0FFF	Tibetan
#x1000	#x109F	Myanmar
#x10A0	#x10FF	Georgian
#x1100	#x11FF	HangulJamo
#x1200	#x137F	Ethiopic
#x13A0	#x13FF	Cherokee
#x1400	#x167F	UnifiedCanadianAboriginalSyllabics
#x1680	#x169F	Ogham
#x16A0	#x16FF	Runic
#x1780	#x17FF	Khmer
#x1800	#x18AF	Mongolian
#x1E00	#x1EFF	LatinExtendedAdditional
#x1F00	#x1FFF	GreekExtended
#x2000	#x206F	GeneralPunctuation
#x2070	#x209F	SuperscriptsandSubscripts
#x20A0	#x20CF	CurrencySymbols
#x20D0	#x20FF	CombiningMarksforSymbols
#x2100	#x214F	LetterlikeSymbols
#x2150	#x218F	NumberForms
#x2190	#x21FF	Arrows
#x2200	#x22FF	MathematicalOperators
#x2300	#x23FF	MiscellaneousTechnical
#x2400	#x243F	ControlPictures
#x2440	#x245F	OpticalCharacterRecognition
#x2460	#x24FF	EnclosedAlphanumerics
#x2500	#x257F	BoxDrawing
#x2580	#x259F	BlockElements
#x25A0	#x25FF	GeometricShapes
#x2600	#x26FF	MiscellaneousSymbols
#x2700	#x27BF	Dingbats
#x2800	#x28FF	BraillePatterns
#x2E80	#x2EFF	CJKRadicalsSupplement
#x2F00	#x2FDF	KangxiRadicals
#x2FF0	#x2FFF	IdeographicDescriptionCharacters
#x3000	#x303F	CJKSymbolsandPunctuation
#x3040	#x309F	Hiragana
#x30A0	#x30FF	Katakana
#x3100	#x312F	Bopomofo
#x3130	#x318F	HangulCompatibilityJamo
#x3190	#x319F	Kanbun
#x31A0	#x31BF	BopomofoExtended
#x3200	#x32FF	EnclosedCJKLettersandMonths
#x3300	#x33FF	CJKCompatibility
#x3400	#x4DB5	CJKUnifiedIdeographsExtensionA
#x4E00	#x9FFF	CJKUnifiedIdeographs
#xA000	#xA48F	YiSyllables
#xA490	#xA4CF	YiRadicals
#xAC00	#xD7A3	HangulSyllables
#xD800	#xDB7F	HighSurrogates
#xDB80	#xDBFF	HighPrivateUseSurrogates
#xDC00	#xDFFF	LowSurrogates
#xE000	#xF8FF	PrivateUse
#xF900	#xFAFF	CJKCompatibilityIdeographs
#xFB00	#xFB4F	AlphabeticPresentationForms
#xFB50	#xFDFF	ArabicPresentationForms-A
#xFE20	#xFE2F	CombiningHalfMarks
#xFE30	#xFE4F	CJKCompatibilityForms
#xFE50	#xFE6F	SmallFormVariants
#xFE70	#xFEFE	ArabicPresentationForms-B
#xFEFF	#xFEFF	Specials
#xFF00	#xFFEF	HalfwidthandFullwidthForms
#xFFF0	#xFFFD	Specials
#x10300	#x1032F	OldItalic
#x10330	#x1034F	Gothic
#x10400	#x1044F	Deseret
#x1D000	#x1D0FF	ByzantineMusicalSymbols
#x1D100	#x1D1FF	MusicalSymbols
#x1D400	#x1D7FF	MathematicalAlphanumericSymbols
#x20000	#x2A6D6	CJKUnifiedIdeographsExtensionB
#x2F800	#x2FA1F	CJKCompatibilityIdeographsSupplement
#xE0000	#xE007F	Tags
#xF0000	#xFFFFD	PrivateUse
#x100000	#x10FFFD	PrivateUse

NOTE:
[UnicodeDB] is subject to future revision. For example, the grouping of code points into blocks might be updated. All minimally conforming processors must support the blocks defined in the version of [UnicodeDB] that is current at the time this specification became a W3C Recommendation. However, implementors are encouraged to support the blocks defined in any future version of the Unicode Standard.

For example, the block escape for identifying the ASCII characters is \p{IsBasicLatin}.
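The construction of a block escape from a block name, and the membership test it implies, can be sketched as follows. The names `BLOCKS`, `block_escape`, and `in_block` are hypothetical, and `BLOCKS` holds only a four-row excerpt of the table of recognized block names; a real processor would need the complete table.

```python
# Excerpt of the block table: stripped block name -> (start, end) code points.
BLOCKS = {
    "BasicLatin": (0x0000, 0x007F),
    "Latin-1Supplement": (0x0080, 0x00FF),
    "Greek": (0x0370, 0x03FF),
    "Cyrillic": (0x0400, 0x04FF),
}

def block_escape(block_name: str, complement: bool = False) -> str:
    r"""Build \p{IsX} (or \P{IsX}) with all white space stripped from the name."""
    stripped = "".join(block_name.split())
    return ("\\P" if complement else "\\p") + "{Is" + stripped + "}"

def in_block(c: str, stripped_name: str) -> bool:
    """True if c lies in the code point range of the named block."""
    start, end = BLOCKS[stripped_name]
    return start <= ord(c) <= end
```

Here `block_escape("Basic Latin")` yields `\p{IsBasicLatin}`, and `in_block("A", "BasicLatin")` holds because `A` is #x41.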

A multi-character escape provides a simple way to identify a commonly used set of characters:

Multi-Character Escape

F.1.1 MultiCharEsc ::= '.' | ('\' [sSiIcCdDwW])

Character sequence	Equivalent character class
.	[^\n\r]
\s	[#x20\t\n\r]
\S	[^\s]
\i	the set of initial name characters, those matched by [Letter] | '_' | ':'
\I	[^\i]
\c	the set of name characters, those matched by [NameChar]
\C	[^\c]
\d	\p{Nd}
\D	[^\d]
\w	[#x0000-#x10FFFF]-[\p{P}\p{Z}\p{C}] (all characters except the set of "punctuation", "separator" and "other" characters)
\W	[^\w]
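Several of the multi-character escapes can be sketched as predicates over single characters, using Python's `unicodedata` module for the General Category lookups. The function names are hypothetical, each function assumes a one-character string, and the bundled Unicode database may differ from the version this specification references. The \i and \c escapes are omitted because they depend on the XML Name productions.

```python
import unicodedata

def matches_dot(c: str) -> bool:
    """'.' is equivalent to [^\\n\\r]."""
    return c not in "\n\r"

def matches_s(c: str) -> bool:
    """\\s is equivalent to [#x20\\t\\n\\r]."""
    return c in " \t\n\r"

def matches_d(c: str) -> bool:
    """\\d is equivalent to \\p{Nd}."""
    return unicodedata.category(c) == "Nd"

def matches_w(c: str) -> bool:
    """\\w is every character outside the P, Z and C categories."""
    return unicodedata.category(c)[0] not in "PZC"

# '_' has General Category Pc, so it falls outside \w as tabulated above,
# whereas '+' (Sm) falls inside it.
underscore_is_word = matches_w("_")
plus_is_word = matches_w("+")
```

These definitions differ from the similarly named escapes in many other regular expression dialects: here \s covers only four characters, and \w is defined by exclusion, so symbols such as `+` are included while the underscore is not.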

NOTE:
The regular expression language defined here does not attempt to provide a general solution to "regular expressions" over UCS character sequences. In particular, it does not easily provide for matching sequences of base characters and combining marks. The language is targeted at support of "Level 1" features as defined in [unicodeRegEx]. It is hoped that future versions of this specification will provide support for "Level 2" features.
