Here is an updated grammar and examples. Added are:

  Clark names      {URL}:name
  Link tags        <: :>
  scoped IDREFs    rootid:myid
  short tags

So it is two parts:
This uses some extensions:

  ==              means "if"
  --> $something  means a data type conversion
  ->              means a substitution (handling references)
  .               means a look-up in the lexical context, just a shorthand.

GRAMMAR:

document = (link | comment | pi)* element (element | comment | pi)*
Comment: a document can have multiple branches, not a single root

link = prefix attribute EOM
Comment: a link is a kind of element that is scoped by namespace prefix or branch id:

prefix = LINK-START.TOKEN == TOKEN (could be empty for default)
element = start-tag ( CHARACTER+ | element | comment | pi)* end-tag
start-tag = name attribute* EOM
name = START-TAG.BI_TOKEN --> clark-name
attribute = attname ( typeable-token | ATTRIBUTE-TEXT)
attname = BI_TOKEN --> clark-name
typeable-token = boolean | year | number | symbol | id | prefixed-name
prefixed-name = BI_TOKEN | clark-name == contains ":" --> clark-name
boolean = TOKEN == ("true" | "false") --> $boolean
year = TOKEN --> $yearDate
number = TOKEN --> $integer or $decimal
id = TOKEN --> ID   // iff lexer knows that this is a branch root and attribute name is "id", it can do this
symbol = TOKEN
end-tag = END-TAG.BI_TOKEN EOM --> clark-name EOM
Comment: the name in an end tag does not require a prefix or {} url
comment = COMMENT-TAG.CHARACTER* EOM --> clark-name EOM
pi = piname CHAR* EOM
piname = PI-TAG.BI_TOKEN EOM --> clark-name EOM
clark-name = ("{" .* "}" ":")? TOKEN

Each lexical pass can be thread-parallelized by section. The pass execution can also be parallelized, e.g. by queuing the results of one thread into another as needed. And the recognition can be parallelized using SIMD.

LEXICAL PASS 1: TAG DEMARCATION
TEXT = ws* ("<" MARKUP EOM==">" DATA? )+
Note: A terminating "data" section should be marked as ws.
Note: EOM is the only delimiter signal the lexer needs to pass up to the parser,
but it is only actually needed for start-tags, and would not be part of
an infoset.
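
A minimal Python sketch of pass 1, offered as an illustration rather than the author's implementation. It assumes the whole input is available as one string and that ">" never appears literally inside markup (so the first ">" after a "<" is always EOM). The name `demarcate_tags` and the (markup, data) tuple layout are hypothetical.

```python
import re

# One '<' MARKUP '>' DATA? unit: MARKUP runs to the first '>' (the EOM),
# DATA runs to the next '<' and may be empty.
_UNIT = re.compile(r"<([^>]*)>([^<]*)")

def demarcate_tags(text):
    """Split text into (markup, data) pairs, following
    TEXT = ws* ("<" MARKUP EOM==">" DATA?)+ ."""
    body = text.lstrip()  # the leading ws* carries no information
    pairs = []
    pos = 0
    while pos < len(body):
        m = _UNIT.match(body, pos)
        if m is None:
            raise ValueError(f"expected '<' at offset {pos}")
        pairs.append((m.group(1), m.group(2)))
        pos = m.end()
    if not pairs:
        raise ValueError("no markup found")
    return pairs
```

Each pair is independent of its neighbours, which is what would let a later pass be split across threads by section.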
LEXICAL PASS 2: ATTRIBUTE DEMARCATION
MARKUP = ((?=[^!/?:]) START-TAG | COMPLEX-TAG)
START-TAG = (TAG-TEXT \" ATTRIBUTE-TAG \"? )+
Note: apos not supported as attribute delimiter here.
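
As a rough illustration of pass 2, the Python sketch below classifies one MARKUP string by its first character and, for start-tags, splits it on double quotes into alternating unquoted and quoted runs. The function name `demarcate_attributes` and the segment labels are hypothetical, and it assumes the quoted runs here are the same thing the grammar elsewhere calls ATTRIBUTE-TEXT.

```python
def demarcate_attributes(markup):
    """Return ("COMPLEX-TAG", markup) or ("START-TAG", [(kind, text), ...])."""
    # Mirrors the lookahead (?=[^!/?:]): comments, end-tags, PIs and links
    # start with one of these characters; anything else is a start-tag.
    if markup[:1] in ("!", "/", "?", ":"):
        return ("COMPLEX-TAG", markup)
    segments = []
    # Splitting on '"' alternates: tag text, quoted text, tag text, ...
    # A missing final quote is tolerated, matching the optional \"? above.
    for i, piece in enumerate(markup.split('"')):
        if i % 2 == 1:
            segments.append(("ATTRIBUTE-TEXT", piece))  # may be empty
        elif piece:
            segments.append(("TAG-TEXT", piece))
    return ("START-TAG", segments)
```

For example, `demarcate_attributes('svg:svg width=400 title="a b"')` yields one TAG-TEXT segment holding `svg:svg width=400 title=` followed by one ATTRIBUTE-TEXT segment holding `a b`.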
LEXICAL PASS 3: REFERENCE SUBSTITUTION
( DATA | ATTRIBUTE-TEXT | SIMPLE-TAG | COMPLEX-TAG | LINK-TAG) -> (CHARACTER | NUMERIC-CHARACTER-REFERENCE -> CHARACTER | ENTITY-REFERENCE -> CHARACTER+)*
Note: numeric character reference is hex numeric character reference to unicode number a la XML. No decimal reference. I didn't bother to put the production in, but it looks for &.
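
A small Python sketch of the substitution step, under stated assumptions: references are written as in XML (`&#x...;` for hex numeric references, `&name;` for entities), and the entity table holds only the five predefined XML entities. The post gives neither the production nor the entity set, so the `ENTITIES` table and the name `substitute_references` are placeholders.

```python
import re

# Assumed entity table (not specified in the post).
ENTITIES = {"lt": "<", "gt": ">", "amp": "&", "quot": '"', "apos": "'"}

# Hex numeric reference (group 1) or named entity reference (group 2).
# Decimal references such as &#65; deliberately do not match.
_REF = re.compile(r"&(?:#x([0-9A-Fa-f]+)|([^\s&;#]+));")

def substitute_references(text):
    """Replace each reference once, left to right; results are not rescanned."""
    def replace(m):
        if m.group(1) is not None:
            return chr(int(m.group(1), 16))
        name = m.group(2)
        if name not in ENTITIES:
            raise KeyError(f"unknown entity: {name}")
        return ENTITIES[name]
    return _REF.sub(replace, text)
```

Because the substitution is a single pass, `&amp;lt;` becomes the literal text `&lt;` rather than `<`, which keeps the pass safe to run per section.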
LEXICAL PASS 4: TOKENIZATION
TAG-TEXT = ( ws | "=" | BI_TOKEN )+
COMPLEX-TAG = END-TAG | COMMENT-TAG | PI-TAG | LINK-TAG
PI-TAG = "?" BI_TOKEN ws* CHARACTER* "?"
END-TAG = "/" BI_TOKEN ws*
LINK-TAG = ":" TOKEN? ws* (TAG-TEXT \" ATTRIBUTE-TAG \"? )+ ":"
BI_TOKEN = [^\s<"=]+

So an example: the Purchase order example could come in without change, but here I have some typed recognition of numbers, dates and tokens in attributes.

<?hello abcd ?>
<!-- comment -->
<svg:svg height=100 width=100 id=ABC>
<!-- Below we have examples of a full QName used, a scoped link, and a dropped-prefix end-tag -->
<svg:svg width=400 height=110>
<{http://www.example.com/link}:somelink to=ABC:XYZ ></somelink>
</svg>
<!-- note: end of document -->

On Thu, Jul 22, 2021 at 8:06 PM Rick Jelliffe <rjelliffe@a...> wrote: