Amelia writes:
> This also meets the stated purpose more nearly: start parsing anywhere.
> Well, any time you've got namespace bindings (whether to a non-default
> prefix or to the default prefix), you can't start parsing anywhere; you
> have to know what the bindings are. On the other hand, however you got
> to that position in the document, you do know what the context is.

I think Pete's idea requires that the (:whatever ... :) tag comes at the top of the document, before the first root, so that it can reliably be available before any further parsing.

Default namespaces? Scoped IDs?

About using the default namespace as a simplification: it is an idea worth looking at. But I think the problem (for parsing from some random point in a document) is not the prefixes but locating the in-scope declarations. All we strictly need is that the namespace binding be known by the time someone looks at that name.

Now if we assume some SAX-style stream and non-lazy parsing, then we indeed will have access to the context of ancestor elements. I think that is the scenario behind Amelia's comment "On the other hand, however you got to that position in the document, you do know what the context is."

However, suppose our scenario is that we are building a lazy tree, avoiding as far as possible any lexing or parsing of any text until the info is needed. In that case, we really do need to have the declarations at the start (or at the end, or have some index info to allow random access...). Therefore we cannot merely redefine the default namespace at any element we like, I think.

What other kinds of things might be needed to make this practical in some situations? One is this: the language allows multiple branches (like a WF external general entity considered in isolation); let's say each branch root has an @id ID identifier: A1, A2, A3, etc.
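To make the "declarations known up front" point concrete, here is a minimal sketch (my own illustration; the function and table names are hypothetical, not anything proposed in the thread). If all namespace declarations are confined to the document start or a branch root, a lazy parser can build one binding table before touching any element, and expand a prefixed name found at a random offset without ever visiting ancestor elements:

```python
# Hypothetical sketch: a binding table gathered once from declarations that
# are required to sit at the document start.  Expanding a name then needs
# no ancestor context, which is what random-access / lazy parsing wants.

def expand_name(qname, bindings, default_ns=None):
    """Expand 'prefix:local' (or a bare 'local') to a (namespace, local) pair."""
    if ":" in qname:
        prefix, local = qname.split(":", 1)
        if prefix not in bindings:
            raise KeyError(f"unbound prefix {prefix!r}")
        return (bindings[prefix], local)
    return (default_ns, qname)

# Bindings collected in one pass over the document head:
bindings = {"h": "http://example.org/html", "m": "http://example.org/math"}
assert expand_name("h:p", bindings) == ("http://example.org/html", "p")
assert expand_name("p", bindings) == (None, "p")
```

The contrast with the SAX-style scenario is that here the table is complete before any element is parsed, so no forward scan through ancestors is ever needed.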
Now let's say that IDs only need to be unique within each branch, but that IDs can be referenced between branches by prefixing them with the @id of that branch. E.g. if there is some descendant element of A1 with an id ABC, then it can be referenced as idref ABC (meaning the ABC in the current branch; otherwise, say, the first that appears in the document) but also as A1:ABC. I'll call that a scoped ID.

So if I want to open a document and only look at the sub-branch of A1:ABC, the minimal work is: lexically scan all <> (e.g. with SIMD) to find each branch root; parse each root start-tag to find the one whose @id is A1; then under that, scan all <> (we may have memoized the previous scan, of course), find any attribute literals, resolve any reference in them, find any with "ABC", and lex/parse that element as needed.

So this hierarchical ID is one opportunity for parallelism made relatively easy by having multiple branches in the document. But if this then requires us to locate in-scope namespace declarations, the only feasible constant-time places for them are at/on the element that uses the prefix, at/on the current branch root, or at the document start, it seems to me. So this means that we could allow different branches to have different prefix bindings and even different default namespaces (if they are allowed) without a performance impact. It is declaration or redeclaration within a branch that can blow out (in the lazy scenario above, at least). So the issue of having default namespaces or not is not one of performance (except for the cost of tokenizing prefixes).

(I still think that default namespaces [expletive deleted]. IIRC the rationale for the default namespace was so that mooted XHTML documents could be in a namespace but look like HTML, and be edited by HTML editors. I don't think that is remotely a requirement, on the lines both that we don't need to, and that it has not taken off.)
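The scoped-ID lookup above can be sketched in a few lines (again my own illustration; the data model and names are hypothetical). Each branch is modelled as a (branch_id, {local_id: element}) pair, where the inner dict stands in for the lazily-parsed subtree; the point is that a branch-qualified reference like A1:ABC only ever forces work inside the one branch it names:

```python
# Hypothetical sketch: resolve a scoped ID of the form "BRANCH:LOCAL" or a
# bare "LOCAL" against a document modelled as an ordered list of branches.
# In a real lazy parser the inner dicts would be built on demand, per branch.

def resolve_scoped_id(ref, branches, current_branch=None):
    """Return (branch_id, element) for ref, or None if not found."""
    if ":" in ref:
        branch_id, local_id = ref.split(":", 1)
        for bid, ids in branches:
            if bid == branch_id:            # only this branch is ever parsed
                elem = ids.get(local_id)
                return (bid, elem) if elem is not None else None
        return None
    # Bare reference: prefer the current branch, else first match in
    # document order (sorted() is stable, so document order is preserved).
    ordered = sorted(branches, key=lambda b: b[0] != current_branch)
    for bid, ids in ordered:
        if ref in ids:
            return (bid, ids[ref])
    return None

branches = [("A1", {"ABC": "<section/>"}), ("A2", {"ABC": "<other/>"})]
assert resolve_scoped_id("A1:ABC", branches) == ("A1", "<section/>")
assert resolve_scoped_id("ABC", branches, "A2") == ("A2", "<other/>")
assert resolve_scoped_id("ABC", branches) == ("A1", "<section/>")
```

Nothing here depends on any other branch having been lexed, which is the parallelism opportunity being described.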
Regards
Rick

Background thoughts on modes (probably wrong):

A very conventional way to generalize computer language parsers used to be as two independent co-routines:
1. lexical analysis, which goes from text to tokens;
2. the parser proper, which goes from tokens to the parse tree.

I think it should be a technical goal to allow lexing (finding tokens and delimiters) from any milestone point in parallel, and as much parsing as possible. (Please forgive if I flip between automata and grammar terminology below, or make up my own sometimes: it is not an academic paper.)

(We can say a language is modal if the grammar co-routine does anything to set the state of the lexer co-routine. SGML is modal at this level.)

We might distinguish different kinds of lexers by the kind of grammar/automata they can have:

1. Single-state. You have only one state and all allowed symbols leave it unchanged, e.g. document = ( A | B | ... all other allowed characters )*

2. Unique-entry-point state machine (what is the actual name, I wonder?). This is a state machine where, for any given input symbol, all transitions on that symbol go to a single state. This is again one where you can start at any point in the input and know where you are. (We can call this non-modal at the lexical level.)

3. (Deterministic) state machine where at least one input symbol does not always transition to the same state. With this kind of state machine, you cannot know which state you are in if you happen to start from any non-unique-entry symbol. You have to progress forward until you do find a unique-entry symbol. Let's call it a milestone.

4.
Unique-entry-point non-recursive stack machine. This is a stack machine where, for any input symbol, there is only one transition destination (and push/pop), and transitions form a directed graph such that the states allowed at each stack level are unique (so there are no state transitions between the states of different levels, only push/pop transitions). This means the possible stack paths are finite. For any point in the input, you can establish both the state and the necessary call stack: you know where you are. (We can call this non-modal at the lexical level.)

5. (Deterministic) stack machine with at least one transition that does not meet the requirements of 4, e.g. because multiple possible destination states for a symbol may exhibit recursion (direct or indirect). With this kind of stack machine, starting from an arbitrary point in the input stream, you can only know where you are if you start with a symbol that is reachable only from the start state, i.e. a non-modal part of the machine. Otherwise, you don't know the stack depth or which transition was taken. So you have to scan forward until you find a unique-entry-point character which, if non-recursive, allows you to recreate the state.

Now let's imagine versions of these machines which allow parsing backwards as well as forwards, so that from any point in the input we can parse backwards from our milestone. (I *think* it is not necessary that the type of machine that lets you scan forwards is the same as the one that lets you scan backwards.) Providing we can scan back far enough, that covers the lexical level; but some issues repeat at the grammar level. So if we want to be able to process from any point in the input, we have to find both lexer and parser milestones, and we need to know that the lexer state could not have been influenced by signals from the parser co-routine.
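A toy sketch of the milestone idea in categories 2 and 3 (entirely my own illustration; the two-state alphabet and all names are invented, not from the thread). Here '<' always lands the lexer in a tag state and '>' always lands it in a text state, whatever state it was in, so both are unique-entry symbols: a worker dropped at an arbitrary offset scans forward to the next '<', knows its state from there, lexes its region, and the per-chunk token lists are stitched afterwards:

```python
# Toy milestone-parallel lexing.  Because every transition on '<' goes to
# a single state, '<' is a unique-entry symbol: lexer state is known at
# every occurrence, so chunks can be lexed independently and in parallel.

import re
from concurrent.futures import ThreadPoolExecutor

TOKEN = re.compile(r"<[^>]*>|[^<]+")   # a tag, or a run of text

def milestone(text, offset):
    """Index of the first unique-entry symbol '<' at or after offset
    (len(text) if there is none)."""
    i = text.find("<", offset)
    return i if i != -1 else len(text)

def lex_region(text, start, end):
    """Tokenize text[start:end]; start is a milestone, so state is known."""
    return TOKEN.findall(text[start:end])

def parallel_lex(text, nchunks=4):
    # Naive chunk boundaries, each moved forward to the next milestone so
    # that every region begins in a known lexer state.
    rough = [i * len(text) // nchunks for i in range(1, nchunks)]
    starts = [0] + [milestone(text, b) for b in rough]
    ends = starts[1:] + [len(text)]
    with ThreadPoolExecutor() as pool:
        parts = pool.map(lambda se: lex_region(text, *se), zip(starts, ends))
    return [tok for part in parts for tok in part]   # stitch the fragments

doc = "<a>one</a><b>two</b><c>three</c>"
assert parallel_lex(doc) == TOKEN.findall(doc)      # same tokens, same order
```

A category-3 machine lacks this property for some symbols, which is exactly why a worker there must scan forward to a milestone before it can contribute anything.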
Bottom line: the more that the lexer and grammar can be non-modal, the more of the work can be performed in parallel, and the less work may need to be done completing the parse-tree fragments or token lists and stitching them together.

On Fri, Jul 23, 2021 at 1:07 PM Amelia A Lewis <amyzing@t...> wrote:

On Thu, 22 Jul 2021 23:20:37 +0100, Pete Cordell wrote: