At 01:38 17/01/98 -0600, Jeremie Miller wrote:
>> [paragraphs removed]
>
>>So you have the following choice:
>> - encode the *whole* spec (and nothing but the spec - i.e. no tricky
>>non-compliant extensions) and give yourself the label "conforming XML
>>tool".
>> - encode the bits you feel are cost effective and label it "processes most
>>XML documents, but gives 'Sorry' messages for some".

[... picking up some of David Durand's concerns ...]

I appreciate the strength of David's arguments and personally will wish to work with totally XML-compliant software. However it is a *lot* of work. One design goal (4 in the spec) is that it should be "easy to write programs which process XML documents". If that goal is interpreted as "it is easy to write software that processes *all* XML documents, throwing errors wherever one is required", then it is already lost.

For example, James Clark has come up with about 140 carefully incorrect XML documents for testing parsers. DavidM has said that AElfred spots 80% of them, but that catching the other 20% would increase AElfred's size and decrease its speed. [And probably involve the author in a lot more work.] I'm not making a moral judgment - simply reporting facts in the PD.

Personally I think that XML is overly complex for goal 4 and have been privileged to be able to say so on numerous occasions. However I accept the consensus and will do what I can to support it.

However, I think there will be domains where the full functionality (or at least the full syntax) of XML will not be used. In that case there will be "simple tools" that process XML documents - not *all* XML documents, but a lot. It seems to me reasonable that these tools can tell the user when they can't process a document. It is common for compilers to say "sorry, this expression is just too complicated for me to deal with - you'll have to break it up a bit". I can see a tool saying "sorry, I don't deal with CDATA; please try another parser".
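[A modern illustration, not part of the original post: a minimal sketch of such a "Sorry" tool in present-day Python, using the stdlib expat bindings. The function name and message wording are mine; a real tool would of course check for more than CDATA.]

```python
import xml.parsers.expat

def parse_or_apologize(xml_text):
    """Parse the document, but decline politely if it uses CDATA sections."""
    parser = xml.parsers.expat.ParserCreate()
    saw_cdata = []
    # expat invokes this handler (with no arguments) at each <![CDATA[ ... ]]>
    parser.StartCdataSectionHandler = lambda: saw_cdata.append(True)
    parser.Parse(xml_text, True)  # raises ExpatError if not well-formed
    if saw_cdata:
        return "sorry, I don't deal with CDATA; please try another parser"
    return "ok"

print(parse_or_apologize("<doc><![CDATA[raw <markup>]]></doc>"))
print(parse_or_apologize("<doc>plain text</doc>"))
```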
[The reason I have several parsers running under JUMBO is that - at this stage - they all have things they can't do...]

The WG has (I think rightly) said that there should not be conformance levels in XML. [For those not familiar with SGML, there are a large number of different options, many of which are not supported by many parsers.] But I suspect there will be a number of tools which don't support the whole spec - this is a neutral statement. And there will be a number of documents that don't use the whole functionality of XML - this is also a neutral statement. We have frequently talked about the Desperate Perl Hacker writing tools which are sufficient to process a class of XML documents, but not all. I can see convergence between these activities.

>More questions/issues then:
>
>A well-formed XML document is not required to have a DTD, internal or
>external, correct?

Correct. The inverse can be stated as "if a document does not have a DTD subset, then it can only be well-formed".

>Is a well-formed parser not an XML parser that does not
>have access to or does not process a DTD, internal or external? I guess I
>haven't found a clear definition of what a well-formed parser is yet.

I think we are all looking for enlightenment in this area. There are at least the following categories:

A  Document + DTD + request to validate document. Requires a validating parser.
B  Document + full DTD but no request to validate.
C  Document + parts of a DTD (e.g. a few ELEMENTs and ATTLISTs, maybe an external subset which covers some of the ELEMENTs in the document).
D  Document with no internal or external subset. Can only be well-formed.

What the difference is between A and B is not clear to me. IMO there are several people/robots who can urge that a document be validated (author/server/client/application/reader). What is clear is that *all the information in the DTDs must be processed and the document altered accordingly*.
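[To make categories C and D concrete, here is a small sketch using Python's stdlib minidom - my example, not from the post. A no-subset document (D) can only be checked for well-formedness; in a document with a partial internal subset (C), the declared entity is applied by the parser before the application sees the content.]

```python
from xml.dom.minidom import parseString

# Category D: no internal or external subset -- can only be well-formed.
d = parseString("<FOO><BAR/></FOO>")
print(d.documentElement.tagName)            # FOO

# Category C: parts of a DTD. The information in the subset MUST be
# applied: the parser expands &greet; before handing over the content.
c = parseString('<!DOCTYPE FOO [<!ENTITY greet "hello">]><FOO>&greet;</FOO>')
print(c.documentElement.firstChild.data)    # hello
```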
Note that Lark and AElfred both throw errors for

  <!DOCTYPE FOO SYSTEM "bar.dtd">

if bar.dtd cannot be found. This is reasonable (though frustrating) since bar.dtd can alter the information in the document.

NOTE BTW: if an entity is declared in both the internal and external subsets, then the one in the internal subset is processed first. [This fooled me for some time because the internal subset occurs 'later' in the physical document...]

C is similar to B, but validation is not possible. It is *essential* that if ATTLISTs and ENTITYs (and NOTATIONs) exist, then the information in them MUST be applied to the document. I think it is here that the differences of opinion occur. If I get a document with a NOTATION, I may just say "sorry, I can't grok NOTATION, so bomb out", but others see this as an unacceptable position.

D seems to me entirely acceptable. If there is no DTD subset, then a parser can be cleanly built which deals with exactly what is potentially carried in well-formed/no_subset documents. [You can see we need a terminology here :-)]

>If this is true, then a well-formed parser doesn't even have to acknowledge
>that entities exist except for the built in ones,

NO. *IFF* an ENTITY is declared (case C), the parser MUST process it. Otherwise the content of the emitted information is incorrect. If a WF document contains a reference to an entity (e.g. &foo;) then a 'correct' document automatically falls into (C). A WF/no_subset parser can then only report that an undeclared entity was discovered (and that even if it had been declared, that parser couldn't manage it).

>and absolutely all whitespace is preserved, right?

Yes :-). The *application* can throw this away; the parser can't. So JUMBO will soon have the options "discard all PCDATA elements which contain only whitespace", or "ignore all [these elements] when emitted by a parser." A human has to press the button to make this happen :-).
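[Both points above can be checked with stdlib Python - again my sketch, not from the post: a reference to an undeclared entity is a fatal error, and all whitespace reaches the application. The strip_ws_only helper is hypothetical, a rough analogue of the JUMBO "discard whitespace-only PCDATA" option, applied at the application level after parsing.]

```python
import xml.parsers.expat
from xml.dom.minidom import parseString

# An undeclared entity reference is a fatal error for a WF parser.
try:
    parseString("<FOO>&foo;</FOO>")
except xml.parsers.expat.ExpatError as err:
    print("rejected:", err)

# The parser preserves all whitespace; only the application may discard it.
doc = parseString("<FOO>\n  <BAR/>\n</FOO>")
print(len(doc.documentElement.childNodes))  # 3: text, BAR, text

def strip_ws_only(node):
    """Hypothetical application-level option: drop whitespace-only text nodes."""
    for child in list(node.childNodes):
        if child.nodeType == child.TEXT_NODE and not child.data.strip():
            node.removeChild(child)
        elif child.nodeType == child.ELEMENT_NODE:
            strip_ws_only(child)

strip_ws_only(doc.documentElement)
print(len(doc.documentElement.childNodes))  # 1: just BAR
```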
NOTE that it is possible that a subset in a (C) document can contain enough information to detect what the parser could do with whitespace. Whether it should *act* on that information is unclear. For example, the single declaration:

  <!ELEMENT FOO (PLUGH|BAR)*>

says that FOO contains element content and therefore cannot contain PCDATA. Any whitespace PCDATA is therefore "ignorable". This information is not sufficient to *validate* the document (there are no declarations for BAR and PLUGH, for example). The declaration

  <!ELEMENT FOO ANY>

allows PCDATA, so doesn't help much. Some people have argued for a content model which includes something like #ANYNONPCDATA, but that is not legal XML.

P.

Peter Murray-Rust, Director Virtual School of Molecular Sciences, domestic net connection
VSMS http://www.nottingham.ac.uk/vsms, Virtual Hyperglossary http://www.venus.co.uk/vhg

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev@i...
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To (un)subscribe, mailto:majordomo@i... the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo@i... the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa@i...)