If You Can Name It, You Can Claim It!
Copyright © 2000 by Arbortext, Inc.
04 Apr 2000
Issue 3
System identifiers suck! The fact that XML
requires me to supply system identifiers for external references, and the fact that these
identifiers are required to be Uniform
Resource Identifiers (URIs), are a frequent source of considerable irritation. In this
column, we'll explore how you can use OASIS Catalog files (or their XML equivalent) to
avoid these difficulties.
Using Catalog files became a lot easier earlier this month
when Arbortext released its Java Catalog classes to the XML community. Using
these classes, it's simple to add Catalog support to your favorite Java parser.
(Equivalent support for parsers in other languages should be fairly easy to construct from
the free, Open Source Java classes, although Arbortext has no immediate plans to
undertake this effort.)
You can download
the classes or view the JavaDoc
API Documentation online. You can also read Arbortext's press
release about the code.
But first, let's consider the scope of the problem.
There are several common ways that the system
identifier problem raises its ugly head:
- I have an XML document that I want to publish on the web
or include in the distribution of some piece of software. On my system, I keep the doctype
of the document in some local directory, so my doctype declaration reads:
<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.0//EN"
"file:///n:/share/doctypes/docbook/xml/docbookx.dtd">
As soon as I distribute this document, I immediately
begin getting error reports from customers who can't read the document because they don't
have DocBook installed at the location identified by the URI in my document. Drat!
- Or I remember to change the URI before I publish the
document:
<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.0//EN"
"http://www.oasis-open.org/docbook/xml/4.0/docbookx.dtd">
And the next time I try to edit the document, I get
errors because I happen to be working on my laptop on a plane somewhere and can't get
to the net. Blast!
- Just as often, I get tripped up this way: I'm working
collaboratively with a colleague. She's created initial drafts of some documents that I'm
supposed to review and edit. So I grab them and find that I can't open or publish them
because I don't have the same network connections she has or I don't have Epic installed
in the same place. And if I change the system identifiers so they work on my system, she
has the same problems when I send them back to her. Drat and blast!
All of this makes me want to pull my hair out
because there's a perfectly good solution for this problem: public identifiers. They're
defined in XML; they just aren't used very effectively because users currently cannot rely
on applications resolving them in an interoperable manner.
Public identifiers provide global, unique names for
entities independent of their storage location.
Despite opinions to the contrary[1], I maintain that names and addresses are distinct. If I
claim that I want version 3.1 of the DocBook DTD, or the 1911 edition of Webster's
dictionary, or Issue 2 of Standard Deviations from Norm, that's what I want,
irrespective of its location on the net (or even whether it's available on the net). While it
is possible to view a URL as a name, I don't think that's the natural interpretation.
There are currently two ways that I might reasonably
assign an address-independent name to an object: public identifiers or Uniform Resource Names (URNs)[2].
Public identifiers are part of XML 1.0. They can occur in any form of external
entity declaration. They allow you to give a globally unique name to any entity. For
example, the XML version of DocBook V4.0 is identified with the following public
identifier:
-//OASIS//DTD DocBook XML V4.0//EN
You'll see this identifier in the two doctype
declarations I used earlier. This identifier gives no indication of where the resource
(the DTD) may be found, but it does uniquely name the resource. That public identifier
now and forever refers to the XML version of DocBook V4.0.
URNs are a form of URI. Like public identifiers, they
give a location-neutral, globally unique name to an entity. For example, OASIS might
choose to identify the XML version of DocBook V4.0 with the following URN:[3]
urn:x-oasis:docbook-xml-v4.0
Like a public identifier, a URN can now and forever
refer to a specific entity in a location-independent manner.
Having extolled the virtues of location-independent
names, I must admit that a name isn't very useful if you can't find the thing it refers
to. In order to do that, you must have a name resolution mechanism that allows you to
determine what resource is referred to by a given name.
One important feature of this mechanism is that it can
allow resources to be distributed, so you don't have to go to http://www.oasis-open.org/docbook/xml/4.0/docbookx.dtd
to get the XML version of DocBook V4.0, if you have a local copy.
There are a few possible resolution mechanisms:
- The application just "knows". Sure, it sounds
a little silly, but this is currently the mechanism being used for namespaces.
Applications know what the semantics of namespaced elements are because they recognize the
namespace URI.
- OASIS Catalog files provide a mechanism for mapping
public and system identifiers, allowing resolution to both local and distributed
resources. This is the resolution scheme we're going to consider for the balance of this
column.
- Many other mechanisms are possible. There are already a
few for URNs, including at least one built on top of DNS, but they aren't widely deployed.
Catalog files are straightforward text files that
describe a mapping from names to addresses. Here's a simple one:
PUBLIC "-//OASIS//DTD DocBook XML V4.0//EN"
"docbook/xml/docbookx.dtd"
SYSTEM "urn:x-oasis:docbook-xml-v4.0"
"docbook/xml/docbookx.dtd"
DELEGATE "-//Arbortext//" "file:///c:/epic/doctypes/catalog"
This file maps both the public identifier and the URN I
mentioned earlier to a local copy of DocBook on my system. If the doctype declaration uses
the public identifier for DocBook, I'll get DocBook regardless of the (possibly
bogus) system identifier! Likewise, my local copy of DocBook will be used if the system
identifier contains the DocBook URN.
The DELEGATE entry instructs the resolver to use the
catalog "c:\epic\doctypes\catalog" for any public identifier that
begins with "-//Arbortext//". The advantage of DELEGATE in this case is that I
don't have to parse that catalog file unless I encounter a public identifier that I
reasonably expect to be in there.
Catalog files are officially defined by OASIS Technical Resolution TR9401, but
for our purposes, the following informal description will suffice[4].
A Catalog is a text file that contains a sequence of
entries. Of the 13 types of entries that are possible, we'll consider only the following
six in this article: BASE, CATALOG, OVERRIDE, DELEGATE, PUBLIC, and SYSTEM:
- BASE uri
- Catalog entries can contain relative URIs. The BASE
entry changes the base URI for subsequent relative URIs. The initial base URI is the URI
of the catalog file.
- CATALOG catalogURI
- Adds the catalog file specified by the catalogURI
to the end of the current catalog. This allows one catalog to refer to another.
- OVERRIDE YES|NO
- The OVERRIDE setting determines whether or not system
identifiers specified in the catalog are to be used in favor of system identifiers
supplied in the document. Suppose you have an entity in your document for which both a
public identifier and a system identifier have been specified, and the catalog only
contains a mapping for the public identifier (e.g., a matching PUBLIC catalog entry). If
OVERRIDE is YES, the system identifier supplied in the matching PUBLIC catalog entry will
be used. If it is NO, the system identifier in the document will be used. (If the catalog
contained a matching SYSTEM catalog entry giving a mapping for the system identifier, that
mapping would have been used, the public identifier would never have been considered, and
the setting of OVERRIDE would have been irrelevant.)
Generally, the purpose of catalogs is to override the system identifiers in XML
documents, so override should be enabled in your catalogs.
- DELEGATE partialPublicId catalogURI
- The DELEGATE entry specifies that public identifiers that
begin with partialPublicId should be resolved using the catalog specified
by the catalogURI. If multiple DELEGATE entries match the public
identifier, they will each be searched, starting with the longest partialPublicId
and continuing to the shortest.
The DELEGATE
entry differs from the CATALOG entry in the following way: alternate catalogs referenced
with a CATALOG entry are parsed and included in the current catalog. Delegated catalogs
are only considered, and consequently only loaded and parsed, if necessary. Delegated
catalogs are also used instead of the current catalog, not as part of the current
catalog.
- PUBLIC publicId systemId
- Maps the public identifier publicId to the
system identifier systemId.
- SYSTEM systemId otherSystemId
- Maps the system identifier systemId to the
alternate system identifier otherSystemId.
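To make the BASE rule concrete, here is a minimal sketch of how a relative URI in a catalog entry can be turned into an absolute one. It uses only java.net.URI from the standard library. The class name BaseUriSketch, the method resolveAgainst, and the file:/opt/dtds/ directory are hypothetical names chosen for illustration; the Arbortext classes may well do this differently.

```java
import java.net.URI;

/** Hypothetical helper showing relative-URI resolution; not part of com.arbortext.catalog. */
final class BaseUriSketch {

    /** Resolves a catalog entry's (possibly relative) URI against the base URI in effect. */
    static String resolveAgainst(String baseUri, String entryUri) {
        return URI.create(baseUri).resolve(entryUri).toString();
    }

    public static void main(String[] args) {
        // Initially the base URI is the URI of the catalog file itself.
        System.out.println(
            resolveAgainst("file:/share/doctypes/catalog", "docbook/xml/docbookx.dtd"));

        // After a (hypothetical) entry BASE "file:/opt/dtds/", the same
        // relative URI is resolved against the new base instead.
        System.out.println(
            resolveAgainst("file:/opt/dtds/", "docbook/xml/docbookx.dtd"));

        // An absolute URI in an entry is unaffected by the base.
        System.out.println(
            resolveAgainst("file:/share/doctypes/catalog", "file:///c:/epic/doctypes/catalog"));
    }
}
```

The first call treats the catalog file's own URI as the base, so the relative entry resolves to a path alongside the catalog. The second shows the effect a BASE entry would have on every relative URI that follows it.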
Catalog resolution occurs in the following order:
- If a SYSTEM entry matches the specified system
identifier, it is used.
- If a PUBLIC entry matches the specified public
identifier and either OVERRIDE is YES or no system identifier is provided, it is used.
- If no exact match was found for the public identifier,
but it matches one or more of the partial public identifiers specified in DELEGATE
entries, the delegated catalogs are searched for a matching public identifier. (Note that
the system identifier is never provided to the delegated catalogs, so a SYSTEM entry in a
delegated catalog that would have matched the system identifier of the entity in question
is never considered.)
- If there's still no match, ENTITY, DOCTYPE, and NOTATION
entries are considered. (These entries aren't discussed in this article, but are fully
described in the technical resolution.)
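The first three steps of that order can be sketched in a few dozen lines of Java. This is only an in-memory model written for this discussion, not the com.arbortext.catalog implementation: the class ResolutionSketch and all of its methods are hypothetical, it returns entry values as plain strings without resolving them against a base URI, it assumes OVERRIDE starts out as NO, and it leaves out the ENTITY, DOCTYPE, and NOTATION step entirely. It also takes the list above literally: an exact PUBLIC match that cannot be used (because OVERRIDE is NO and a system identifier was supplied) does not fall through to delegation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory model of the resolution order; not the com.arbortext.catalog API. */
final class ResolutionSketch {
    private final Map<String, String> systemEntries = new HashMap<>();
    private final Map<String, String> publicEntries = new HashMap<>();
    // OVERRIDE setting that was in effect when each PUBLIC entry was added.
    private final Map<String, Boolean> publicOverride = new HashMap<>();
    private final Map<String, ResolutionSketch> delegates = new HashMap<>();
    private boolean override = false; // assumed default for this sketch

    ResolutionSketch override(boolean yes) {
        override = yes;
        return this;
    }

    ResolutionSketch system(String systemId, String target) {
        systemEntries.put(systemId, target);
        return this;
    }

    ResolutionSketch publicId(String publicId, String target) {
        publicEntries.put(publicId, target);
        publicOverride.put(publicId, override);
        return this;
    }

    ResolutionSketch delegate(String partialPublicId, ResolutionSketch catalog) {
        delegates.put(partialPublicId, catalog);
        return this;
    }

    /** Returns the mapped system identifier, or null to keep the document's own. */
    String resolve(String publicId, String systemId) {
        // 1. A matching SYSTEM entry is always used.
        if (systemId != null && systemEntries.containsKey(systemId)) {
            return systemEntries.get(systemId);
        }
        if (publicId == null) {
            return null;
        }
        // 2. An exact PUBLIC match is used if OVERRIDE was YES
        //    or the document supplied no system identifier.
        if (publicEntries.containsKey(publicId)) {
            boolean usable = publicOverride.get(publicId) || systemId == null;
            return usable ? publicEntries.get(publicId) : null;
        }
        // 3. Otherwise try delegated catalogs, longest partial public identifier first.
        List<String> prefixes = new ArrayList<>();
        for (String prefix : delegates.keySet()) {
            if (publicId.startsWith(prefix)) {
                prefixes.add(prefix);
            }
        }
        prefixes.sort((a, b) -> Integer.compare(b.length(), a.length()));
        for (String prefix : prefixes) {
            // The system identifier is deliberately not passed to the delegated catalog.
            String found = delegates.get(prefix).resolve(publicId, null);
            if (found != null) {
                return found;
            }
        }
        return null;
    }
}
```

Because the delegated catalog is called with no system identifier, a SYSTEM entry inside it can never match, which is exactly the caveat noted in the third step.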
If you work with Java applications using a parser that
supports the SAX Parser interface, adding Catalog support to your applications is
a snap. The SAX Parser interface includes an entityResolver hook
designed to provide an application with an opportunity to do this sort of indirection. The
com.arbortext.catalog package implements the full OASIS Catalog semantics and
provides an appropriate class that implements the SAX EntityResolver interface.
All you have to do is set up a com.arbortext.catalog.CatalogEntityResolver
on your parser's entityResolver hook. The code listing in Example 1
demonstrates how straightforward this is:
Example 1. Adding
a CatalogEntityResolver to Your Parser
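import com.arbortext.catalog.*;
...
// Create the resolver and the catalog it will consult
CatalogEntityResolver cer = new CatalogEntityResolver();
Catalog myCatalog = new Catalog();
// Load the catalogs named on the system catalog path
myCatalog.loadSystemCatalogs();
cer.setCatalog(myCatalog);
...
// Register the resolver on the parser's entityResolver hook
yourParser.setEntityResolver(cer);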
The system catalogs are loaded from the system
catalog path, stored in the System property xml.catalog.files. (For all the gory
details about these classes, consult the
API documentation.) You can explicitly parse your own catalogs (perhaps taken from
command line arguments or a Preferences dialog) instead of or in addition to the system
catalogs:
myCatalog.parseCatalog(catalogFile);
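If you haven't met the entityResolver hook before, the following self-contained sketch shows what a resolver actually does when the parser hands it a public identifier. It deliberately does not use the Arbortext classes; it relies only on the JAXP and SAX classes bundled with current JDKs, and it assumes the parser loads external general entities by default. The class ResolverHookSketch and its method are hypothetical, and a hard-coded string stands in for the file a real catalog lookup would locate.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical demonstration of the SAX entity-resolver hook; no catalog classes involved. */
final class ResolverHookSketch {

    /** Parses a tiny document and returns its text after entity resolution. */
    static String parseWithResolver() {
        String doc =
            "<!DOCTYPE test [\n"
            + "<!ENTITY testpub PUBLIC \"-//Arbortext//TEXT Test Public Identifier//EN\"\n"
            + "                 \"bogus-system-identifier.xml\">\n"
            + "]>\n"
            + "<test>&testpub;</test>";
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();

            // Collect character data so we can see what the entity expanded to.
            StringBuilder text = new StringBuilder();
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void characters(char[] ch, int start, int length) {
                    text.append(ch, start, length);
                }
            });

            // The hook: the parser asks us where an external entity lives.
            reader.setEntityResolver((publicId, systemId) -> {
                if ("-//Arbortext//TEXT Test Public Identifier//EN".equals(publicId)) {
                    // A catalog-backed resolver would open the mapped file here.
                    return new InputSource(new StringReader("resolved by public identifier"));
                }
                return null; // fall back to the system identifier in the document
            });

            reader.parse(new InputSource(new StringReader(doc)));
            return text.toString();
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }
}
```

The bogus system identifier in the entity declaration is never opened, because the resolver answers for the public identifier first. A catalog-backed resolver does the same thing, except that it consults the catalog to decide what to return.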
The Catalog class can also load XML Catalogs. At
present, the only XML Catalog format recognized is John Cowan's XML Catalog format (formerly
XCatalogs). XML Catalogs are indistinguishable from OASIS Catalogs to your application;
all you have to do to enable XML Catalog processing is supply the name of a class that
implements the SAX Parser interface. In Example 2, the Apache XML
Project's Xerces parser is used.
Example 2. Adding
Support for XML Catalogs
import com.arbortext.catalog.*;
...
CatalogEntityResolver cer = new CatalogEntityResolver();
Catalog myCatalog = new Catalog();
myCatalog.setParserClass("org.apache.xerces.parsers.SAXParser"); // support XML Catalogs
myCatalog.loadSystemCatalogs();
cer.setCatalog(myCatalog);
...
yourParser.setEntityResolver(cer);
The Arbortext Catalogs distribution includes two test
programs that you can use to see how this all works. In order to use these programs, you
must have the catalog.jar and catalog-apps.jar files on your CLASSPATH.
The eresolve program also requires a recent version of Xerces on your CLASSPATH.
The README file in the catalog distribution
describes each of the demonstration programs in more detail.
The catalog program takes several catalogs and a
request and displays the system identifier returned by the Catalog.
You can see this program in action in Example 3.
Example 3. Using
the catalog Command
>java catalog -d 0 -c /share/doctypes/catalog PUBLIC "-//OASIS//DTD DocBook XML V4.0//EN"
Ignoring system catalogs.
Set debug to: 0
Adding catalog: /share/doctypes/catalog
Resolving PUBLIC:
Public: -//OASIS//DTD DocBook XML V4.0//EN
System: null
Resolved: file:/share/doctypes/docbook/xml/docbookx.dtd
The second program, eresolve, uses the
CatalogEntityResolver class. A complete test environment is provided in the test
directory:
- catalog
- This is a Catalog with a few simple entries:
OVERRIDE YES
PUBLIC "-//Arbortext//TEXT Test Public Identifier//EN" "testpub.xml"
SYSTEM "urn:x-arbortext:test-system-identifier" "testsys.xml"
OVERRIDE NO
PUBLIC "-//Arbortext//TEXT Test Override//EN" "override.xml"
- test.xml
- This is a test document that contains several external
entities:
<!DOCTYPE test [
<!ENTITY testpub PUBLIC "-//Arbortext//TEXT Test Public Identifier//EN"
"bogus-system-identifier.xml">
<!ENTITY testsys SYSTEM "urn:x-arbortext:test-system-identifier">
<!ENTITY testovr PUBLIC "-//Arbortext//TEXT Test Override//EN"
"testovr.xml">
]>
<test>
&testpub;
&testsys;
&testovr;
</test>
This XML document demonstrates several Catalog
features:
If parsed without a catalog, the parse will fail since bogus-system-identifier.xml
won't be found (and neither would the URN, unless you happen to have some other URN
resolution mechanism running).
If parsed with the included catalog, the following
substitutions will be made:
- &testpub; will be replaced with the contents of
testpub.xml, due to the mapping provided by the first PUBLIC entry in the
catalog.
- &testsys; will be replaced with the contents of
testsys.xml, due to the mapping provided by the SYSTEM entry in the catalog.
- &testovr; will be replaced with the contents of
testovr.xml, due to the system identifier given in its entity declaration; the
mapping provided by the second PUBLIC entry in the catalog is not used because the entity
declaration did provide a system identifier and the matching public identifier occurs
where OVERRIDE is NO.
You can see this process in action in Example 4.
Example 4. Using
the eresolve Command
>java eresolve -d 2 -c test\catalog test\test.xml
Set debug to 2
Adding catalog: test\catalog
Loading catalog: test\catalog
Parsing test\test.xml
Resolved: -//Arbortext//TEXT Test Public Identifier//EN
file:/N:/viewstores/nwalsh_saffron/Epic/src/xml/catalog/test/testpub.xml
Resolved: urn:x-arbortext:test-system-identifier
file:/N:/viewstores/nwalsh_saffron/Epic/src/xml/catalog/test/testsys.xml
Done parsing test\test.xml
This last example demonstrates Catalog resolution in a real
application. The Catalog distribution includes a modified version of the primary driver
from XT, com.arbortext.sax.xsl.Driver. It differs from the com.jclark.sax.xsl.Driver
class only in the addition of Catalog support. You can use it to convert the document in
the test directory to HTML, as shown in Example 4. You must
have the xt.jar and xp.jar files on your CLASSPATH in order to
run this example.
Note that this example uses the system property xml.catalog.files
to set the catalog path because the Driver does not support a command-line option
to specify catalog files.
We hope that these classes become a standard part of
all the major XML Parsers. As XML processors incorporate this
code, users will be able to utilize public identifiers in XML documents with the
confidence that they will be able to move those documents from one system to another and
around the Web knowing that they will also be able to refer to the appropriate external
file or Web page.
Norman
Walsh lives in beautiful, rural western Massachusetts where he hacks XML for fun and
profit. He can name lots of things that he's unable to locate; his car keys, for example.