Walter,
At Mulberry we recently gave a seminar on the topic of converting HTML to XML, so the issues are fresh in my mind.

You're facing a fairly complex set of problems, but they can be simplified (as you are discovering) by distinguishing between

A. The syntactic conversion of HTML to XML
B. The "semantic" conversion from HTML display-oriented tagging to a stronger form of tagging in XML.

Other contributors have posted links to tools that help you with job A -- Tidy and its ilk -- and it appears you've got a handle on that. This work can be largely or entirely automated. Of course, what you get out the other end is still HTML tagging, albeit in XML syntax (it'll be either valid XHTML or a similar XML-compliant HTML), so as you're finding, it's not ready for everything you might do with well-designed XML markup. But having it in XML syntax is already a big step, because you can then use more and better tools on it to take it the rest of the way -- including (which is why the question isn't totally off topic here) XSLT.

To do conversion B, however, is an entirely different kettle of fish -- and it is beyond the scope of this list, I'm afraid.

As long as I'm already on it, however, I am willing to comment that the scope and difficulty of conversion B are directly related both to the quality of tagging in your source (HTML can be "clean" or "dirty", consistent or messy, even after it's made XML-conformant in its syntax) and, most dramatically, to the nature of your target tag set and to the feasibility of mapping from the HTML you have to this target. Sometimes this conversion can be automated; sometimes it can be mostly automated; often it requires a good measure of attention from human beings to determine how things should be converted in any given case. The design of that target markup, however, is critical; this factor alone can make or break your project.
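To make the "mostly automated" case concrete, here is a minimal sketch in Python (standard library only) of a rule-driven conversion B. Everything in it is an assumption for illustration: the target element names (doc, section-title, warning, para, term), the mapping rules, and the sample input are all hypothetical, and the input is assumed to be well-formed already (that is, conversion A is done). Elements with no rule are passed through unchanged and collected for a human to look at.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules: (HTML element name, class attribute) -> target element name.
# The target tag set is invented for this sketch; a real one comes from your design.
MAPPING = {
    ("body", None): "doc",
    ("h1", None): "section-title",
    ("p", "warning"): "warning",
    ("p", None): "para",
    ("i", None): "term",
}

def convert(element, unmapped):
    """Rebuild the tree, renaming elements that have a rule.

    Elements without a rule keep their HTML name and are recorded in
    `unmapped` so that a person can decide what they should become.
    """
    key = (element.tag, element.get("class"))
    target = MAPPING.get(key)
    if target is None:
        unmapped.append(key)
        target = element.tag
    new = ET.Element(target)
    new.text = element.text
    new.tail = element.tail
    for child in element:
        new.append(convert(child, unmapped))
    return new

# Sample input, assumed to be well-formed (XML-syntax) HTML.
SOURCE = (
    "<body>"
    "<h1>Setup</h1>"
    '<p class="warning">Back up first.</p>'
    "<p>Run the <i>installer</i>, then <b>reboot</b>.</p>"
    "</body>"
)

def run(source=SOURCE):
    """Convert the source and return (serialized result, unmapped keys)."""
    unmapped = []
    result = convert(ET.fromstring(source), unmapped)
    return ET.tostring(result, encoding="unicode"), unmapped

if __name__ == "__main__":
    xml_out, needs_review = run()
    print(xml_out)
    print("Needs human review:", needs_review)
```

In this sample the b element has no rule, so it comes through untouched and lands on the review list; whether it means emphasis, a key term, or a UI label is the kind of call a mapping table cannot make by itself. The same idea could equally be written as XSLT templates; Python is used here only to keep the sketch self-contained.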
There is an infinity of things potentially expressible in XML, which a machine, even one programmed with very sophisticated heuristics, will not know how to tag correctly, even when it's starting with some kind of HTML tagging.

Accordingly, successful efforts at this kind of conversion generally include both designing that format up front, and controlling its design carefully. Design it to concrete requirements, not just to what you think might be useful or fun to have some day, and don't be over-ambitious.

You can't convert to a target you can't see. But if you have a design, the places where conversion is easy or difficult will fairly quickly come to light and you can figure out how to deal with them.

I think earlier someone suggested you prototype this first before attempting it. That's very good advice. There are also professionals who will gladly share their experience in this area, if you are in a position to save money over the long term by investing it intelligently in the near term.

Good luck,
Wendell

At 11:52 AM 3/9/2006, you wrote:
On Wed, March 8, 2006 5:28 pm, Florent Georges wrote:
> Walter Torres wrote:
>
>> 1) convert HMTL into well formed HTML (many are not)
>> 2) convert well formed HTML into xHTML
>
> Tidy HTML will give you XHTML from HTML.

======================================================================
Wendell Piez                            mailto:wapiez@xxxxxxxxxxxxxxxx
Mulberry Technologies, Inc.                http://www.mulberrytech.com
17 West Jefferson Street                    Direct Phone: 301/315-9635
Suite 207                                          Phone: 301/315-9631
Rockville, MD 20850                                  Fax: 301/315-8285
----------------------------------------------------------------------
  Mulberry Technologies: A Consultancy Specializing in SGML and XML
======================================================================