Subject: RE: plea for help...
From: Mike Ferrando <mikeferrando@xxxxxxxxx>
Date: Thu, 9 Mar 2006 14:31:08 -0800 (PST)
Wendell,
I attended.
It was very well done. A great help for beginners as well as good
insights for those with lots of battle scars.
Thanks,
Mike Ferrando
Library Technician
Library of Congress
Washington, DC
202-707-4454
--- Wendell Piez <wapiez@xxxxxxxxxxxxxxxx> wrote:
> Walter,
>
> At Mulberry we recently gave a seminar on the topic of converting
> HTML to XML, so the issues are fresh in my mind.
>
> You're facing a fairly complex set of problems, but they can be
> simplified (as you are discovering) by distinguishing between
>
> A. The syntactic conversion of HTML to XML
> B. The "semantic" conversion from HTML display-oriented tagging to a
>    stronger form of tagging in XML.
>
> Other contributors have posted links to tools that help you with job
> A -- Tidy and its ilk -- and it appears you've got a handle on that.
> This work can be largely or entirely automated. Of course, what you
> get out the other end is still HTML tagging, albeit in XML syntax
> (it'll be either valid XHTML or a similar XML-compliant HTML), so, as
> you're finding, it's not good to go for everything you might do with
> well-designed XML markup. But to have it XML syntactically is already
> a big step, because you can then use more and better tools on it to
> take it the rest of the way -- including (which is why the question
> isn't totally off topic here) XSLT.
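A minimal sketch of job A, for illustration only. It is not Tidy and does far less than Tidy does; it only shows the kind of repair involved (lowercasing tags, quoting attributes, closing what was left open, escaping stray ampersands). It uses nothing but the Python standard library, and the names WellFormer, VOID and html_to_wellformed are made up for this example. It assumes no duplicate attributes and does not wrap multiple top-level elements in a single root.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr

# Elements that never have content in HTML; emitted as <tag/>.
VOID = {"br", "hr", "img", "input", "meta", "link"}


class WellFormer(HTMLParser):
    """Toy tag-soup repairer: collects well-formed markup in self.out."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []    # output fragments
        self.stack = []  # names of currently open elements

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lowercases tag and attribute names.
        # A valueless attribute (e.g. "checked") gets its own name as value.
        attr_text = "".join(
            f" {name}={quoteattr(value if value is not None else name)}"
            for name, value in attrs
        )
        if tag in VOID:
            self.out.append(f"<{tag}{attr_text}/>")
        else:
            self.out.append(f"<{tag}{attr_text}>")
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray close tag, or the close of a void element
        # Close everything opened since the matching start tag.
        while self.stack:
            open_tag = self.stack.pop()
            self.out.append(f"</{open_tag}>")
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data))  # escapes &, < and >


def html_to_wellformed(html):
    parser = WellFormer()
    parser.feed(html)
    parser.close()
    while parser.stack:  # close whatever is still open at end of input
        parser.out.append(f"</{parser.stack.pop()}>")
    return "".join(parser.out)


print(html_to_wellformed("<P>One<br><b>two</P>"))
```

The result is still HTML tagging, only now in a form an XML parser will accept. For real documents a dedicated tool such as Tidy remains the better choice.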
>
> To do conversion B, however, is an entirely different kettle of fish
> -- and it is beyond the scope of this list, I'm afraid.
>
> As long as I'm already on it, however, I am willing to comment that
> the scope and difficulty of conversion B are directly related both to
> the quality of tagging in your source (HTML can be "clean" or
> "dirty", consistent or messy, even after it's made XML-conformant in
> its syntax) and, most dramatically, to the nature of your target tag
> set and to the feasibility of mapping from the HTML you have to
> this target.
>
> Sometimes this conversion can be automated; sometimes it can be
> mostly automated; often it requires a good measure of attention from
> human beings to determine how things should be converted in any
> given case.
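One way to picture "mostly automated" is a mapping table plus a list of leftovers for a person to look at. The sketch below is only an illustration under stated assumptions: the input is already well-formed and carries no XML namespace (real XHTML would), and the RULES table, the target element names (article, title, para, author) and the "unmapped" placeholder are all invented for the example, not taken from any real tag set.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules: (source tag, class attribute or None) -> target element.
RULES = {
    ("div", None): "article",
    ("h1", None): "title",
    ("p", None): "para",
    ("span", "author"): "author",
}


def convert(source, rules):
    """Return (converted root element, list of (tag, class) pairs with no rule)."""
    unmapped = []

    def walk(elem):
        key = (elem.tag, elem.get("class"))
        # Prefer a class-specific rule, then fall back to a tag-only rule.
        name = rules.get(key) or rules.get((elem.tag, None))
        new = ET.Element(name or "unmapped")
        if name is None:
            unmapped.append(key)       # leave this one for a human to decide
            new.set("was", elem.tag)   # remember what the source element was
        new.text, new.tail = elem.text, elem.tail
        for child in elem:
            new.append(walk(child))
        return new

    return walk(ET.fromstring(source)), unmapped


source = (
    '<div><h1>Report</h1>'
    '<p>By <span class="author">W. T.</span></p>'
    '<font>leftover</font></div>'
)
root, needs_review = convert(source, RULES)
print(ET.tostring(root, encoding="unicode"))
print(needs_review)
```

Everything the table covers is converted mechanically. Whatever lands in needs_review is where the human attention goes, and a long list there is an early sign that either the source tagging or the target design needs another look.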
>
> The design of that target markup, however, is critical; this factor
> alone can make or break your project. There is an infinity of things
> potentially expressible in XML, which a machine, even one programmed
> with very sophisticated heuristics, will not know how to tag
> correctly, even when it's starting with some kind of HTML tagging.
>
> Accordingly, successful efforts at this kind of conversion generally
> include both designing that format up front and controlling its
> design carefully. Design it to concrete requirements, not just to
> what you think might be useful or fun to have some day, and don't be
> over-ambitious. You can't convert to a target you can't see. But if
> you have a design, the places where conversion is easy or difficult
> will fairly quickly come to light and you can figure out how to
> deal with them.
>
> I think earlier someone suggested you prototype this first before
> attempting it. That's very good advice.
>
> There are also professionals who will gladly share their experience
> in this area, if you are in a position to save money over the long
> term by investing it intelligently in the near term.
>
> Good luck,
> Wendell
>
> At 11:52 AM 3/9/2006, you wrote:
>
> >On Wed, March 8, 2006 5:28 pm, Florent Georges wrote:
> > > Walter Torres wrote:
> > >
> > >
> > >> 1) convert HTML into well formed HTML (many are not)
> > >> 2) convert well formed HTML into XHTML
> > >>
> > >
> > > Tidy HTML will give you XHTML from HTML.
> >
> >Yes, just found it late last night. Been playing with it all morning.
> >
> >Getting it to work in PHP5 is what I'm focusing on now.
> >
> >
> > >> 3) convert XHTML into XML
> > >>
> > >
> > > An XHTML instance is already an XML instance.
> >
> >Yes, I understand that.
> >
> >But I'm trying to get this to "pure" XML, with no display-oriented
> >markup whatsoever!
> >
> >The idea here is to have as "raw/naked" a file as possible, so that
> >any system can display it as it sees fit.
> >
> >
> > > If you want to translate the instance from XHTML to another XML
> > > document type, XSLT may be of great help.
> >
> >Sure, that way I can create a look for website A which is different
> >from website B, then produce text-only or RTF output, email text or
> >HTML, or even output for a web-phone.
> >
> >This is why I was asking about how different folks handle this kind
> >of content. What kind of markup it contains, etc.
> >
> >
> > >> 4) create XSLTs to transpose XML back to HTML for page display
> > >
> > > Here again, XSLT may be of great help.
> >
> >Right.
> >
> >Thanks
> >
> >Walter
>
>
>
> ======================================================================
> Wendell Piez                            mailto:wapiez@xxxxxxxxxxxxxxxx
> Mulberry Technologies, Inc.                http://www.mulberrytech.com
> 17 West Jefferson Street                    Direct Phone: 301/315-9635
> Suite 207                                          Phone: 301/315-9631
> Rockville, MD 20850                                  Fax: 301/315-8285
> ----------------------------------------------------------------------
>   Mulberry Technologies: A Consultancy Specializing in SGML and XML
> ======================================================================
>
>