Subject: Re: Ignoring HTML
From: "Sam D. Chuparkoff" <sdc@xxxxxxxxxx>
Date: Fri, 17 Jun 2005 13:39:59 -0700
On the dangerous side, I'd try something like:
perl -ne '$c.=$_;eof&&($c=~s/&lt;(([^<>](?!&lt;))*?)&gt;//sg&print$c);' foo.xml
Because it will probably be fine. For extra danger points, you can put
it in a Makefile with no comment.
You should be able to do something similar with xsl, but of course this
isn't very safe, and I think it would be a lot more complicated.
s/&lt;(([^<>](?!&lt;))*?)&gt;//sg;
This is '&lt;' some text '&gt;' with no intervening '&lt;', '<', or '>'
replaced with nothing. I thought about actually trying to turn this
content into xml, but note there's no close quote on that style
attribute! Watch out!
sdc
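
For anyone who would rather not decode the one-liner, here is a rough
Python sketch of the same substitution. It assumes the markup is stored
as the escaped entities &lt; and &gt; (as in the sample quoted below);
the names ESCAPED_TAG and strip_escaped_tags are made up for
illustration, and it carries the same "probably fine" caveat as the
Perl version.

```python
import re

# An escaped "&lt;", then the shortest run of characters that are neither a
# literal "<" nor ">" and are not immediately followed by another "&lt;",
# then an escaped "&gt;". DOTALL mirrors Perl's /s flag.
ESCAPED_TAG = re.compile(r"&lt;(?:[^<>](?!&lt;))*?&gt;", re.DOTALL)


def strip_escaped_tags(text: str) -> str:
    """Delete every escaped tag; real XML tags are left alone."""
    return ESCAPED_TAG.sub("", text)


sample = (
    '<title>&lt;SPAN style="font-size: 13pt; font-family: Verdana; &gt;The\n'
    "&lt;b&gt;Text&lt;/b&gt; That I Want&lt;/SPAN&gt;</title>"
)
print(strip_escaped_tags(sample))
```

The unclosed style attribute does no harm here because the pattern
never looks inside the escaped tag; it only needs the closing &gt;.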
On Fri, 2005-06-17 at 15:13 -0500, Jon Gorman wrote:
> On 6/17/05, Jay Burgess <lists@xxxxxxxxxxx> wrote:
> > I apologize if this is in the FAQ, but I've searched and can't find it. (I'm
> > kind of new to XSL, so I may just have not seen it.)
>
> This is a faq of sorts, but I had a little bit of a difficult time
> finding an answer to it in Dave Pawson's FAQ as well. Of course, I
> just did a quick glance. I'd recommend skimming the CDATA section
> as well.
>
> >
> > I've got some XML that contains HTML-formatted text. For example:
> >
> > <title>&lt;SPAN style="font-size: 13pt; font-family: Verdana; &gt;The
> > &lt;b&gt;Text&lt;/b&gt; That I Want&lt;/SPAN&gt;</title>
> >
>
> "HTML-formatted text" is a little bit nonsensical. HTML itself says
> that &lt; is meant as a stand-in for <, so when you have it it's not a
> tag. Since namespaces were rather slow to get off the ground, we ended
> up seeing people put so-called "HTML" in XML *cough* RSS *cough*. But
> to any XML application, this is one big chunk of text.
>
> So, some possible advice:
>
> 1) if you can change the input format so that it uses namespaces and
> actually embeds real XHTML into the documents you're creating, do so.
> Or at least have it be an option.
>
> 2) If you can't do that, I'm sure you can find a more general solution
> if you hunt through the archives. The essential solution will
> probably be along the lines of looking for &lt; and &gt;s and throwing
> any text in them out via some of the XPath/XSLT string functions.
> Might be much easier with XSLT 2.0.
>
> 3) It may be possible with a combination of d-o-e and doing multiple
> transformations, regex scripting or other techniques to replace the
> various &lt; and &gt; in certain elements but not others, then
> reprocess that document through your final stylesheet. Of course, this
> makes it slightly dangerous.
>
> Dig through the archives; there might be a more general solution
> already done or someone else will be able to give you one instead of
> just giving you some ranting. (I blame Friday afternoon and a slow
> server for my current long-winded explanation of why this type of
> embedding is evil).
>
> Short answer, it's probably not difficult as long as it's relatively
> straightforward. If the "html" inside the xml is complex at all or
> you are using &lt; in other places, you might have difficulty.
>
> Extremely simple if you can just have the input source use namespaces
> and you're comfortable with how XSLT deals with namespaces.
>
> Jon Gorman
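
Jon's second suggestion (look for the &lt; and &gt; pairs and throw
away whatever sits between them, using only string functions) can be
sketched as follows. Python's str.find and slicing stand in here for
XPath's substring-before() and substring-after(); the function name
drop_escaped_markup and its parameters are hypothetical, and this is an
illustration of the idea rather than a stylesheet anyone posted.

```python
def drop_escaped_markup(text: str,
                        open_tok: str = "&lt;",
                        close_tok: str = "&gt;") -> str:
    """Remove every open_tok ... close_tok span using only find and slicing."""
    kept = []
    pos = 0
    while True:
        start = text.find(open_tok, pos)
        if start == -1:
            # No more escaped openers: keep the remainder as-is.
            kept.append(text[pos:])
            break
        end = text.find(close_tok, start + len(open_tok))
        if end == -1:
            # Unterminated opener: keep the rest untouched rather than guess.
            kept.append(text[pos:])
            break
        kept.append(text[pos:start])   # text before the escaped tag
        pos = end + len(close_tok)     # resume after the escaped tag
    return "".join(kept)


print(drop_escaped_markup("&lt;b&gt;Text&lt;/b&gt; That I Want"))
```

This also shows why the approach is fragile: a &lt; that is meant as a
less-than sign is treated as the start of a tag whenever a &gt; turns
up later in the same text.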