
  • From: Jack Bush <netbeansfan@y...>
  • To: Michael Kay <mike@s...>, xml-dev@l...
  • Date: Tue, 4 Nov 2008 19:48:01 -0800 (PST)

Hi Michael,
 
Thanks for responding to this question.
 
I have not had any luck with the jdom-interest@j... forum at all since subscribing to it a few months back.
 
In the meantime, can you confirm that it is not possible to use Saxon 6.5.x with JDOM, according to http://www.cafeconleche.org/books/xmljava/chapters/ch16s05.html? Or is it because you are not familiar with JDOM?
 
Could anyone point me to a more useful JDOM forum for assistance with this question?
 
Many thanks,
 
Jack


From: Michael Kay <mike@s...>
To: Jack Bush <netbeansfan@y...>; xml-dev@l...
Sent: Wednesday, 5 November, 2008 12:39:48 AM
Subject: RE: How to parse XML document with default namespace with JDOM XPath

I see no Saxon code here. You are using the XPath engine that comes with JDOM. You might be better off asking on the JDOM list. I have to confess I'm surprised to see you declaring namespaces AFTER compiling the XPath expression, but I can't say I'm familiar with this API.
 
Michael Kay
http://www.saxonica.com/


From: Jack Bush [mailto:netbeansfan@y...]
Sent: 04 November 2008 13:02
To: xml-dev@l...
Subject: How to parse XML document with default namespace with JDOM XPath

Hi All,

 

I am having difficulty parsing a namespaced HTML document using Saxon and the TagSoup parser. The relevant content of this document is as follows:

 

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
……..
</head>
<body>
    <div id="container">
        <div id="content">
            <table class="sresults">
                <tr>
                    <td>
                        <a href="http://www.abc.com/areas" title=" Hollywood , CA "> hollywood </a>
                    </td>
                    <td>
                        <a href="http://www.abc.com/areas" title=" San Jose , CA "> san jose </a>
                    </td>
                    <td>
                        <a href="http://www.abc.com/areas" title=" San Francisco , CA "> san francisco </a>
                    </td>
                    <td>
                        <a href="http://www.abc.com/areas" title=" San Diego , CA "> San diego </a>
                    </td>
                </tr>
……….
</body>
</html>

 

Below is the relevant code snippet illustrating how I have attempted to retrieve the contents (the value of <a>):

 

             import java.io.*;
             import java.util.*;
             import org.jdom.*;
             import org.jdom.input.SAXBuilder;
             import org.jdom.xpath.*;
             import org.saxpath.*;
             import org.ccil.cowan.tagsoup.Parser;

( 1 )       frInHtml = new FileReader("C:\\Tmp\\ABC.html");
( 2 )       brInHtml = new BufferedReader(frInHtml);
( 3 ) //    SAXBuilder saxBuilder = new SAXBuilder("org.apache.xerces.parsers.SAXParser");
( 4 )       SAXBuilder saxBuilder = new SAXBuilder("org.ccil.cowan.tagsoup.Parser");
( 5 )       org.jdom.Document jdomDocument = saxBuilder.build(brInHtml);
( 6 )       XPath xpath =  XPath.newInstance("/ns:html/ns:body/ns:div[@id='container']/ns:div[@id='content']/ns:table[@class='sresults']/ns:tr/ns:td/ns:a");
( 7 )       xpath.addNamespace("ns", "http://www.w3.org/1999/xhtml");
( 8 )       java.util.List list = (java.util.List) (xpath.selectNodes(jdomDocument));
( 9 )       Iterator iterator = list.iterator();
( 10 )     while (iterator.hasNext())
( 11 )     {
( 12 )            Object object = iterator.next();
( 13 ) //         if (object instanceof Element)
( 14 ) //               System.out.println(((Element)object).getTextNormalize());
( 15 )             if (object instanceof Content)
( 16 )                   System.out.println(((Content)object).getValue());
              }

….

 

This program works on the same document without the default namespace, in which case it is not necessary to include the “ns” prefix in the XPath statements (lines 6-7) either. Moreover, I have used “org.apache.xerces.parsers.SAXParser” in the past to successfully retrieve the content of <a> from the same document without the default namespace.
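
A minimal sketch of the behaviour described above, assuming the JDK's built-in javax.xml.xpath API rather than the JDOM XPath class used in this post. In XPath 1.0 an unprefixed name test only matches elements in no namespace, so once the document declares a default namespace the unprefixed path finds nothing. The class name, the count helper and the cut-down sample string are hypothetical and only serve the illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DefaultNamespaceDemo {
    // Hypothetical sample, cut down from the document in the post (no DOCTYPE).
    static final String XHTML =
        "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
        + "<a href=\"http://www.abc.com/areas\"> hollywood </a>"
        + "</body></html>";

    // Parses the sample with namespace awareness on and counts the nodes
    // selected by the given XPath 1.0 expression.
    static int count(String expression) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
                              .parse(new InputSource(new StringReader(XHTML)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
        return nodes.getLength();
    }

    public static void main(String[] args) throws Exception {
        // Unprefixed names look for elements in NO namespace: prints 0.
        System.out.println(count("/html/body/a"));
        // Matching on local-name() ignores the namespace: prints 1.
        System.out.println(count(
            "/*[local-name()='html']/*[local-name()='body']/*[local-name()='a']"));
    }
}
```

The local-name() form is only a workaround for illustration; binding a prefix to the XHTML namespace URI is usually the cleaner route.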

 

I would like to achieve the following objectives if possible:

 

( i ) Exclude the DTD and namespace in order to simplify the parsing process. How could this be done?

( ii ) If this is not possible, how can the namespace be included in the XPath statements (lines 6-7) so that the value of <a> is picked up correctly?

( iii ) Would changing from “org.apache.xerces.parsers.SAXParser” to “org.ccil.cowan.tagsoup.Parser” make any difference as far as using XPath is concerned?

( iv ) Failing to exclude the DTD, how can I change the lookup of a PUBLIC DTD to a local SYSTEM one and include a local DTD for reference?
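
For objectives (i) and (ii), one possible approach is sketched below. It is not JDOM code: it assumes the JDK's own DOM parser and javax.xml.xpath API as a stand-in. It shows two ideas: an EntityResolver that answers every external entity request (including the DTD) with an empty stream, so nothing is fetched, and a NamespaceContext that binds the “ns” prefix to the XHTML namespace. The class name XhtmlLinks, the linkTexts helper and the shortened SAMPLE document are hypothetical. For objective (iv), the same resolver could instead return an InputSource pointing at a local copy of the DTD.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XhtmlLinks {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Hypothetical, shortened version of the document from the post.
    static final String SAMPLE =
        "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" "
        + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">"
        + "<html xmlns=\"" + XHTML_NS + "\"><body><div id=\"container\"><div id=\"content\">"
        + "<table class=\"sresults\"><tr>"
        + "<td><a href=\"http://www.abc.com/areas\"> hollywood </a></td>"
        + "<td><a href=\"http://www.abc.com/areas\"> san jose </a></td>"
        + "</tr></table></div></div></body></html>";

    static List<String> linkTexts(String xhtml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        // Return an empty entity for every external reference so the
        // parser never goes to the network for the DTD.
        builder.setEntityResolver(
            (publicId, systemId) -> new InputSource(new StringReader("")));
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // Bind the "ns" prefix used in the expression to the XHTML namespace.
        xpath.setNamespaceContext(new NamespaceContext() {
            @Override public String getNamespaceURI(String prefix) {
                return "ns".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
            }
            @Override public String getPrefix(String uri) {
                return XHTML_NS.equals(uri) ? "ns" : null;
            }
            @Override public Iterator<String> getPrefixes(String uri) {
                String prefix = getPrefix(uri);
                return (prefix == null
                        ? Collections.<String>emptyList()
                        : Collections.singletonList(prefix)).iterator();
            }
        });

        NodeList nodes = (NodeList) xpath.evaluate(
            "/ns:html/ns:body/ns:div[@id='container']/ns:div[@id='content']"
            + "/ns:table[@class='sresults']/ns:tr/ns:td/ns:a",
            doc, XPathConstants.NODESET);

        List<String> texts = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            texts.add(nodes.item(i).getTextContent().trim());
        }
        return texts;
    }

    public static void main(String[] args) throws Exception {
        // Prints the trimmed text of each matched <a> element.
        System.out.println(linkTexts(SAMPLE));
    }
}
```

Note that with an empty DTD any entity references defined only in the XHTML DTD (such as &nbsp;) would become undefined, so this shortcut only suits documents that do not use them.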

 

I am running JDK 1.6.0_06, NetBeans 6.1, JDOM 1.1, Saxon6-5-5 and TagSoup 1.2 on the Windows XP platform.

 

Any assistance would be appreciated.

 

Thanks in advance,

 

Jack


