
  • To: <xml-dev@l...>
  • Subject: off-topic -- search engines
  • From: "Jason Kohls" <jkohls@i...>
  • Date: Tue, 30 Sep 2003 10:20:30 -0400
  • Thread-index: AcOHVJ2AtWFF2BgGRam+jB0IM3ZY7AAAA99A
  • Thread-topic: off-topic -- search engines

Greetings,

I realise this is slightly off-topic.  However:
A) I can't find a search engine mailing list (know of any?)
B) I knew I could count on my knowledgeable XML brothers. :)

Indexing content stored in XML for a content-rich site -- many
articles, many white papers, etc.  Should the "crawler" have access to
the data layer, with rules and exceptions applied much like a
"normal" query, i.e. only crawl the <content> nodes whose "type"
attribute has the value "article"?
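A minimal sketch of what that data-layer rule might look like, assuming a hypothetical XML store where documents hold <content> nodes carrying a "type" attribute (the element and attribute names here are illustrative, not from any real system):

```python
# Data-layer indexing sketch: feed the indexer only the <content>
# nodes typed as articles, so page chrome never enters the index.
# The XML shape below is a made-up example, not a real schema.
import xml.etree.ElementTree as ET

SAMPLE = """
<site>
  <content type="article">
    <title>How to Buy a Search Engine</title>
    <body>Choosing an engine starts with your corpus...</body>
  </content>
  <content type="footer">
    <body>copyright mycompany</body>
  </content>
</site>
"""

def indexable_text(xml_source):
    """Yield the text of only the article-typed <content> nodes,
    skipping footers, navigation, and other non-article content."""
    root = ET.fromstring(xml_source)
    for node in root.findall(".//content[@type='article']"):
        yield " ".join(t.strip() for t in node.itertext() if t.strip())

for text in indexable_text(SAMPLE):
    print(text)
```

With a rule like this, "copyright mycompany" can never come back as a hit, because the footer node is excluded before indexing rather than filtered after rendering.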

Or should it access the content at a much higher level of abstraction,
say through HTTP GET, like GoogleBot or an AltaVista bot?

My concerns are based around granularity, exclusivity, and accuracy --
if an article is rendered on a page with navigation items, footer,
copyright, etc., will it "skew" the results or even worse, actually
return a record for "copyright mycompany"?  What about an article
called "How to Buy a Search Engine", which is linked many, many times
throughout the site?  If I search on "Search Engine", what will the
results return?  All the pages that contain the title text/link?

I realise that these search engines have built-in exceptions, but my
concern is that these operate at a high level (post-HTML rendering),
not at the data layer, where more specific, "limitless" control is
available.
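To illustrate that coarseness: an HTML-level exception typically has to recognise the rendered markup. A rough sketch, assuming (hypothetically) that pages wrap the real article in <div class="article-body"> -- a convention that only holds until the next site redesign, which is exactly the fragility at issue:

```python
# HTML-level filtering sketch: extract text only from inside a
# <div class="article-body"> wrapper, discarding nav and footer.
# The class name is an assumed convention, not a real standard.
from html.parser import HTMLParser

class ArticleExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting depth of divs inside the article div
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == "div":
                self.depth += 1
        elif tag == "div" and ("class", "article-body") in attrs:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == "div":
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())

PAGE = """<html><body>
<div class="nav">Home | Articles</div>
<div class="article-body"><h1>How to Buy a Search Engine</h1>
<p>Start with your corpus size...</p></div>
<div class="footer">copyright mycompany</div>
</body></html>"""

parser = ArticleExtractor()
parser.feed(PAGE)
print(" ".join(parser.chunks))
```

This works only as long as the class name survives; the data-layer rule in the earlier question needs no such guess, which is the trade-off between the two approaches.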

Thanks for humoring me.

Jason Kohls 

The xml-dev list is sponsored by XML.org 
<http://www.xml.org>, an initiative of OASIS 
<http://www.oasis-open.org>

The list archives are at http://lists.xml.org/archives/xml-dev/

To subscribe or unsubscribe from this list use the subscription
manager: <http://lists.xml.org/ob/adm.pl>

