E-book, English, Volume 17, 205 pages
Kruschwitz Intelligent Document Retrieval
1st edition 2006
ISBN: 978-1-4020-3768-9
Publisher: Springer Netherland
Format: PDF
Copy protection: 1 - PDF Watermark
Exploiting Markup Structure
Series: The Information Retrieval Series
Target audience
Professional/practitioner
Authors/Editors
Further information & material
Related Work.- Data Analysis and Domain Model Construction.- Incorporating Additional Knowledge.- A Dialogue System for Partially Structured Data.- UKSearch - Intelligent Web Search.- UKSearch - Evaluation and Discussion.- YPA - Searching Classified Directories.- Future Directions and Conclusions.
6 UKSearch - Intelligent Web Search (p.93-94)
Finding information on the Web is normally a straightforward task. For most user requests the information can be located with a standard search engine using simple pattern matching techniques. However, when the search is restricted to a smaller document collection (one that is still too large to be searched without appropriate tools), it can become a tedious task. Examples of such collections are corporate intranets or university Web sites. Typically a search will return large numbers of matching documents even in smaller document collections. If no matching document can be found, the user is usually left either with a great number of partially matching documents or with no results at all.
These are well-known problems, and more sophisticated search systems exist that try to overcome them (see Chap. 2). But those approaches tend to rely heavily on a given document structure or on expensively created concept hierarchies. While this is appropriate for fairly well-structured domains such as product catalogues and other applications where the information is stored in database formats, it is no help if the document collection is heterogeneous.
Surprisingly perhaps, the problem of not finding any document in the collection for a user query (a form of "data sparsity") is not necessarily a major problem in small domains. The log files of the search engine installed at the University of Essex Web site prove that the majority of queries that users submit result in a large number of matching documents despite the fairly small size of the collection. But unlike in general Web search where scalability issues prevent the application of more sophisticated indexing steps, we can build domain-specific concept hierarchies easily and rapidly in such well-defined document collections using the techniques introduced in the earlier chapters. These automatically created knowledge sources reflect the relations between documents or terms within those documents simply based on the available data.
Apart from that, collections of Web pages are well suited to verify the techniques introduced in this book, as these documents are typically marked up using HTML tags. This type of markup mixes visual markup and semantic representation (as found in the meta tags for example). We turn this implicit knowledge into explicit relations.
The earlier chapters presented the conceptual framework. Here we discuss the practical steps that lead to an explicitly structured representation of a Web document collection. Frequently used HTML tags are used to define markup contexts (the fundamental units to extract concepts which are then arranged in a domain model). The structure imposed on the data collection is employed in a dialogue system which assists the user with handling those queries that do not retrieve documents or result in large numbers of matches.
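The idea of markup contexts can be illustrated with a minimal Python sketch. It is not the book's implementation: the set of tags in `CONTEXT_TAGS`, the tokenizer, and the rule that a term counts as a concept once it occurs in at least two different contexts (`min_contexts=2`) are all assumptions made for this illustration.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

# Assumption: these tags serve as markup contexts for this sketch.
CONTEXT_TAGS = {"title", "h1", "h2", "h3", "b", "strong", "a"}


def tokenize(text):
    """Lower-case the text and split it into simple alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class MarkupContextParser(HTMLParser):
    """Record, for every term, the set of markup contexts it occurs in."""

    def __init__(self):
        super().__init__()
        self.open_tags = []               # context tags currently open
        self.contexts = defaultdict(set)  # term -> set of context names

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attr = dict(attrs)
            # Meta keywords form a context of their own.
            if (attr.get("name") or "").lower() == "keywords":
                for term in tokenize(attr.get("content") or ""):
                    self.contexts[term].add("meta_keywords")
        elif tag in CONTEXT_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Close the most recently opened occurrence of this tag, if any.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i] == tag:
                del self.open_tags[i]
                break

    def handle_data(self, data):
        # Text outside any context tag is ignored.
        for tag in set(self.open_tags):
            for term in tokenize(data):
                self.contexts[term].add(tag)


def extract_concepts(html, min_contexts=2):
    """Return terms found in at least `min_contexts` distinct contexts."""
    parser = MarkupContextParser()
    parser.feed(html)
    parser.close()
    return {
        term: sorted(ctx)
        for term, ctx in parser.contexts.items()
        if len(ctx) >= min_contexts
    }
```

Calling `extract_concepts(page_source)` on a page yields a dictionary that maps each selected term to the contexts it was seen in. A real system would also need stop-word filtering and handling of multi-word terms, both of which this sketch leaves out.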
We will see how the general dialogue manager introduced earlier is set up to work with the data collections discussed in this chapter. We will however not focus on the links between concepts and individual documents or directories. The more interesting aspect is the construction of domain models that are not closely tied to the individual documents, mainly because a separable domain model is more flexible. The reason is that despite the ever-changing nature of a collection of Web documents we will not need to constantly update the model. A domain model that is not linked to the individual documents will still be usable once the document collection has been updated. It can simply be plugged into a search system.
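To make the notion of a pluggable domain model concrete, here is a small hedged sketch in Python. It is not the dialogue manager described in the book: the function name `propose_modifications`, the `max_hits` threshold, and the representation of the model as a plain mapping from a term to related terms are hypothetical. The point it illustrates is that the model contains no document identifiers, so it keeps working when the underlying collection is re-indexed.

```python
def propose_modifications(query, hit_count, model, max_hits=20):
    """Suggest query changes based on the number of hits.

    `model` maps a term to a list of related terms and holds no
    references to documents (assumed structure for this sketch).
    Returns a pair (action, suggested_queries).
    """
    terms = query.lower().split()

    if hit_count == 0:
        if len(terms) > 1:
            # Relax the query by dropping one term at a time.
            options = [
                " ".join(t for j, t in enumerate(terms) if j != i)
                for i in range(len(terms))
            ]
        else:
            # Single-term query: offer related terms as alternatives.
            options = [r for t in terms for r in model.get(t, [])]
        return "relax", options

    if hit_count > max_hits:
        # Refine the query by adding a related term not yet in it.
        related = []
        for t in terms:
            for r in model.get(t, []):
                if r not in terms and r not in related:
                    related.append(r)
        return "refine", [" ".join(terms + [r]) for r in related]

    return "none", []
```

Because the model is an ordinary dictionary, it could be stored separately (for example as JSON) and loaded by whatever search engine supplies the hit counts.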