E-book, English, 114 pages
Gilula Structured Search for Big Data
1st edition, 2015
ISBN: 978-0-12-804652-4
Publisher: Elsevier Science & Techn.
Format: EPUB
Copy protection: Adobe DRM
From Keywords to Key-objects
The WWW era made billions of people dramatically dependent on the progress of data technologies, of which Internet search and Big Data are arguably the most notable. The Structured Search paradigm connects them via the fundamental concept of key-objects, which evolve out of keywords as the units of search. The key-object data model and KeySQL revamp the data independence principle, making it applicable to Big Data, and complement NoSQL with full-blown structured querying functionality. The ultimate goal is extracting Big Information from Big Data.

As a Big Data consultant, Mikhail Gilula combines an academic background with 20 years of industry experience in database and data warehousing technologies, working as a Sr. Data Architect for Teradata, Alcatel-Lucent, and PayPal, among others. He has authored three books, including The Set Model for Database and Information Systems, and holds four US patents in Structured Search and Data Integration.

- Conceptualizes structured search as a technology for querying multiple data sources in an independent and scalable manner
- Explains how NoSQL and KeySQL complement each other and serve different needs with respect to big data
- Shows the place of structured search in the evolution of the Internet and describes its implementations, including real-time structured Internet search
Mikhail Gilula has over 20 years of experience in database and data warehousing technologies. He has authored 3 books on the subject, including The Set Model for Database and Information Systems, published by Addison-Wesley and ACM Press, and holds 4 US patents in Data Integration and Structured Search. Mikhail's industry experience includes working as a Sr. Data Architect for PayPal, Alcatel-Lucent, and Teradata, among others.
Chapter 1 Introduction to Structured Search
Abstract
This chapter compares, side by side, the features of keyword search (information retrieval) and database search. Structured search is conceptualized as a technology for querying multiple data sources in an independent and scalable manner. It occupies the middle ground between keyword search and database search. As in the keyword search paradigm, query originators do not need to know the structure or the number of data sources being queried. As in the database paradigm, users can pose precise queries, control the output order, access data in real time, and manage data security.

Keywords
keyword search; information retrieval; e-commerce; data security; query independence; query scalability

It is contrary to reason to say that there is a vacuum or space in which there is absolutely nothing.
Rene Descartes (Principia Philosophiae, 1644)

1.1. Limitations of Keyword Search
Contemporary search engines operate within the information retrieval (IR) paradigm, where the search criteria consist of keywords and the search results are lists of web pages or, more generally, lists of documents (texts) that include the specified combinations of keywords. IR existed in different forms long before the introduction of computers, and its limitations motivated the research into the query concept that resulted in database languages like SQL. For example, in the 1960s and 1970s it was popular to talk about "factographic" systems, which would enable searching for information or facts per se, as opposed to searching for documents, such as books, patents, or articles, that may or may not contain the relevant information. The new incarnation of IR came with the Internet and was advanced by the Internet search providers. The main limitations of keyword search are as follows.

Intrinsic search imprecision. With keywords alone, it is generally difficult to determine the real question in the mind of the query originator, because the same keywords may be used to pose different questions. Also, narrowing down the search by adding more keywords increases the risk of missing relevant information.

Search results only for humans. Since the results of a keyword search are typically documents conveying information in natural language, it is not easy to process them programmatically, that is, without involving a human recipient. Of course, web pages are always somewhat structured and sometimes contain quite structured information, but the structure of each individual page is not known a priori, and the difficulties of processing natural language programmatically always remain.

No user control over output order. The ordering of search results is controlled by the search engine and is a valuable trade secret. Some e-commerce websites allow users to sort search results by the price of merchandise. However, since the results are produced using keywords, users often need to look through most of the returned items anyway. For example, when a user of a big Internet marketplace currently specifies a model of a digital camera to search for and chooses the "Price: lowest first" option, the first couple of hundred items in the output are not listings of the camera but camera accessories, because they tend to be cheaper.

No security control. To index a document or a web page, search engines need full access to the source. In this context, security of information, or of parts of information, has no place or meaning.

No real-time access. Processing web pages and updating indexes takes time. It can be days or weeks before updated web pages appear in search results. Information can become stale or disappear completely during this period.

Search engines are not green. Due to the imprecision of keyword search, most information returned by search engines is never viewed or consumed by users. This means that CPU and IO cycles, network traffic, and energy are wasted in data centers.

1.2. Keyword Search in E-Commerce
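The digital camera example from Section 1.1 can be reproduced with a toy keyword search. The following is a minimal sketch rather than a description of how any real marketplace works: the listings, the prices, and the helper names `keyword_search` and `cheapest_first` are all invented for illustration.

```python
# Hypothetical listings; titles and prices are invented for illustration.
listings = [
    {"title": "Acme X100 digital camera", "price": 349.00},
    {"title": "Lens cap for Acme X100 camera", "price": 4.99},
    {"title": "Acme X100 camera battery", "price": 12.50},
    {"title": "Carrying case for Acme X100 digital camera", "price": 19.95},
    {"title": "Zenit Z5 digital camera", "price": 289.00},
]


def keyword_search(items, query):
    """Return the items whose title contains every query keyword."""
    keywords = query.lower().split()
    return [
        item for item in items
        if all(k in item["title"].lower().split() for k in keywords)
    ]


def cheapest_first(items):
    """Sort matching items by ascending price ("Price: lowest first")."""
    return sorted(items, key=lambda item: item["price"])


results = cheapest_first(keyword_search(listings, "Acme X100 camera"))
for item in results:
    print(f'{item["price"]:>8.2f}  {item["title"]}')
```

Every accessory title matches the same keywords as the camera itself, so sorting by price pushes the camera to the end of the list. Keywords alone cannot express that the item sought is a camera rather than something made for one.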
One of the areas underserved by keyword search is e-commerce. For example, there is no general way to search for all digital cameras with an optical zoom of more than 10, more than 10 megapixels, a weight of less than 10 oz, and so on. The basic problems of locating merchandise using keyword search are as follows.

• Merchandise cannot be found directly by its specifications; buyers must instead search by keywords such as brand or model, which are needed to retrieve the product specifications.
• Research of complex items may take hours and still does not guarantee the best deals. It would be vastly more efficient to search by multiple item characteristics at once instead of going back and forth through dozens or hundreds of descriptions in order to compare them by several parameters.
• The search output rankings are generally unrelated to the qualities of the merchandise (i.e., its specifications) or to the deals offered. Since search results tend to be voluminous, high search ranks are critical for merchants. Keyword search puts buyers at a disadvantage because they are only able to look through the first few pages of the output, while a better deal may be on the next page, which they did not reach.

To alleviate these problems, e-merchants use the following main techniques.

• Improving product search rankings by implementing a variety of learning algorithms aimed at extracting more information from the natural language search inputs, in particular by analyzing the shopping behavior of the users.
• Classifying the merchandise into search categories to minimize search outputs.

However, these techniques bring difficulties of their own and do not eliminate the aforementioned problems. In particular, classifiers require individual processing of each item description to assign it to classifier categories, which vary from store to store. If the categories change, the items need to be reprocessed. The classifiers are fixed and work only via an equality predicate. It is generally difficult to negate a feature, for example to say that one needs a printer with no duplex mode, or to specify any of the infinitely many conditions that are not part of the classifier at hand.

As a result, customers spend millions of hours every year trying to locate the right merchandise or services and to research and compare them in order to get the best deals. Another problem is the time it takes to sort through the voluminous responses generated by a keyword search.

1.3. Limitations of Database Search
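For orientation, here is a minimal sketch of the database paradigm applied to the camera search from Section 1.2, using Python's built-in `sqlite3` module. The table name `cameras`, the column names, the units, and the sample rows are all assumptions invented for this illustration.

```python
import sqlite3

# In-memory database with a hypothetical product table.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE cameras (
        model        TEXT,
        optical_zoom REAL,   -- magnification factor
        megapixels   REAL,
        weight_oz    REAL,   -- weight in ounces
        price_usd    REAL
    )
""")
conn.executemany(
    "INSERT INTO cameras VALUES (?, ?, ?, ?, ?)",
    [
        ("Acme X100", 12.0, 16.0, 8.5, 349.00),
        ("Acme X50", 5.0, 12.0, 6.0, 199.00),
        ("Zenit Z5", 20.0, 20.0, 9.5, 289.00),
        ("Zenit Z9", 30.0, 24.0, 14.0, 499.00),
    ],
)

# Precise predicates on typed fields, with the output order chosen by the user.
rows = conn.execute("""
    SELECT model, price_usd
    FROM cameras
    WHERE optical_zoom > 10 AND megapixels > 10 AND weight_oz < 10
    ORDER BY price_usd
""").fetchall()
print(rows)
```

The result is exact and ordered by price, but the query works only because its author already knows that the table is called `cameras`, that the zoom column is `optical_zoom`, and that weight is stored in ounces. A second store with different names or units would need its own query.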
The traditional alternative to IR is the database paradigm, where the search criteria are formulated using a rich set of predicates and are evaluated on collections of structured records made up of typed fields, such as numeric or character ones. Results of a database search are always precise, can be ordered by users, and can be processed programmatically, since the semantics of each field are known a priori.

However, unlike keyword search, a database is not that easy to query. The query originator needs to know table names, column names, possibly the units of measurement used in the tables, codes for certain values, and so on. Search scalability problems arise when multiple databases or structured stores need to be accessed and the search results need to be combined. In the database world, adding data sources is much more complex than in the world of keyword search, where it is completely transparent; probably thousands of new data sources (web pages) start participating in Internet searches every day.

Table 1.1 illustrates the relative advantages and limitations of the two traditional query paradigms.

Table 1.1 Keyword Search Versus Database Search

Features                                              Keyword Search   Database Search
Queries are independent from data sources             Yes              No
Search is scalable – new data sources easily added    Yes              No
Search precision not affected by scale                No               Yes
Search output not only for humans                     No               Yes
Users can control output order                        No               Yes
Security control                                      No               Yes
Real-time access                                      No               Yes

1.4. What is Structured Search?
Structured search is a technology for querying multiple data sources in an independent and scalable manner. It occupies the middle ground between keyword search and database search. As in the keyword search paradigm, query originators need not know the structure or the number of data sources being queried. As in the database paradigm, users can formulate precise structured queries, control the output order, and access information in real time. The goal is to achieve the best of both worlds, as shown in Table 1.2.

Table 1.2 Structured Search Versus Keyword...