Today’s data warehouse systems make it easy for analysts to access integrated data. To achieve this, the data warehouse development team has to process and model the data based on the requirements of the business users. The best approach for developing a data warehouse is an iterative development process
[1]. That means that the functionality of the data warehouse, as requested by the business users, is designed, developed, implemented and deployed in iterations (sometimes called sprints or cycles). In each iteration, more functionality is added to the data warehouse. This contrasts with a “big-bang” approach, where all functionality is developed in one large process and finally deployed as a whole.
However, when executing the project, even with an iterative approach, the effort (and the cost tied to it) to add new functionality usually increases because of existing dependencies that have to be taken care of.
Figure 2.1 shows that the effort to implement the first information mart is relatively low. But when implementing the second information mart, the development team has to maintain the existing solution and take care of existing dependencies, for example, dependencies on data sources integrated for the first information mart or on operational systems consuming information from existing tables. To make sure that this previously built functionality doesn’t break when the new functionality for the second information mart is deployed, the old functionality has to be retested. In many cases, the existing solution needs to be refactored to maintain the functionality of the individual information marts when new sources are added to the overall solution. All these activities increase the effort for creating the second information mart and, equally, any subsequent information mart or other new functionality. This additional effort is depicted as the rise of the graph in
Figure 2.1: once the first information mart is in production, the solution enters a maintenance mode for all existing functionality. Each subsequent project that implements another information mart therefore carries a double burden: adding the new functionality while maintaining the existing functionality, which, because of the dependencies, has to be refactored and retested regularly.
Figure 2.1: The maintenance nightmare [2].
In other words, the extensibility of many data warehouse architectures, including those presented in
Chapter 1, Introduction to Data Warehousing, is not optimal. Furthermore, typical data warehouse architectures often fall short in dimensions of scalability beyond the extensibility just described. We discuss these dimensions in the next section.
2.1. Dimensions of Scalable Data Warehouse Architectures
Business users of data warehouse systems expect to load and prepare more and more data, in terms of variety, volume, and velocity
[3]. Also, the workload put on typical data warehouse environments keeps increasing, especially once the initial version of the warehouse has become a success with its first users. Scalability therefore has multiple dimensions.
2.1.1. Workload
The enterprise data warehouse (EDW) is “by far the largest and most computationally intense business application” in a typical enterprise. EDW systems consist of huge databases containing historical data at volumes ranging from multiple gigabytes to terabytes of storage
[4]. Successful EDW systems face two issues regarding the workload of the system: first, they experience rapidly increasing data volumes and application workloads; second, they must support an increasing number of concurrent users
[5]. In order to meet the performance requirements, EDW systems are implemented on large-scale parallel computers, such as massively parallel processing (MPP) or symmetric multiprocessing (SMP) environments and clusters, running parallel database software. In fact, most medium- to large-size data warehouses could not be implemented without large-scale parallel hardware and parallel database software to support them
[4].
In order to handle the requested workload, more is required than parallel hardware and parallel database software: the logical and physical design of the databases has to be optimized for the expected data volumes [6–8].
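To make this concrete, consider one common physical-design measure: range-partitioning a large fact table by date, so that queries and maintenance jobs touch only the relevant slices. The following is a minimal sketch, assuming PostgreSQL 11+ declarative partitioning and a hypothetical sales fact table; MPP platforms offer comparable mechanisms under different syntax.

```sql
-- Minimal sketch (PostgreSQL 11+ syntax): range-partitioning a large
-- fact table by date. Table and column names are hypothetical.
CREATE TABLE sales (
    sale_id     BIGINT        NOT NULL,
    customer_id BIGINT        NOT NULL,
    sale_date   DATE          NOT NULL,
    amount      NUMERIC(12,2) NOT NULL
) PARTITION BY RANGE (sale_date);

-- One partition per year; queries filtered on sale_date are pruned to
-- the matching partition instead of scanning the whole table.
CREATE TABLE sales_2014 PARTITION OF sales
    FOR VALUES FROM ('2014-01-01') TO ('2015-01-01');
CREATE TABLE sales_2015 PARTITION OF sales
    FOR VALUES FROM ('2015-01-01') TO ('2016-01-01');

-- An index created on the partitioned table cascades to all partitions.
CREATE INDEX idx_sales_customer ON sales (customer_id);
```

Dropping an expired partition is then a cheap metadata operation rather than a long-running DELETE, which is one reason such a design scales with growing data volumes.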
2.1.2. Data Complexity
Another dimension of enterprise data warehouse scalability is data complexity. The following factors contribute to the growth of data complexity
[9]:
• Variety of data: nowadays, enterprise organizations capture more than just traditional (e.g., relational or mainframe) master or transactional data. There is an increasing amount of semi-structured data, for example, emails, e-forms, or HTML and XML files, and unstructured data, such as document collections, social network data, images, video and sound files. Another type of data is sensor- and machine-generated data, which might require specific handling. In many cases, enterprises try to derive structured information from unstructured or semi-structured data to increase the business value of the data. While the files may have a structure, their content doesn’t have one. For example, it is not possible to find the face of a specific person in a video without fully processing all frames of the video and building metadata tags to indicate where faces appear in the content.
• Volume of data: the rate at which companies generate and accumulate new data is increasing. Examples include content from Web sites or social networks, document and email collections, weblog data and machine-generated data. The increased data volume leads to much larger data sets, which can run into hundreds of terabytes or even petabytes and beyond.
• Velocity of data: not only do the variety and volume of data increase; the rate at which data is created is also rising rapidly. One example is data from financial markets such as the stock exchange: such data is generated at very high rates and analyzed immediately in order to respond to changes in the market. Other examples include credit card transaction data for fraud detection and sensor data or data from closed-circuit television (CCTV), which is captured for automated video and image analysis in real-time or near-real-time.
• Veracity (trustworthiness) of data: in order to have confidence in data, it must have strong data governance, lineage traceability, and robust data integration (see the sketch after this list)
[10].
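As a hedged illustration of lineage traceability, a common convention (an assumption for this example, not prescribed by the sources cited above) is to stamp every row with its load time and originating source system. The staging table and column names below are hypothetical.

```sql
-- Minimal sketch: audit columns that make every row traceable to its
-- load run and originating source system (all names are hypothetical).
CREATE TABLE stg_customer (
    customer_key  BIGINT       NOT NULL,
    customer_name VARCHAR(200),
    load_datetime TIMESTAMP    NOT NULL, -- when the row entered the warehouse
    record_source VARCHAR(100) NOT NULL  -- originating system, e.g. 'CRM'
);

-- Each load stamps its batch, so any value can later be traced back to
-- the system and point in time it came from.
INSERT INTO stg_customer (customer_key, customer_name, load_datetime, record_source)
SELECT c.id, c.name, CURRENT_TIMESTAMP, 'CRM'
FROM   crm_customer_extract c; -- hypothetical source extract
```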
2.1.3. Analytical Complexity
Due to the availability of large volumes of data with high velocity and variety, businesses demand different and more complex analytical tasks to produce the insight required to solve their business problems. Some of these analyses require that the data be prepared in a fashion not foreseen by the original data warehouse developers. For example, the data fed into a data mining algorithm often needs different characteristics regarding the variety, volume and velocity of data.
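To illustrate what such unforeseen preparation can look like: many mining algorithms expect one row per analysis subject rather than one row per transaction. The following sketch, reusing the hypothetical sales table from above, flattens transactional data into per-customer features.

```sql
-- Minimal sketch: flattening a transactional table into one row per
-- customer, the input shape many mining algorithms expect.
SELECT customer_id,
       COUNT(*)       AS num_purchases,
       SUM(amount)    AS total_spent,
       AVG(amount)    AS avg_purchase_value,
       MAX(sale_date) AS last_purchase_date
FROM   sales
GROUP  BY customer_id;
```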
Consider the example of retail marketing: campaign accuracy and timeliness need to improve when moving from retail stores to online channels, where more detailed customer insight is required
[11]:
• In order to determine customer segmentation and purchase behavior, the business might need historical analysis and reporting of customer demographics and purchase transactions.
• Cross-sell opportunities can be identified by analyzing market baskets that show products that can be sold together (see the sketch after this list).
• To understand the online behavior of their customers, businesses require click-stream analysis, which can also help to present up-sell offers to the visitors of a Web site.
• Given the high amount of social network data and user-generated content, businesses tap into the data by analyzing product reviews, ratings, likes and dislikes, comments, customer service interactions, and so on.
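As an illustration of the market basket analysis mentioned above, the following hedged sketch counts how often two products occur in the same transaction; the transaction_item table and the support threshold are assumptions for this example. Frequently co-occurring pairs indicate cross-sell candidates.

```sql
-- Minimal sketch of a market basket query: count how often two products
-- appear in the same transaction (table and names are hypothetical).
SELECT a.product_id AS product_a,
       b.product_id AS product_b,
       COUNT(*)     AS baskets_together
FROM   transaction_item a
JOIN   transaction_item b
  ON   a.transaction_id = b.transaction_id
 AND   a.product_id < b.product_id  -- count each unordered pair once
GROUP  BY a.product_id, b.product_id
HAVING COUNT(*) >= 100              -- arbitrary minimum-support threshold
ORDER  BY baskets_together DESC;
```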
These examples should make it clear that, in order to solve such new and complex analytical tasks, data sources of varying complexity are required. Also, mixing structured and unstructured data is becoming more and more common
[11].
2.1.4. Query Complexity
For the storage and management of warehouse data, a relational database management system (RDBMS) is the natural choice for business intelligence (BI) vendors. Relational databases provide simple data structures and high-level, set-oriented languages that make...