Today’s data warehouse systems make it easy for analysts to access integrated data. To achieve this, the data warehouse development team has to process and model the data based on the requirements of the business users. The best approach for developing a data warehouse is an iterative development process
[1]. That means that the functionality of the data warehouse, as requested by the business users, is designed, developed, implemented and deployed in iterations (sometimes called sprints or cycles). In each iteration, more functionality is added to the data warehouse. This contrasts with a “big-bang” approach, where all functionality is developed in one large process and finally deployed as a whole.
However, when executing the project, even with an iterative approach, the effort (and the cost tied to it) to add new functionality usually increases because of existing dependencies that have to be taken care of.
Figure 2.1 shows that the effort to implement the first information mart is relatively low. But when implementing the second information mart, the development team has to maintain the existing solution and take care of existing dependencies, for example, dependencies on data sources integrated for the first information mart or on operational systems consuming information from existing tables. To make sure that this previously built functionality doesn’t break when the new functionality for the second information mart is deployed, the old functionality has to be retested. In many cases, the existing solution needs to be refactored to maintain the functionality of the individual information marts when new sources are added to the overall solution. All these activities increase the effort for creating the second information mart and, equally, any subsequent information mart or other new functionality. This additional effort is depicted as the rise of the graph in
Figure 2.1: once the first information mart is in production, the solution enters a maintenance mode for all existing functionality. Each subsequent project that implements another information mart therefore carries a double burden: adding the new functionality while maintaining the existing functionality, which, because of the dependencies, has to be refactored and retested regularly.
Figure 2.1: The maintenance nightmare [2].
In other words, the extensibility of many data warehouse architectures, including those presented in
Chapter 1, Introduction to Data Warehousing, is not optimal. Furthermore, typical data warehouse architectures often fall short in dimensions of scalability beyond the extensibility just described. We discuss these dimensions in the next section.
2.1. Dimensions of Scalable Data Warehouse Architectures
Business users of data warehouse systems expect to load and prepare more and more data, in terms of variety, volume, and velocity
[3]. Also, the workload put on typical data warehouse environments keeps increasing, especially once the initial version of the warehouse has become a success with its first users. Scalability therefore has multiple dimensions.
2.1.1. Workload
The enterprise data warehouse (EDW) is “by far the largest and most computationally intense business application” in a typical enterprise. EDW systems consist of huge databases containing historical data at volumes ranging from multiple gigabytes to terabytes of storage
[4]. Successful EDW systems face two issues regarding the workload of the system: first, they experience rapidly increasing data volumes and application workloads; second, they must support an increasing number of concurrent users
[5]. In order to meet the performance requirements, EDW systems are implemented on large-scale parallel computers, such as massively parallel processing (MPP) or symmetric multiprocessing (SMP) environments and clusters, running parallel database software. In fact, most medium- to large-size data warehouses could not be implemented without large-scale parallel hardware and parallel database software to support them
[4].
In order to handle the requested workload, more is required than parallel hardware and parallel database software: the logical and physical design of the databases has to be optimized for the expected data volumes [6–8].
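To make this concrete, consider one common physical-design measure: range-partitioning a large fact table by date, so that queries and maintenance jobs touch only the relevant slices. The following is a minimal sketch, assuming PostgreSQL 11+ declarative partitioning and a hypothetical sales fact table; MPP platforms offer comparable mechanisms under different syntax.

```sql
-- Minimal sketch (PostgreSQL 11+ syntax): range-partitioning a large
-- fact table by date. Table and column names are hypothetical.
CREATE TABLE sales (
    sale_id     BIGINT        NOT NULL,
    customer_id BIGINT        NOT NULL,
    sale_date   DATE          NOT NULL,
    amount      NUMERIC(12,2) NOT NULL
) PARTITION BY RANGE (sale_date);

-- One partition per year; queries filtered on sale_date are pruned to
-- the matching partition instead of scanning the whole table.
CREATE TABLE sales_2014 PARTITION OF sales
    FOR VALUES FROM ('2014-01-01') TO ('2015-01-01');
CREATE TABLE sales_2015 PARTITION OF sales
    FOR VALUES FROM ('2015-01-01') TO ('2016-01-01');

-- An index created on the partitioned table cascades to all partitions.
CREATE INDEX idx_sales_customer ON sales (customer_id);
```

Dropping an expired partition is then a cheap metadata operation rather than a long-running DELETE, which is one reason such a design scales with growing data volumes.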
2.1.2. Data Complexity
Another dimension of enterprise data warehouse scalability is data complexity. The following factors contribute to the growth of data complexity
[9]:
• Variety of data: nowadays, enterprise organizations capture more than just traditional (e.g., relational or mainframe) master or transactional data. There is an increasing amount of semi-structured data, for example, emails, e-forms, or HTML and XML files, and unstructured data, such as document collections, social network data, images, video and sound files. Another type of data is sensor- and machine-generated data, which might require specific handling. In many cases, enterprises try to derive structured information from unstructured or semi-structured data to increase the business value of the data. While the files may have a structure, their content doesn’t have one. For example, it is not possible to find the face of a specific person in a video without fully processing all frames of the video and building metadata tags to indicate where faces appear in the content.
• Volume of data: the rate at which companies generate and accumulate new data is increasing. Examples include content from Web sites or social networks, document and email collections, weblog data and machine-generated data. The increased data volume leads to much larger data sets, which can run into hundreds of terabytes or even petabytes and beyond.
• Velocity of data: not only do the variety and volume of data increase; the rate at which data is created is also rising rapidly. One example is data from financial markets such as the stock exchange: such data is generated at very high rates and analyzed immediately in order to respond to changes in the market. Other examples include credit card transaction data for fraud detection and sensor data or data from closed-circuit television (CCTV), which is captured for automated video and image analysis in real-time or near-real-time.
• Veracity (trustworthiness) of data: in order to have confidence in data, it must have strong data governance, lineage traceability, and robust data integration (see the sketch after this list)
[10].
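As a hedged illustration of lineage traceability, a common convention (an assumption for this example, not prescribed by the sources cited above) is to stamp every row with its load time and originating source system. The staging table and column names below are hypothetical.

```sql
-- Minimal sketch: audit columns that make every row traceable to its
-- load run and originating source system (all names are hypothetical).
CREATE TABLE stg_customer (
    customer_key  BIGINT       NOT NULL,
    customer_name VARCHAR(200),
    load_datetime TIMESTAMP    NOT NULL, -- when the row entered the warehouse
    record_source VARCHAR(100) NOT NULL  -- originating system, e.g. 'CRM'
);

-- Each load stamps its batch, so any value can later be traced back to
-- the system and point in time it came from.
INSERT INTO stg_customer (customer_key, customer_name, load_datetime, record_source)
SELECT c.id, c.name, CURRENT_TIMESTAMP, 'CRM'
FROM   crm_customer_extract c; -- hypothetical source extract
```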
2.1.3. Analytical Complexity
Due to the availability of large volumes of data with high velocity and variety, businesses demand different and more complex analytical tasks to produce the insight required to solve their business problems. Some of these analyses require that the data be prepared in a fashion not foreseen by the original data warehouse developers. For example, the data fed into a data mining algorithm often needs different characteristics regarding the variety, volume and velocity of data.
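To illustrate what such unforeseen preparation can look like: many mining algorithms expect one row per analysis subject rather than one row per transaction. The following sketch, reusing the hypothetical sales table from above, flattens transactional data into per-customer features.

```sql
-- Minimal sketch: flattening a transactional table into one row per
-- customer, the input shape many mining algorithms expect.
SELECT customer_id,
       COUNT(*)       AS num_purchases,
       SUM(amount)    AS total_spent,
       AVG(amount)    AS avg_purchase_value,
       MAX(sale_date) AS last_purchase_date
FROM   sales
GROUP  BY customer_id;
```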
Consider the example of retail marketing: campaign accuracy and timeliness need to improve when moving from retail stores to online channels, where more detailed customer insight is required
[11]:
• In order to determine customer segmentation and purchase behavior, the business might need historical analysis and reporting of customer demographics and purchase transactions.
• Cross-sell opportunities can be identified by analyzing market baskets that show products that can be sold together (see the sketch after this list).
• To understand the online behavior of their customers, businesses require click-stream analysis, which can also help to present up-sell offers to the visitors of a Web site.
• Given the high amount of social network data and user-generated content, businesses tap into the data by analyzing product reviews, ratings, likes and dislikes, comments, customer service interactions, and so on.
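As an illustration of the market basket analysis mentioned above, the following hedged sketch counts how often two products occur in the same transaction; the transaction_item table and the support threshold are assumptions for this example. Frequently co-occurring pairs indicate cross-sell candidates.

```sql
-- Minimal sketch of a market basket query: count how often two products
-- appear in the same transaction (table and names are hypothetical).
SELECT a.product_id AS product_a,
       b.product_id AS product_b,
       COUNT(*)     AS baskets_together
FROM   transaction_item a
JOIN   transaction_item b
  ON   a.transaction_id = b.transaction_id
 AND   a.product_id < b.product_id  -- count each unordered pair once
GROUP  BY a.product_id, b.product_id
HAVING COUNT(*) >= 100              -- arbitrary minimum-support threshold
ORDER  BY baskets_together DESC;
```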
These examples should make it clear that, in order to solve such new and complex analytical tasks, data sources of varying complexity are required. Also, mixing structured and unstructured data is becoming more and more common
[11].
2.1.4. Query Complexity
For the storage and management of warehouse data, a relational database management system (RDBMS) is the natural choice for business intelligence (BI) vendors. Relational databases provide simple data structures and high-level, set-oriented languages that make...