E-book, English, 378 pages
Inmon / Linstedt Data Architecture: A Primer for the Data Scientist
1st edition, 2014
ISBN: 978-0-12-802091-3
Publisher: Elsevier Science & Techn.
Format: EPUB
Copy protection: 6 - ePub Watermark
Big Data, Data Warehouse and Data Vault
Today, the world is trying to create and educate data scientists because of the phenomenon of Big Data, and everyone is looking deeply into this technology. But no one is looking at the larger architectural picture of how Big Data needs to fit within the existing systems (data warehousing systems). Taking a look at the larger picture into which Big Data fits gives the data scientist the necessary context for how the pieces of the puzzle should fit together. Most references on Big Data look at only one tiny part of a much larger whole. Until the data gathered can be put into an existing framework or architecture, it can't be used to its full potential. Data Architecture: A Primer for the Data Scientist addresses the larger architectural picture of how Big Data fits with the existing information infrastructure, an essential topic for the data scientist. Drawing upon years of practical experience and using numerous examples and an easy-to-understand framework, W.H. Inmon and Daniel Linstedt define the importance of data architecture and how it can be used effectively to harness Big Data within existing systems.
You'll be able to:
- Turn textual information into a form that can be analyzed by standard tools
- Make the connection between analytics and Big Data
- Understand how Big Data fits within an existing systems environment
- Conduct analytics on repetitive and non-repetitive data
The book:
- Discusses the value in Big Data that is often overlooked, non-repetitive data, and why there is significant business value in using it
- Shows how to turn textual information into a form that can be analyzed by standard tools
- Explains how Big Data fits within an existing systems environment
- Presents new opportunities that are afforded by the advent of Big Data
- Demystifies the murky waters of repetitive and non-repetitive data in Big Data
Best known as the 'Father of Data Warehousing,' Bill Inmon has become the most prolific and well-known author worldwide in the big data analysis, data warehousing and business intelligence arena. In addition to authoring more than 50 books and 650 articles, Bill has been a monthly columnist with the Business Intelligence Network, EIM Institute and Data Management Review. In 2007, Bill was named by Computerworld as one of the 'Ten IT People Who Mattered in the Last 40 Years' of the computer profession. Having 35 years of experience in database technology and data warehouse design, he is known globally for his seminars on developing data warehouses and information architectures. Bill has been an in-demand keynote speaker for numerous computing associations, industry conferences and trade shows. Bill Inmon also has an extensive entrepreneurial background: He founded Pine Cone Systems, later named Ambeo in 1995, and founded, and took public, Prism Solutions in 1991. Bill consults with a large number of Fortune 1000 clients and leading IT executives on Data Warehousing, Business Intelligence, and Database Management, offering data warehouse design and database management services, as well as producing methodologies and technologies that advance the enterprise architectures of large and small organizations worldwide. He has worked for American Management Systems and Coopers & Lybrand. Bill received his Bachelor of Science degree in Mathematics from Yale University, and his Master of Science degree in Computer Science from New Mexico State University.
Authors/Editors
Further Information & Material
Front Cover
The Shanidar Neandertals
Copyright Page
Table of Contents
Dedication
Figures
Tables
Preface
Acknowledgments
Chapter 1. Introduction
Chapter 2. Shanidar Cave and the Discovery of the Shanidar Neandertals
  The Site of Shanidar Cave
  History of Excavations
  The Neandertal Partial Skeletons
Chapter 3. Morphometric Considerations
Chapter 4. Age and Sex of the Shanidar Neandertals
  Age
  Sex
  Summary
Chapter 5. The Cranial and Mandibular Remains
  Shanidar 1
  Shanidar 2
  Shanidar 4
  Shanidar 5
  Shanidar 6
  Shanidar 8
  Artificial Deformation of the Shanidar 1 and 5 Crania
  Summary of the Shanidar Skull Morphology
Chapter 6. The Dental Remains
  Shanidar 1
  Shanidar 2
  Shanidar 3
  Shanidar 4
  Shanidar 5
  Shanidar 6
  Anterior Dental Remains
  Posterior Dental Remains
  Taurodontism
  Summary
Chapter 7. The Axial Skeleton
  Cervical Vertebrae
  Thoracic Vertebrae
  Lumbar Vertebrae
  Sacrum
  Coccygeal Vertebra
  Ribs
  Sternum
  Summary
Chapter 8. The Upper Limb Remains
  Clavicles
  Scapulae
  Humeri
  Ulnae
  Radii
  Hand Remains
  Summary
Chapter 9. The Lower Limb Remains
  Innominate Bones
  Femora
  Patellae
  Tibiae
  Fibulae
  Foot Remains
  Summary
Chapter 10. The Immature Remains
  Cranial Remains
  Dentition
  Axial Skeleton
  Upper Limb Remains
  Lower Limb Remains
  Summary
Chapter 11. Bodily Proportions and the Estimation of Stature
  Bodily Proportions
  Estimation of Stature
Chapter 12. The Paleopathology of the Shanidar Neandertals
  Shanidar 1
  Shanidar 2
  Shanidar 3
  Shanidar 4
  Shanidar 5
  Shanidar 6
  Shanidar 8
  Summary
Chapter 13. Significant Aspects of the Shanidar Neandertals
  The Shanidar Sample
  The Shanidar Fossils as Neandertals
  Evolutionary Trends in the Shanidar Sample
  The Shanidar Neandertals as Near Eastern Fossil Hominids
  Behavioral Implications of the Shanidar Neandertals
  Conclusion
Chapter 14. Some Thoughts on the Evolution of the Neandertals
  Historical Background
  Phylogenetic Relationships
  Neandertal Behavior
  Conclusion
References
Index
1.1 Corporate Data
Abstract
Corporate data includes all data found in the corporation. The most basic division of corporate data is into structured data and unstructured data. As a rule there is much more unstructured data than structured data. Unstructured data has two basic divisions: repetitive data and nonrepetitive data. Big Data is made up of unstructured data. Nonrepetitive Big Data has a fundamentally different form than repetitive Big Data. In fact the differences between nonrepetitive Big Data and repetitive Big Data are so large that they can be called the boundaries of the “great divide.” The divide is so large that many professionals are not even aware that it exists. As a rule nonrepetitive Big Data has much greater business value than repetitive Big Data.
Keywords
Big Data; business value; corporate data; great divide of data; nonrepetitive data; repetitive data; structured data; unstructured data
In today’s world it is easy to get lost when dealing with data. There are many different types of data, and each type has its own peculiarities and idiosyncrasies. Products, vendors, and applications become so focused on their own specific world that the larger picture of how things fit together often gets lost. It is often useful to step back and look at the larger picture to gain a proper perspective.
The Totality of Data Across the Corporation
Consider the totality of data found in the corporation. A simplistic depiction of that totality is seen in Figure 1.1.1.
Figure 1.1.1
The totality of data represented here includes everything to do with data of any kind found in the corporation. There are many ways to subdivide it. One such way (but hardly the only way) is to divide the totality of data into structured data and unstructured data, as seen in Figure 1.1.2.
Figure 1.1.2
Structured data is data that has a predictable and regularly occurring format. Typically structured data is managed by a database management system (DBMS) and consists of records, attributes, keys, and indexes. Structured data is well defined, predictable, and managed by an elaborate infrastructure. As a rule most units of data in the structured environment can be located very quickly and easily.
Unstructured data, conversely, is data that is unpredictable and has no structure that is recognizable to a computer. As a rule, unstructured data is clumsy to access: long strings of data have to be sequentially searched (parsed) in order to find a given unit of data. There are many forms and variations of unstructured data. Perhaps the most commonly occurring form is text, but text is by no means the only form of unstructured data.
Dividing Unstructured Data
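Before subdividing unstructured data, the access contrast described above can be made concrete. The following is a minimal sketch, not the book's own example: a Python dictionary stands in for a keyed DBMS table, and the names `accounts`, `notes`, `find_structured`, and `find_unstructured`, along with all sample values, are hypothetical.

```python
# Structured data: records with named attributes, located through a key.
# (A dict is only a stand-in for a DBMS table with a key and an index.)
accounts = {
    "ACC-1001": {"owner": "A. Jones", "balance": 250.00},
    "ACC-1002": {"owner": "B. Smith", "balance": 990.50},
}

def find_structured(key):
    """Locate a record directly through its key, without scanning."""
    return accounts.get(key)

# Unstructured data: free text with no format a program can rely on.
notes = (
    "Customer called about a late fee. Agent waived it. "
    "Later the customer asked to close account ACC-1002 next month."
)

def find_unstructured(text, term):
    """Scan the text sequentially, word by word, until the term is found.

    Returns the position of the matching word, or -1 if it is absent.
    """
    for position, word in enumerate(text.split()):
        if word.strip(".,") == term:
            return position
    return -1

print(find_structured("ACC-1002"))
print(find_unstructured(notes, "ACC-1002"))
```

The keyed lookup goes straight to the record, while the text search has to walk the string from the beginning, which is the clumsiness of access that the passage attributes to unstructured data.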
Unstructured data can be further divided into two basic forms: repetitive unstructured data and nonrepetitive unstructured data. As is the case with the division of corporate data, there are many ways to subdivide unstructured data, and the method shown here is but one of them. This simple subdivision is shown in Figure 1.1.3.
Figure 1.1.3
Repetitive unstructured data is data that occurs many times, often in the same structure and even in the exact same embodiment. Each record looks exactly the same, or substantially the same, as the previous record. There is no massive and elaborate infrastructure managing the content of repetitive unstructured data.
Nonrepetitive unstructured data is data where the records are substantially different from each other. In general each nonrepetitive record is markedly different from every other record.
The division of data types in the corporation has many different embodiments. Consider the data as shown in Figure 1.1.4.
Figure 1.1.4
Structured data is typically found as a by-product of transactions. Every time a sale is made, every time a withdrawal is made from a bank account, every time someone transacts an ATM activity, and every time a bill is sent, a record of the transaction is made. The record of the transaction ends up as a structured record.
Unstructured repetitive data is quite different. Unstructured repetitive records are typically records of machine interactions, such as the analog verification of product coming off a manufacturing process or the metering of energy usage by a consumer. Consider metering. The records created from metered readings repeat heavily in both form and substance.
Unstructured nonrepetitive information is fundamentally different from unstructured repetitive records.
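One rough way to see this difference is to reduce each record to an outline and count how often the most common outline recurs. The sketch below is an illustrative heuristic only, not a method from the book; the functions `shape` and `repetition_ratio` and all sample records are assumptions made for the example.

```python
import re
from collections import Counter

def shape(record):
    """Reduce a record to its outline: letter runs become 'a', digit runs '9'."""
    outline = re.sub(r"[A-Za-z]+", "a", record)
    return re.sub(r"\d+", "9", outline)

def repetition_ratio(records):
    """Share of records that have the single most common outline."""
    counts = Counter(shape(r) for r in records)
    return counts.most_common(1)[0][1] / len(records)

# Hypothetical metered readings: same form in every record.
meter_readings = [
    "meter=1042 ts=2014-01-01T00:00 kwh=1.2",
    "meter=1042 ts=2014-01-01T00:15 kwh=1.4",
    "meter=2077 ts=2014-01-01T00:00 kwh=0.9",
    "meter=2077 ts=2014-01-01T00:15 kwh=1.1",
]

# Hypothetical emails: each record differs in form and content.
emails = [
    "Hi team, the Q3 numbers are attached.",
    "Can we move lunch to 1pm?",
    "Your warranty claim 5531 was approved yesterday.",
    "Thanks!",
]

print(repetition_ratio(meter_readings))  # every reading shares one outline
print(repetition_ratio(emails))          # no two emails share an outline
```

For the sample data, all four meter readings collapse to one outline, while each email produces its own.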
With unstructured nonrepetitive records there is little or no repetition of either form or content from one record to the next. Some examples of unstructured nonrepetitive information include email, call center conversations, and market research. When you look at one email, the odds are very good that the next email in the database will be different from it. The same is true for call center information, warranty claims, market research, and so forth.
Business Relevancy
Unstructured repetitive data and unstructured nonrepetitive data have very different characteristics. One of the ways in which they differ is business relevancy. In unstructured repetitive data, there are often very few records that are of real business interest. With unstructured nonrepetitive data, however, a very large percentage of the data is business relevant. This difference between the two types of data is shown in Figure 1.1.5.
Figure 1.1.5
As an example of a small percentage of repetitive unstructured data being business relevant, consider the millions of phone calls that are made each day. The government is interested in only a very few of those calls. Or consider manufacturing control information. Nearly all manufacturing records are of no interest. Only a very few records, usually those where the parameters being measured exceed a threshold, are of interest. Oftentimes with unstructured repetitive records there are also records that are not directly or immediately of interest but are potentially of interest.
With unstructured nonrepetitive data, few records are of no interest. There is spam and there are stop words. But other than those two categories of information, nearly all unstructured nonrepetitive data is of interest.
Big Data
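Before looking at how the two types make up Big Data, the relevancy contrast described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the threshold value, the field name `temp_c`, the tiny stop-word list, and the sample data are all hypothetical, and real relevancy screening would be considerably more involved.

```python
THRESHOLD_C = 90.0  # hypothetical limit for a measured temperature

def relevant_readings(readings, limit=THRESHOLD_C):
    """Repetitive data: keep only readings whose value exceeds the limit."""
    return [r for r in readings if r["temp_c"] > limit]

STOP_WORDS = {"the", "a", "is", "and", "to", "of"}  # tiny illustrative list

def relevant_words(text, stop_words=STOP_WORDS):
    """Nonrepetitive data: drop stop words and keep everything else."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return [w for w in words if w and w not in stop_words]

# Hypothetical manufacturing control records: most are unremarkable.
readings = [
    {"unit": 1, "temp_c": 71.5},
    {"unit": 2, "temp_c": 72.0},
    {"unit": 3, "temp_c": 93.4},
    {"unit": 4, "temp_c": 70.8},
]

# Hypothetical email text: most of it carries meaning.
email = "The pump is leaking and the customer wants a refund."

print(relevant_readings(readings))
print(relevant_words(email))
```

In the sample data only one of four readings survives the threshold filter, whereas half of the email's words remain after stop words are removed, which mirrors the difference in the share of business-relevant data on each side.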
Big Data consists of the unstructured repetitive and the unstructured nonrepetitive data in the corporation, as seen in Figure 1.1.6.
Figure 1.1.6
The Great Divide
At first it may seem that the differences between the two types of unstructured data (unstructured repetitive and unstructured nonrepetitive data) are almost whimsical or trivial. In fact they are anything but trivial. Because of the profound differences between the two types of data, there is a great divide that separates them, as shown in Figure 1.1.7.
Figure 1.1.7
The great divide occurs because data on one side of the divide is handled one way and data on the other side is handled in an entirely different manner. For all practical purposes the data found on the different sides of the great divide might as well exist on different planets.
The handling of unstructured repetitive data is almost entirely consumed with managing Hadoop. The emphasis is entirely on accessing, monitoring, displaying, analyzing, and visualizing data residing on a Big Data manager such as Hadoop.
The emphasis with unstructured nonrepetitive data is almost entirely on textual disambiguation: the types of disambiguation, the reformatting of the output, the contextualization of the data, the standardization of the data, and so forth.
The remarkable thing about the great divide is that the disciplines surrounding the data are so diametrically different. Textual disambiguation is a very different subject from the access and analysis of data stored on Hadoop. It is because of the extreme differences between these two worlds that they are said to live on different planets. To use an analogy to illustrate just...