E-book, English, 237 pages
Weiss / Indurkhya / Zhang Text Mining
1st edition 2010
ISBN: 978-0-387-34555-0
Publisher: Springer US
Format: PDF
Copy protection: 1 - PDF Watermark
Predictive Methods for Analyzing Unstructured Information
Data mining is a mature technology. The prediction problem, looking for predictive patterns in data, has been widely studied. Strong methods are available to the practitioner. These methods process structured numerical information, where uniform measurements are taken over a sample of data. Text is often described as unstructured information. So, it would seem, text and numerical data are different, requiring different methods. Or are they? In our view, a prediction problem can be solved by the same methods, whether the data are structured numerical measurements or unstructured text. Text and documents can be transformed into measured values, such as the presence or absence of words, and the same methods that have proven successful for predictive data mining can be applied to text. Yet, there are key differences. Evaluation techniques must be adapted to the chronological order of publication and to alternative measures of error. Because the data are documents, more specialized analytical methods may be preferred for text. Moreover, the methods must be modified to accommodate very high dimensions: tens of thousands of words and documents. Still, the central themes are similar.
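As a rough illustration of the transformation described above, the following minimal sketch turns documents into 0/1 vectors that record the presence or absence of each word. It is not taken from the book or its supplementary software; the function names (`tokenize`, `build_vocabulary`, `to_binary_vector`), the letters-only tokenization rule, and the two sample sentences are assumptions made for this example.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_vocabulary(documents):
    """Return a sorted list of every distinct word seen in the documents."""
    return sorted({word for doc in documents for word in tokenize(doc)})

def to_binary_vector(text, vocabulary):
    """Map a document to 0/1 values: 1 if the vocabulary word occurs in it."""
    present = set(tokenize(text))
    return [1 if word in present else 0 for word in vocabulary]

# Hypothetical sample documents, used only for illustration.
documents = [
    "Stock prices rise",
    "Prices fall as stock markets close",
]

vocabulary = build_vocabulary(documents)
vectors = [to_binary_vector(doc, vocabulary) for doc in documents]

for doc, vector in zip(documents, vectors):
    print(vector, doc)
```

Each row has one column per vocabulary word, so the result is an ordinary numerical table that a standard prediction method could take as input. With a real collection the vocabulary grows to tens of thousands of words, which is the high-dimensionality issue the description mentions.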
Further Information & Material
Preface
  Audience
  Supplementary Web Software
  Acknowledgements
Contents
1 Overview of Text Mining
  1.1 What’s Special about Text Mining?
  1.2 What Types of Problems Can Be Solved?
  1.3 Document Classification
  1.4 Information Retrieval
  1.5 Clustering and Organizing Documents
  1.6 Information Extraction
  1.7 Prediction and Evaluation
  1.8 The Next Chapters
  1.9 Historical and Bibliographical Remarks
2 From Textual Information to Numerical Vectors
  2.1 Collecting Documents
  2.2 Document Standardization
  2.3 Tokenization
  2.4 Lemmatization
  2.5 Vector Generation for Prediction
  2.6 Sentence Boundary Determination
  2.7 Part-Of-Speech Tagging
  2.8 Word Sense Disambiguation
  2.9 Phrase Recognition
  2.10 Named Entity Recognition
  2.11 Parsing
  2.12 Feature Generation
  2.13 Historical and Bibliographical Remarks
3 Using Text for Prediction
  3.1 Recognizing that Documents Fit a Pattern
  3.2 How Many Documents Are Enough?
  3.3 Document Classification
  3.4 Learning to Predict from Text
  3.5 Evaluation of Performance
  3.6 Applications
  3.7 Historical and Bibliographical Remarks
4 Information Retrieval and Text Mining
  4.1 Is Information Retrieval a Form of Text Mining?
  4.2 Key Word Search
  4.3 Nearest-Neighbor Methods
  4.4 Measuring Similarity
  4.5 Web-Based Document Search
  4.6 Document Matching
  4.7 Inverted Lists
  4.8 Evaluation of Performance
  4.9 Historical and Bibliographical Remarks
5 Finding Structure in a Document Collection
  5.1 Clustering Documents by Similarity
  5.2 Similarity of Composite Documents
  5.3 What Do a Cluster’s Labels Mean?
  5.4 Applications
  5.5 Evaluation of Performance
  5.6 Historical and Bibliographical Remarks
6 Looking for Information in Documents
  6.1 Goals of Information Extraction
  6.2 Finding Patterns and Entities from Text
  6.3 Coreference and Relationship Extraction
  6.4 Template Filling and Database Construction
  6.5 Applications
  6.6 Historical and Bibliographical Remarks
7 Case Studies
  7.1 Market Intelligence from the Web
  7.2 Lightweight Document Matching for Digital Libraries
  7.3 Generating Model Cases for Help Desk Applications
  7.4 Assigning Topics to News Articles
  7.5 E-mail Filtering
  7.6 Search Engines
  7.7 Extracting Named Entities from Documents
  7.8 Customized Newspapers
  7.9 Historical and Bibliographical Remarks
8 Emerging Directions
  8.1 Summarization
  8.2 Active Learning
  8.3 Learning with Unlabeled Data
  8.4 Different Ways of Collecting Samples
  8.5 Question Answering
  8.6 Historical and Bibliographical Remarks
Appendix: Software Notes
  A.1 Summary of Software
  A.2 Requirements
  A.3 Download Instructions
References
Author Index
Subject Index




