E-book, English, Volume 8, 387 pages
Stahlbock / Crone / Lessmann: Data Mining
1st edition, 2009
ISBN: 978-1-4419-1280-0
Publisher: Springer
Format: PDF
Copy protection: PDF watermark
Special Issue in Annals of Information Systems
Series: Annals of Information Systems
Over the last two decades, research in data mining has attracted steadily growing interest and original contributions from various disciplines, including computer science, statistics, operations research, and information systems. Data mining supports a wide range of applications, from medical decision making, bioinformatics, web-usage mining, and text and image recognition to prominent business applications in corporate planning, direct marketing, and credit scoring. Research in information systems equally reflects this inter- and multidisciplinary approach, motivating a series of papers at the intersection of data mining and information systems research. This special issue of Annals of Information Systems contains original papers, as well as substantial extensions of selected papers, from the 2007 and 2008 International Conferences on Data Mining (DMIN'07 and DMIN'08, Las Vegas, NV), all of which have been rigorously peer-reviewed. The issue brings together topics from both information systems and data mining and aims to give the reader a snapshot of contemporary research and state-of-the-art practice in data mining.
Authors/Editors
Further information & material
Preface
Contents

1 Data Mining and Information Systems: Quo Vadis?
  Robert Stahlbock, Stefan Lessmann, and Sven F. Crone
  1.1 Introduction
  1.2 Special Issues in Data Mining
    1.2.1 Confirmatory Data Analysis
    1.2.2 Knowledge Discovery from Supervised Learning
    1.2.3 Classification Analysis
    1.2.4 Hybrid Data Mining Procedures
    1.2.5 Web Mining
    1.2.6 Privacy-Preserving Data Mining
  1.3 Conclusion and Outlook
  References

Part I Confirmatory Data Analysis

2 Response-Based Segmentation Using Finite Mixture Partial Least Squares
  Christian M. Ringle, Marko Sarstedt, and Erik A. Mooi
  2.1 Introduction
    2.1.1 On the Use of PLS Path Modeling
    2.1.2 Problem Statement
    2.1.3 Objectives and Organization
  2.2 Partial Least Squares Path Modeling
  2.3 Finite Mixture Partial Least Squares Segmentation
    2.3.1 Foundations
    2.3.2 Methodology
    2.3.3 Systematic Application of FIMIX-PLS
  2.4 Application of FIMIX-PLS
    2.4.1 On Measuring Customer Satisfaction
    2.4.2 Data and Measures
    2.4.3 Data Analysis and Results
  2.5 Summary and Conclusion
  References

Part II Knowledge Discovery from Supervised Learning

3 Building Acceptable Classification Models
  David Martens and Bart Baesens
  3.1 Introduction
  3.2 Comprehensibility of Classification Models
    3.2.1 Measuring Comprehensibility
    3.2.2 Obtaining Comprehensible Classification Models
      3.2.2.1 Building Rule-Based Models
      3.2.2.2 Combining Output Types
      3.2.2.3 Visualization
  3.3 Justifiability of Classification Models
    3.3.1 Taxonomy of Constraints
    3.3.2 Monotonicity Constraint
    3.3.3 Measuring Justifiability
    3.3.4 Obtaining Justifiable Classification Models
  3.4 Conclusion
  References

4 Mining Interesting Rules Without Support Requirement: A General Universal Existential Upward Closure Property
  Yannick Le Bras, Philippe Lenca, and Stéphane Lallich
  4.1 Introduction
  4.2 State of the Art
  4.3 An Algorithmic Property of Confidence
    4.3.1 On UEUC Framework
    4.3.2 The UEUC Property
    4.3.3 An Efficient Pruning Algorithm
    4.3.4 Generalizing the UEUC Property
  4.4 A Framework for the Study of Measures
    4.4.1 Adapted Functions of Measure
      4.4.1.1 Association Rules
      4.4.1.2 Contingency Tables
      4.4.1.3 Minimal Joint Domain
    4.4.2 Expression of a Set of Measures of Ddconf
  4.5 Conditions for GUEUC
    4.5.1 A Sufficient Condition
    4.5.2 A Necessary Condition
    4.5.3 Classification of the Measures
  4.6 Conclusion
  References

5 Classification Techniques and Error Control in Logic Mining
  Giovanni Felici, Bruno Simeone, and Vincenzo Spinelli
  5.1 Introduction
  5.2 Brief Introduction to Box Clustering
  5.3 BC-Based Classifier
  5.4 Best Choice of a Box System
  5.5 Bi-criterion Procedure for BC-Based Classifier
  5.6 Examples
    5.6.1 The Data Sets
    5.6.2 Experimental Results with BC
    5.6.3 Comparison with Decision Trees
  5.7 Conclusions
  References

Part III Classification Analysis

6 An Extended Study of the Discriminant Random Forest
  Tracy D. Lemmond, Barry Y. Chen, Andrew O. Hatch, and William G. Hanley
  6.1 Introduction
  6.2 Random Forests
  6.3 Discriminant Random Forests
    6.3.1 Linear Discriminant Analysis
    6.3.2 The Discriminant Random Forest Methodology
  6.4 DRF and RF: An Empirical Study
    6.4.1 Hidden Signal Detection
      6.4.1.1 Training on T1, Testing on J2
      6.4.1.2 Prediction Performance for J2 with Cross-Validation
    6.4.2 Radiation Detection
    6.4.3 Significance of Empirical Results
    6.4.4 Small Samples and Early Stopping
    6.4.5 Expected Cost
  6.5 Conclusions
  References

7 Prediction with the SVM Using Test Point Margins
  Süreyya Özögür-Akyüz, Zakria Hussain, and John Shawe-Taylor
  7.1 Introduction
  7.2 Methods
  7.3 Data Set Description
  7.4 Results
  7.5 Discussion and Future Work
  References

8 Effects of Oversampling Versus Cost-Sensitive Learning for Bayesian and SVM Classifiers
  Alexander Liu, Cheryl Martin, Brian La Cour, and Joydeep Ghosh
  8.1 Introduction
  8.2 Resampling
    8.2.1 Random Oversampling
    8.2.2 Generative Oversampling
  8.3 Cost-Sensitive Learning
  8.4 Related Work
  8.5 A Theoretical Analysis of Oversampling Versus Cost-Sensitive Learning
    8.5.1 Bayesian Classification
    8.5.2 Resampling Versus Cost-Sensitive Learning in Bayesian Classifiers
    8.5.3 Effect of Oversampling on Gaussian Naive Bayes
      8.5.3.1 Random Oversampling
      8.5.3.2 Generative Oversampling
      8.5.3.3 Comparison to Cost-Sensitive Learning
    8.5.4 Effects of Oversampling for Multinomial Naive Bayes
  8.6 Empirical Comparison of Resampling and Cost-Sensitive Learning
    8.6.1 Explaining Empirical Differences Between Resampling and Cost-Sensitive Learning
    8.6.2 Naive Bayes Comparisons on Low-Dimensional Gaussian Data
      8.6.2.1 Gaussian Naive Bayes on Artificial, Low-Dimensional Data
      8.6.2.2 A Note on ROC and AUC
      8.6.2.3 Gaussian Naive Bayes on Real, Low-Dimensional Data
    8.6.3 Multinomial Naive Bayes
    8.6.4 SVMs
    8.6.5 Discussion
  8.7 Conclusion
  Appendix
  References

9 The Impact of Small Disjuncts on Classifier Learning
  Gary M. Weiss
  9.1 Introduction
  9.2 An Example: The Vote Data Set
  9.3 Description of Experiments
  9.4 The Problem with Small Disjuncts
  9.5 The Effect of Pruning on Small Disjuncts
  9.6 The Effect of Training Set Size on Small Disjuncts
  9.7 The Effect of Noise on Small Disjuncts
  9.8 The Effect of Class Imbalance on Small Disjuncts
  9.9 Related Work
  9.10 Conclusion
  References

Part IV Hybrid Data Mining Procedures

10 Predicting Customer Loyalty Labels in a Large Retail Database: A Case Study in Chile
  Cristián J. Figueroa
  10.1 Introduction
  10.2 Related Work
  10.3 Objectives of the Study
    10.3.1 Supervised and Unsupervised Learning
    10.3.2 Unsupervised Algorithms
      10.3.2.1 Self-Organizing Map
      10.3.2.2 Sammon Mapping
      10.3.2.3 Curvilinear Component Analysis
    10.3.3 Variables for Segmentation
    10.3.4 Exploratory Data Analysis
    10.3.5 Results of the Segmentation
  10.4 Results of the Classifier
  10.5 Business Validation
    10.5.1 In-Store Minutes Charges for Prepaid Cell Phones
    10.5.2 Distribution of Products in the Store
  10.6 Conclusions and Discussion
  Appendix
  References

11 PCA-Based Time Series Similarity Search
  Leonidas Karamitopoulos, Georgios Evangelidis, and Dimitris Dervos
  11.1 Introduction
  11.2 Background
    11.2.1 Review of PCA
    11.2.2 Implications of PCA in Similarity Search
    11.2.3 Related Work
  11.3 Proposed Approach
  11.4 Experimental Methodology
    11.4.1 Data Sets
    11.4.2 Evaluation Methods
    11.4.3 Rival Measures
  11.5 Results
    11.5.1 1-NN Classification
    11.5.2 k-NN Similarity Search
    11.5.3 Speeding Up the Calculation of APEdist
  11.6 Conclusion
  References

12 Evolutionary Optimization of Least-Squares Support Vector Machines
  Arjan Gijsberts, Giorgio Metta, and Léon Rothkrantz
  12.1 Introduction
  12.2 Kernel Machines
    12.2.1 Least-Squares Support Vector Machines
    12.2.2 Kernel Functions
      12.2.2.1 Conditions for Kernels
  12.3 Evolutionary Computation
    12.3.1 Genetic Algorithms
    12.3.2 Evolution Strategies
    12.3.3 Genetic Programming
  12.4 Related Work
    12.4.1 Hyperparameter Optimization
    12.4.2 Combined Kernel Functions
  12.5 Evolutionary Optimization of Kernel Machines
    12.5.1 Hyperparameter Optimization
    12.5.2 Kernel Construction
    12.5.3 Objective Function
  12.6 Results
    12.6.1 Data Sets
    12.6.2 Results for Hyperparameter Optimization
    12.6.3 Results for EvoKMGP
  12.7 Conclusions and Future Work
  References

13 Genetically Evolved kNN Ensembles
  Ulf Johansson, Rikard König, and Lars Niklasson
  13.1 Introduction
  13.2 Background and Related Work
  13.3 Method
    13.3.1 Data Sets
  13.4 Results
  13.5 Conclusions
  References

Part V Web Mining

14 Behaviorally Founded Recommendation Algorithm for Browsing Assistance Systems
  Peter Géczy, Noriaki Izumi, Shotaro Akaho, and Kôiti Hasida
  14.1 Introduction
    14.1.1 Related Works
    14.1.2 Our Contribution and Approach
  14.2 Concept Formalization
  14.3 System Design
    14.3.1 A Priori Knowledge of Human-System Interactions
    14.3.2 Strategic Design Factors
    14.3.3 Recommendation Algorithm Derivation
  14.4 Practical Evaluation
    14.4.1 Intranet Portal
    14.4.2 System Evaluation
    14.4.3 Practical Implications and Limitations
  14.5 Conclusions and Future Work
  References

15 Using Web Text Mining to Predict Future Events: A Test of the Wisdom of Crowds Hypothesis
  Scott Ryan and Lutz Hamel
  15.1 Introduction
  15.2 Method
    15.2.1 Hypotheses and Goals
    15.2.2 General Methodology
    15.2.3 The 2006 Congressional and Gubernatorial Elections
    15.2.4 Sporting Events and Reality Television Programs
    15.2.5 Movie Box Office Receipts and Music Sales
    15.2.6 Replication
  15.3 Results and Discussion
    15.3.1 The 2006 Congressional and Gubernatorial Elections
    15.3.2 Sporting Events and Reality Television Programs
    15.3.3 Movie and Music Album Results
  15.4 Conclusion
  References

Part VI Privacy-Preserving Data Mining

16 Avoiding Attribute Disclosure with the (Extended) p-Sensitive k-Anonymity Model
  Traian Marius Truta and Alina Campan
  16.1 Introduction
  16.2 Privacy Models and Algorithms
    16.2.1 The p-Sensitive k-Anonymity Model and Its Extension
    16.2.2 Algorithms for the p-Sensitive k-Anonymity Model
  16.3 Experimental Results
    16.3.1 Experiments for p-Sensitive k-Anonymity
    16.3.2 Experiments for Extended p-Sensitive k-Anonymity
  16.4 New Enhanced Models Based on p-Sensitive k-Anonymity
    16.4.1 Constrained p-Sensitive k-Anonymity
    16.4.2 p-Sensitive k-Anonymity in Social Networks
  16.5 Conclusions and Future Work
  References

17 Privacy-Preserving Random Kernel Classification of Checkerboard Partitioned Data
  Olvi L. Mangasarian and Edward W. Wild
  17.1 Introduction
  17.2 Privacy-Preserving Linear Classifier for Checkerboard Partitioned Data
  17.3 Privacy-Preserving Nonlinear Classifier for Checkerboard Partitioned Data
  17.4 Computational Results
  17.5 Conclusion
  References




