E-book, English, Volume 43, 350 pages
Triantaphyllou: Data Mining and Knowledge Discovery via Logic-Based Methods
Theory, Algorithms, and Applications
1st edition, 2010
Series: Springer Optimization and Its Applications
ISBN: 978-1-4419-1630-3
Publisher: Springer
Format: PDF
Copy protection: 1 - PDF Watermark
The importance of having efficient and effective methods for data mining and knowledge discovery (DM&KD), to which the present book is devoted, grows every day, and numerous such methods have been developed in recent decades. There exists a great variety of settings for the main problem studied by data mining and knowledge discovery, and a very popular one is formulated in terms of binary attributes. In this setting, states of nature of the application area under consideration are described by Boolean vectors defined on some attributes, that is, by data points defined in the Boolean space of the attributes. It is postulated that there exists a partition of this space into two classes, which should be inferred as patterns on the attributes when only a few data points are known: the so-called positive and negative training examples.

The main problem in DM&KD is defined as finding rules for recognizing (classifying) new data points of unknown class, i.e., deciding which of them are positive and which are negative. In other words, the task is to infer the binary value of one more attribute, called the goal or class attribute. To solve this problem, methods have been suggested which construct a Boolean function separating the two given sets of positive and negative training data points.
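To make the separating-function idea concrete, here is a minimal Python sketch (illustrative only, not the book's OCAT or branch-and-bound algorithms; the function names are hypothetical). It greedily builds a CNF formula, one clause at a time, that accepts every positive example and rejects every negative one, assuming the two example sets are disjoint:

# A minimal sketch (not from the book): greedily infer a CNF formula that
# accepts all positive examples and rejects all negative ones.
# Examples are 0/1 tuples; a literal is (attribute_index, polarity).

def lit_value(example, lit):
    i, positive = lit
    return example[i] if positive else 1 - example[i]

def clause_satisfied(example, clause):
    # A clause (disjunction of literals) is satisfied if any literal is true.
    return any(lit_value(example, lit) for lit in clause)

def infer_cnf(pos, neg):
    # Assumes pos and neg are disjoint lists of Boolean vectors.
    cnf, uncovered = [], list(neg)
    while uncovered:
        seed = uncovered[0]
        # The maximal clause falsified by `seed`: every literal is 0 on it,
        # while any example differing from `seed` anywhere satisfies it.
        clause = [(i, seed[i] == 0) for i in range(len(seed))]
        # Greedily drop literals while all positives stay satisfied;
        # shorter clauses reject more negative examples.
        for lit in list(clause):
            trial = [l for l in clause if l != lit]
            if trial and all(clause_satisfied(p, trial) for p in pos):
                clause = trial
        cnf.append(clause)
        # Keep only the negatives this clause does not yet reject.
        uncovered = [e for e in uncovered if clause_satisfied(e, clause)]
    return cnf

# Example: positives (1,0,1), (1,1,1); negatives (0,0,0), (0,1,0).
# infer_cnf returns a single short clause, [(2, True)] (i.e., "x2"),
# which every positive satisfies and every negative falsifies.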
Authors/Editors
Further Information & Material
Foreword (p. 7)
Preface (p. 11)
Acknowledgments (p. 16)
List of Figures (p. 25)
List of Tables (p. 29)
Part I: Algorithmic Issues (p. 32)
  Introduction (p. 33)
    What Is Data Mining and Knowledge Discovery? (p. 33)
    Some Potential Application Areas for Data Mining and Knowledge Discovery (p. 34)
      Applications in Engineering (p. 35)
      Applications in Medical Sciences (p. 35)
      Applications in the Basic Sciences (p. 36)
      Applications in Business (p. 36)
      Applications in the Political and Social Sciences (p. 37)
    The Data Mining and Knowledge Discovery Process (p. 37)
      Problem Definition (p. 37)
      Collecting the Data (p. 39)
      Data Preprocessing (p. 40)
      Application of the Main Data Mining and Knowledge Discovery Algorithms (p. 41)
      Interpretation of the Results of the Data Mining and Knowledge Discovery Process (p. 42)
    Four Key Research Challenges in Data Mining and Knowledge Discovery (p. 42)
      Collecting Observations about the Behavior of the System (p. 43)
      Identifying Patterns from Collections of Data (p. 44)
      Which Data to Consider for Evaluation Next? (p. 47)
      Do Patterns Always Exist in Data? (p. 49)
    Concluding Remarks (p. 50)
  Inferring a Boolean Function from Positive and Negative Examples (p. 51)
    An Introduction (p. 51)
    Some Background Information (p. 52)
    Data Binarization (p. 56)
    Definitions and Terminology (p. 59)
    Generating Clauses from Negative Examples Only (p. 62)
    Clause Inference as a Satisfiability Problem (p. 63)
    An SAT Approach for Inferring CNF Clauses (p. 64)
    The One Clause At a Time (OCAT) Concept (p. 65)
    A Branch-and-Bound Approach for Inferring a Single Clause (p. 68)
    A Heuristic for Problem Preprocessing (p. 75)
    Some Computational Results (p. 77)
    Concluding Remarks (p. 80)
    Appendix (p. 82)
  A Revised Branch-and-Bound Approach for Inferring a Boolean Function from Examples (p. 87)
    Some Background Information (p. 87)
    The Revised Branch-and-Bound Algorithm (p. 87)
      Generating a Single CNF Clause (p. 88)
      Generating a Single DNF Clause (p. 92)
      Some Computational Results (p. 94)
    Concluding Remarks (p. 99)
  Some Fast Heuristics for Inferring a Boolean Function from Examples (p. 103)
    Some Background Information (p. 103)
    A Fast Heuristic for Inferring a Boolean Function from Complete Data (p. 105)
    A Fast Heuristic for Inferring a Boolean Function from Incomplete Data (p. 110)
    Some Computational Results (p. 114)
      Results for the RA1 Algorithm on the Wisconsin Cancer Data (p. 116)
      Results for the RA2 Heuristic on the Wisconsin Cancer Data with Some Missing Values (p. 121)
      Comparison of the RA1 Algorithm and the B&B Method Using Large Random Data Sets (p. 122)
    Concluding Remarks (p. 128)
  An Approach to Guided Learning of Boolean Functions (p. 131)
    Some Background Information (p. 131)
    Problem Description (p. 134)
    The Proposed Approach (p. 135)
    On the Number of Candidate Solutions (p. 140)
    An Illustrative Example (p. 141)
    Some Computational Results (p. 143)
    Concluding Remarks (p. 152)
  An Incremental Learning Algorithm for Inferring Boolean Functions (p. 154)
    Some Background Information (p. 154)
    Problem Description (p. 155)
    Some Related Developments (p. 156)
    The Proposed Incremental Algorithm (p. 159)
      Repairing a Boolean Function that Incorrectly Rejects a Positive Example (p. 160)
      Repairing a Boolean Function that Incorrectly Accepts a Negative Example (p. 162)
      Computational Complexity of the Algorithms for the ILE Approach (p. 163)
    Experimental Data (p. 163)
    Analysis of the Computational Results (p. 164)
      Results on the Classification Accuracy (p. 165)
      Results on the Number of Clauses (p. 168)
      Results on the CPU Times (p. 170)
    Concluding Remarks (p. 173)
  A Duality Relationship Between Boolean Functions in CNF and DNF Derivable from the Same Training Examples (p. 175)
    Introduction (p. 175)
    Generating Boolean Functions in CNF and DNF Form (p. 175)
    An Illustrative Example of Deriving Boolean Functions in CNF and DNF (p. 176)
    Some Computational Results (p. 177)
    Concluding Remarks (p. 178)
  The Rejectability Graph of Two Sets of Examples (p. 179)
    Introduction (p. 179)
    The Definition of the Rejectability Graph (p. 180)
      Properties of the Rejectability Graph (p. 181)
      On the Minimum Clique Cover of the Rejectability Graph (p. 183)
    Problem Decomposition (p. 184)
      Connected Components (p. 184)
      Clique Cover (p. 185)
    An Example of Using the Rejectability Graph (p. 186)
    Some Computational Results (p. 188)
    Concluding Remarks (p. 198)
Part II: Application Issues (p. 199)
  The Reliability Issue in Data Mining: The Case of Computer-Aided Breast Cancer Diagnosis (p. 200)
    Introduction (p. 200)
    Some Background Information on Computer-Aided Breast Cancer Diagnosis (p. 200)
    Reliability Criteria (p. 202)
    The Representation/Narrow Vicinity Hypothesis (p. 205)
    Some Computational Results (p. 208)
    Concluding Remarks (p. 210)
    Appendix I: Definitions of the Key Attributes (p. 212)
    Appendix II: Technical Procedures (p. 214)
    The Interactive Approach (p. 214)
    The Hierarchical Approach (p. 215)
    The Monotonicity Property (p. 215)
    Logical Discriminant Functions (p. 216)
  Data Mining and Knowledge Discovery by Means of Monotone Boolean Functions (p. 218)
    Introduction (p. 218)
    Background Information (p. 220)
      Problem Descriptions (p. 220)
      Hierarchical Decomposition of Attributes (p. 223)
      Some Key Properties of Monotone Boolean Functions (p. 224)
      Existing Approaches to Problem 1 (p. 228)
      An Existing Approach to Problem 2 (p. 230)
      Existing Approaches to Problem 3 (p. 231)
      Stochastic Models for Problem 3 (p. 231)
    Inference Objectives and Methodology (p. 233)
      The Inference Objective for Problem 1 (p. 233)
      The Inference Objective for Problem 2 (p. 234)
      The Inference Objective for Problem 3 (p. 235)
      Incremental Updates for the Fixed Misclassification Probability Model (p. 235)
      Selection Criteria for Problem 1 (p. 236)
      Selection Criteria for Problems 2.1, 2.2, and 2.3 (p. 237)
      Selection Criterion for Problem 3 (p. 237)
    Experimental Results (p. 242)
      Experimental Results for Problem 1 (p. 242)
      Experimental Results for Problem 2 (p. 244)
      Experimental Results for Problem 3 (p. 246)
    Summary and Discussion (p. 250)
      Summary of the Research Findings (p. 250)
      Significance of the Research Findings (p. 252)
      Future Research Directions (p. 253)
    Concluding Remarks (p. 254)
  Some Application Issues of Monotone Boolean Functions (p. 255)
    Some Background Information (p. 255)
    Expressing Any Boolean Function in Terms of Monotone Ones (p. 255)
    Formulations of Diagnostic Problems as the Inference of Nested Monotone Boolean Functions (p. 257)
      An Application to a Reliability Engineering Problem (p. 257)
      An Application to the Breast Cancer Diagnosis Problem (p. 258)
    Design Problems (p. 259)
    Process Diagnosis Problems (p. 260)
    Three Major Illusions in the Evaluation of the Accuracy of Data Mining Models (p. 260)
      First Illusion: The Single Index Accuracy Rate (p. 261)
      Second Illusion: Accurate Diagnosis without Hard Cases (p. 261)
      Third Illusion: High Accuracy on Random Test Data Only (p. 262)
    Identification of the Monotonicity Property (p. 262)
    Concluding Remarks (p. 265)
  Mining of Association Rules (p. 266)
    Some Background Information (p. 266)
    Problem Description (p. 268)
    Methodology (p. 269)
      Some Related Algorithmic Developments (p. 269)
      Alterations to the RA1 Algorithm (p. 270)
    Computational Experiments (p. 272)
    Concluding Remarks (p. 280)
  Data Mining of Text Documents (p. 281)
    Some Background Information (p. 281)
    A Brief Description of the Document Clustering Process (p. 283)
    Using the OCAT Approach to Classify Text Documents (p. 284)
    An Overview of the Vector Space Model (p. 286)
    A Guided Learning Approach for the Classification of Text Documents (p. 288)
    Experimental Data (p. 289)
    Testing Methodology (p. 291)
      The Leave-One-Out Cross Validation (p. 291)
      The 30/30 Cross Validation (p. 291)
      Statistical Performance of Both Algorithms (p. 291)
      Experimental Setting for the Guided Learning Approach (p. 292)
    Results for the Leave-One-Out and the 30/30 Cross Validations (p. 293)
    Results for the Guided Learning Approach (p. 296)
    Concluding Remarks (p. 299)
  First Case Study: Predicting Muscle Fatigue from EMG Signals (p. 301)
    Introduction (p. 301)
    General Problem Description (p. 301)
    Experimental Data (p. 303)
    Analysis of the EMG Data (p. 304)
      The Effects of Load and Electrode Orientation (p. 304)
      The Effects of Muscle Condition, Load, and Electrode Orientation (p. 304)
    A Comparative Analysis of the EMG Data (p. 305)
      Results by the OCAT/RA1 Approach (p. 306)
      Results by Fisher's Linear Discriminant Analysis (p. 307)
      Results by Logistic Regression (p. 308)
      A Neural Network Approach (p. 309)
    Concluding Remarks (p. 311)
  Second Case Study: Inference of Diagnostic Rules for Breast Cancer (p. 312)
    Introduction (p. 312)
    Description of the Data Set (p. 312)
    Description of the Inferred Rules (p. 315)
    Concluding Remarks (p. 319)
  A Fuzzy Logic Approach to Attribute Formalization: Analysis of Lobulation for Breast Cancer Diagnosis (p. 320)
    Introduction (p. 320)
    Some Background Information on Digital Mammography (p. 320)
    Some Background Information on Fuzzy Sets (p. 322)
    Formalization with Fuzzy Logic (p. 323)
    Degrees of Lobularity and Microlobularity (p. 329)
    Concluding Remarks (p. 331)
  Conclusions (p. 332)
    General Concluding Remarks (p. 332)
    Twelve Key Areas of Potential Future Research on Data Mining and Knowledge Discovery from Databases (p. 333)
      Overfitting and Overgeneralization (p. 333)
      Guided Learning (p. 334)
      Stochasticity (p. 334)
      More on Monotonicity (p. 334)
      Visualization (p. 334)
      Systems for Distributed Computing Environments (p. 335)
      Developing Better Exact Algorithms and Heuristics (p. 335)
      Hybridization and Other Algorithmic Issues (p. 335)
      Systems with Self-Explanatory Capabilities (p. 336)
      New Systems for Image Analysis (p. 336)
      Systems for Web Applications (p. 336)
      Developing More Applications (p. 337)
    Epilogue (p. 337)
References (p. 339)
Subject Index (p. 356)
Author Index (p. 366)
About the Author (p. 370)