E-book, English, 385 pages
Fernández / García / Galar: Learning from Imbalanced Data Sets
1st edition, 2018
ISBN: 978-3-319-98074-4
Publisher: Springer International Publishing
Format: PDF
Copy protection: 1 - PDF watermark
This book provides a general and comprehensible overview of imbalanced learning. It contains a formal description of the problem and focuses on its main features and the most relevant proposed solutions. Additionally, it considers the different Data Science scenarios in which imbalanced classification poses a real challenge. The book stresses the gap with standard classification tasks by reviewing case studies and the ad hoc performance metrics applied in this area. It also covers the approaches that have traditionally been applied to address a binary skewed class distribution: cost-sensitive learning, data-level preprocessing methods, and algorithm-level solutions, together with ensemble-learning solutions that embed any of the former alternatives. Furthermore, it addresses the extension of the problem to multiple classes, where the classical methods can no longer be applied in a straightforward way.

The book also examines the intrinsic data characteristics which, added to the uneven class distribution, truly hinder the performance of classification algorithms in this scenario. Some notes on data reduction are then provided in order to understand the advantages of this type of approach.

Finally, the book introduces some novel areas of study that are attracting growing attention on the imbalanced data issue: the classification of data streams, non-classical classification problems, and scalability for Big Data. Examples of software libraries and modules to address imbalanced classification are provided.

This book is highly suitable for technical professionals and for senior undergraduate and graduate students in data science, computer science, and engineering. It will also be useful for scientists and researchers seeking insight into current developments in this area of study, as well as future research directions.
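To make the resampling idea the book covers concrete, here is a minimal, illustrative sketch in the spirit of SMOTE (treated in Chapter 5): new minority examples are generated by interpolating between a minority point and one of its nearest minority-class neighbours. The function name `smote_like_oversample` and the toy data below are invented for this example; a real analysis would use a dedicated library such as those surveyed in Chapter 14.

```python
import random

def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic minority samples, SMOTE-style:
    pick a minority point, pick one of its k nearest minority
    neighbours, and interpolate a new point between them."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority-class neighbours of x (brute force)
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        z = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, z)))
    return synthetic

# Toy minority class (4 points in 2-D); generate 4 synthetic examples.
minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.3), (1.1, 1.1)]
new_points = smote_like_oversample(minority, n_new=4)
print(len(new_points))  # 4 synthetic minority examples
```

Because each synthetic point lies on a segment between two existing minority points, the new examples stay inside the region already occupied by the minority class, which is the core intuition behind SMOTE and its many extensions discussed in the book.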
Authors/Editors
Further information & material
Preface (p. 6)
Contents (p. 8)
Acronyms (p. 15)
1 Introduction to KDD and Data Science (p. 17)
  1.1 Introduction (p. 17)
  1.2 A Definition of Data Science (p. 19)
  1.3 The Data Science Process (p. 20)
    1.3.1 Selection of the Data (p. 22)
    1.3.2 Data Preprocessing (p. 23)
      1.3.2.1 Why Is Preprocessing Required? (p. 23)
    1.3.3 Stages of the Data Preprocessing Phase (p. 24)
      1.3.3.1 Selection of Data (p. 25)
      1.3.3.2 Exploration of Data (p. 26)
      1.3.3.3 Transformation of Data (p. 27)
  1.4 Standard Data Science Problems (p. 27)
    1.4.1 Descriptive Problems (p. 27)
    1.4.2 Predictive Problems (p. 28)
  1.5 Classical Data Mining Techniques (p. 29)
  1.6 Non-standard Data Science Problems (p. 30)
    1.6.1 Derivative Problems (p. 30)
      1.6.1.1 Imbalanced Learning (p. 30)
      1.6.1.2 Multi-instance Learning (p. 31)
      1.6.1.3 Multi-label Classification (p. 31)
      1.6.1.4 Data Stream Learning (p. 31)
    1.6.2 Hybrid Problems (p. 31)
      1.6.2.1 Semi-supervised Learning (p. 31)
      1.6.2.2 Subgroup Discovery (p. 31)
      1.6.2.3 Ordinal Classification/Regression (p. 32)
      1.6.2.4 Transfer Learning (p. 32)
  References (p. 32)
2 Foundations on Imbalanced Classification (p. 34)
  2.1 Formal Description (p. 34)
  2.2 Applications (p. 39)
    2.2.1 Engineering (p. 42)
    2.2.2 Information Technology (p. 43)
    2.2.3 Bioinformatics (p. 45)
    2.2.4 Medicine (p. 46)
      2.2.4.1 Quality Control (p. 46)
      2.2.4.2 Medical Diagnosis (p. 47)
      2.2.4.3 Medical Prognosis (p. 49)
    2.2.5 Business Management (p. 50)
    2.2.6 Security (p. 50)
    2.2.7 Education (p. 51)
  2.3 Case Studies on Imbalanced Classification (p. 51)
  References (p. 56)
3 Performance Measures (p. 62)
  3.1 Introduction (p. 62)
  3.2 Nominal Class Predictions (p. 63)
  3.3 Scoring Predictions (p. 68)
  3.4 Probabilistic Predictions (p. 72)
  3.5 Summarizing Comments (p. 73)
  References (p. 74)
4 Cost-Sensitive Learning (p. 77)
  4.1 Introduction (p. 77)
  4.2 Obtaining the Cost Matrix (p. 80)
  4.3 MetaCost (p. 82)
  4.4 Cost-Sensitive Decision Trees (p. 83)
    4.4.1 Direct Approach with Cost-Sensitive Splitting (p. 84)
    4.4.2 Meta-learning Approach with Instance Weighting (p. 85)
  4.5 Other Cost-Sensitive Classifiers (p. 86)
    4.5.1 Support Vector Machines (p. 86)
    4.5.2 Artificial Neural Networks (p. 87)
    4.5.3 Nearest Neighbors (p. 87)
  4.6 Hybrid Cost-Sensitive Approaches (p. 87)
  4.7 Summarizing Comments (p. 88)
  References (p. 89)
5 Data Level Preprocessing Methods (p. 93)
  5.1 Introduction (p. 93)
  5.2 Undersampling and Oversampling Basics (p. 96)
  5.3 Advanced Undersampling Techniques (p. 100)
    5.3.1 Evolutionary Undersampling (p. 101)
      5.3.1.1 ACOSampling (p. 103)
      5.3.1.2 IPADE-ID (p. 104)
      5.3.1.3 CBEUS: Cluster-Based Evolutionary Undersampling (p. 105)
    5.3.2 Undersampling by Cleaning Data (p. 106)
      5.3.2.1 Weighted Sampling (p. 106)
      5.3.2.2 IHT: Instance Hardness Threshold (p. 106)
      5.3.2.3 Hybrid Undersampling (p. 108)
    5.3.3 Ensemble Based Undersampling (p. 108)
      5.3.3.1 IRUS: Inverse Random Undersampling (p. 109)
      5.3.3.2 OligoIS: Oligarchic Instance Selection (p. 110)
    5.3.4 Clustering Based Undersampling (p. 110)
      5.3.4.1 ClusterOSS (p. 111)
      5.3.4.2 DSUS: Diversified Sensitivity Undersampling (p. 111)
  5.4 Synthetic Minority Oversampling TEchnique (SMOTE) (p. 112)
  5.5 Extensions of SMOTE (p. 115)
    5.5.1 Borderline-SMOTE (p. 115)
    5.5.2 Adjusting the Direction of the Synthetic Minority ClasS Examples: ADOMS (p. 117)
    5.5.3 ADASYN: Adaptive Synthetic Sampling Approach (p. 118)
      Input (p. 119)
      Procedure (p. 119)
    5.5.4 ROSE: Random Oversampling Examples (p. 120)
    5.5.5 Safe-Level-SMOTE (p. 122)
    5.5.6 DBSMOTE: Density-Based SMOTE (p. 122)
    5.5.7 MWMOTE: Majority Weighted Minority Oversampling TEchnique (p. 124)
      Input (p. 126)
      Procedure (p. 126)
    5.5.8 MDO: Mahalanobis Distance-Based Oversampling Technique (p. 128)
  5.6 Hybridizations of Undersampling and Oversampling (p. 128)
  5.7 Summarizing Comments (p. 131)
  References (p. 131)
6 Algorithm-Level Approaches (p. 136)
  6.1 Introduction (p. 136)
  6.2 Support Vector Machines (p. 137)
    6.2.1 Kernel Modifications (p. 140)
      6.2.1.1 Kernel Boundary and Margin Shift (p. 140)
      6.2.1.2 Kernel Target Alignment (p. 141)
      6.2.1.3 Kernel Scaling (p. 141)
    6.2.2 Weighted Approaches (p. 142)
      6.2.2.1 Instance Weighting (p. 142)
      6.2.2.2 Support Vector Weighting (p. 143)
      6.2.2.3 Fuzzy Approaches (p. 144)
    6.2.3 Active Learning (p. 146)
  6.3 Decision Trees (p. 147)
  6.4 Nearest Neighbor Classifiers (p. 149)
  6.5 Bayesian Classifiers (p. 151)
  6.6 One-Class Classifiers (p. 152)
  6.7 Summarizing Comments (p. 154)
  References (p. 154)
7 Ensemble Learning (p. 160)
  7.1 Introduction (p. 160)
  7.2 Foundations on Ensemble Learning (p. 161)
    7.2.1 Bagging (p. 165)
    7.2.2 Boosting (p. 168)
    7.2.3 Techniques to Increase Diversity in Classifier Ensembles (p. 173)
  7.3 Ensemble Learning for Addressing the Class Imbalance Problem (p. 174)
    7.3.1 Cost-Sensitive Boosting (p. 176)
      7.3.1.1 AdaCost (p. 178)
      7.3.1.2 CSB (p. 179)
      7.3.1.3 RareBoost (p. 179)
      7.3.1.4 AdaC1 (p. 180)
      7.3.1.5 AdaC2 (p. 180)
      7.3.1.6 AdaC3 (p. 181)
    7.3.2 Ensembles with Cost-Sensitive Base Classifiers (p. 181)
      7.3.2.1 BoostedCS-SVM (p. 181)
      7.3.2.2 BoostedWeightedELM (p. 182)
      7.3.2.3 CS-DT-Ensemble (p. 182)
      7.3.2.4 BayEnsBNN (p. 182)
      7.3.2.5 AL-BoostedCS-SVM (p. 183)
      7.3.2.6 IC-BoostedCS-SVM (p. 183)
    7.3.3 Boosting-Based Ensembles (p. 183)
      7.3.3.1 SMOTEBoost/MSMOTEBoost (p. 183)
      7.3.3.2 RUSBoost (p. 184)
      7.3.3.3 DataBoost-IM (p. 184)
      7.3.3.4 RAMOBoost (p. 185)
      7.3.3.5 AdaBoost.NC (p. 185)
      7.3.3.6 EUSBoost (p. 186)
      7.3.3.7 GESuperPBoost (p. 186)
      7.3.3.8 BalancedBoost (p. 186)
      7.3.3.9 RB-Boost (p. 186)
      7.3.3.10 Balanced-St-GrBoost (p. 187)
    7.3.4 Bagging-Based Ensembles (p. 187)
      7.3.4.1 OverBagging (p. 188)
      7.3.4.2 UnderBagging (p. 188)
      7.3.4.3 UnderOverBagging (p. 189)
      7.3.4.4 IIVotes (p. 190)
      7.3.4.5 RB-Bagging (p. 190)
      7.3.4.6 EPRENNID (p. 190)
      7.3.4.7 USwitchingNED (p. 191)
    7.3.5 Hybrid Ensembles (p. 191)
      7.3.5.1 EasyEnsemble (p. 192)
      7.3.5.2 BalanceCascade (p. 192)
      7.3.5.3 HardEnsemble (p. 192)
      7.3.5.4 StochasticEnsemble (p. 193)
    7.3.6 Other (p. 193)
      7.3.6.1 MOGP-GP (p. 193)
      7.3.6.2 RandomOracles (p. 193)
      7.3.6.3 Loss Factors (p. 194)
      7.3.6.4 GOBoost (p. 194)
      7.3.6.5 OrderingBasedPruning (p. 194)
      7.3.6.6 Diversity Enhancing Techniques for Improving Ensembles (p. 195)
      7.3.6.7 PT-Bagging (p. 195)
      7.3.6.8 IMCStacking (p. 195)
      7.3.6.9 DynamicSelection (p. 196)
  7.4 An Illustrative Experimental Study on Ensembles for the Class Imbalance Problem (p. 196)
    7.4.1 Experimental Framework (p. 197)
      7.4.1.1 Datasets and Performance Measures (p. 197)
      7.4.1.2 Algorithms and Parameters (p. 197)
      7.4.1.3 Statistical Analysis (p. 199)
    7.4.2 Experimental Results and Discussion (p. 199)
  7.5 Summarizing Contents (p. 203)
  References (p. 204)
8 Imbalanced Classification with Multiple Classes (p. 210)
  8.1 Introduction (p. 210)
  8.2 Multi-class Imbalanced Learning via Decomposition-Based Approaches (p. 212)
    8.2.1 Reducing Multi-class Problems by Binarization Techniques (p. 212)
      8.2.1.1 The One-vs-One Scheme (OVO) (p. 212)
      8.2.1.2 The One-vs-All Scheme (OVA) (p. 213)
    8.2.2 Binary Imbalanced Approaches for Multi-class Problems (p. 214)
    8.2.3 Discussion on the Capabilities of Decomposition Strategies (p. 217)
  8.3 Ad-hoc Approaches for Multi-class Imbalanced Classification (p. 219)
    8.3.1 Multi-class Preprocessing Techniques (p. 219)
    8.3.2 Algorithmic Solutions on Multi-class (p. 220)
    8.3.3 Multi-class Cost-Sensitive Learning (p. 222)
    8.3.4 Ensemble Approaches (p. 223)
    8.3.5 Summary and Future Prospects on Ad-hoc Approaches (p. 225)
      8.3.5.1 Preprocessing Techniques (p. 225)
      8.3.5.2 Algorithmic Approaches (p. 226)
      8.3.5.3 Cost-Sensitive Learning (p. 226)
      8.3.5.4 Ensemble Systems (p. 226)
  8.4 Performance Metrics in Multi-class Imbalanced Problems (p. 226)
  8.5 A Brief Experimental Analysis for Imbalanced Multi-class Problems (p. 230)
    8.5.1 Experimental Setup (p. 230)
    8.5.2 Experimental Results and Discussion (p. 232)
  8.6 Summarizing Comments (p. 234)
  References (p. 234)
9 Dimensionality Reduction for Imbalanced Learning (p. 240)
  9.1 Introduction (p. 240)
  9.2 Feature Selection (p. 242)
    9.2.1 Studies of Classical Feature Selection in Imbalance Learning (p. 243)
    9.2.2 Ad-hoc Feature Selection Techniques for Tackling Imbalance Classification (p. 245)
      9.2.2.1 Feature Selection with Biased Sample Distribution (p. 246)
      9.2.2.2 Combating the Small Sample Class Imbalance Problem Using Feature Selection (p. 248)
      9.2.2.3 Discriminative Feature Selection by Nonparametric Bayes Error Minimization (p. 249)
      9.2.2.4 Feature Selection for High-Dimensional Imbalanced Data (p. 249)
      9.2.2.5 Iterative Feature Selection (p. 251)
  9.3 Advanced Feature Selection (p. 252)
    9.3.1 Ensemble and Wrapper-Based Techniques (p. 252)
    9.3.2 Evolutionary-Based Techniques (p. 253)
  9.4 Linear Models for Feature Extraction (p. 253)
    9.4.1 Asymmetric Principal Component Analysis (p. 254)
    9.4.2 Extraction of Minimum Positive and Maximum Negative Features (p. 256)
      9.4.2.1 Model 1 (p. 257)
      9.4.2.2 Model 2 (p. 258)
  9.5 Non-linear Models for Feature Extraction: Autoencoders (p. 258)
  9.6 Discretization in Imbalanced Data: ur-CAIM (p. 261)
  9.7 Summarizing Comments (p. 262)
  References (p. 263)
10 Data Intrinsic Characteristics (p. 265)
  10.1 Introduction (p. 265)
  10.2 Data Complexity for Imbalanced Datasets (p. 266)
  10.3 Sub-concepts and Small-Disjuncts (p. 267)
  10.4 Lack of Data (p. 273)
  10.5 Overlapping and Separability (p. 274)
  10.6 Noisy Data (p. 276)
  10.7 Borderline Examples (p. 279)
  10.8 Dataset Shift (p. 282)
  10.9 Imperfect Data (p. 284)
  10.10 Summarizing Comments (p. 285)
  References (p. 285)
11 Learning from Imbalanced Data Streams (p. 290)
  11.1 Introduction (p. 290)
  11.2 Characteristics of Imbalanced Data Streams (p. 295)
  11.3 Data-Level and Algorithm-Level Approaches (p. 298)
    11.3.1 Undersampling Naïve Bayes (p. 298)
    11.3.2 Generalized Over-sampling Based Online Imbalanced Learning Framework (GOS-IL) (p. 299)
    11.3.3 Sequential SMOTE (p. 299)
    11.3.4 Recursive Least Square Perceptron Model (RLSACP) and Online Neural Network for Non-stationary and Imbalanced Data Streams (ONN) (p. 299)
    11.3.5 Dynamic Class Imbalance for Linear Proximal SVMs (DCIL-IncLPSVM) (p. 300)
    11.3.6 Kernelized Online Imbalanced Learning (KOIL) (p. 300)
    11.3.7 Gaussian Hellinger Very Fast Decision Tree (GH-VFDT) (p. 300)
    11.3.8 Cost-Sensitive Fast Perceptron Tree (CSPT) (p. 301)
  11.4 Ensemble Learning Approaches (p. 302)
    11.4.1 Stream Ensemble Framework (SE) (p. 302)
    11.4.2 Selectively Recursive Approach (SERA) (p. 303)
    11.4.3 Recursive Ensemble Approach (REA) (p. 303)
    11.4.4 Boundary Definition Ensemble (BD) (p. 303)
    11.4.5 Learn++.CDC (Concept Drift with SMOTE) (p. 304)
    11.4.6 Ensemble of Online Cost-Sensitive Neural Networks (EONN) (p. 304)
    11.4.7 Ensemble of Subset Online Sequential Extreme Learning Machines (ESOS-ELM) (p. 304)
    11.4.8 Oversampling- and Undersampling-Based Online Bagging (OOB and UOB) (p. 304)
    11.4.9 Dynamic Weighted Majority for Imbalance Learning (DWMIL) (p. 305)
    11.4.10 Gradual Resampling Ensemble (GRE) (p. 305)
  11.5 Evolving Number of Classes (p. 305)
    11.5.1 Learn++.NovelClass (Learn++.NC) (p. 306)
    11.5.2 Enhanced Classifier for Data Streams with Novel Class Miner (ECSMiner) (p. 306)
    11.5.3 Multiclass Miner in Data Streams (MCM) (p. 306)
    11.5.4 AnyNovel (p. 307)
    11.5.5 Class-Based Ensemble for Class Evolution (CBCE) (p. 307)
    11.5.6 Class Based Micro Classifier Ensemble (CLAM) and Stream Classifier And Novel and Recurring Class Detector (SCARN) (p. 307)
  11.6 Access to Ground Truth (p. 308)
    11.6.1 Online Active Learning with Bayesian Probit (p. 308)
    11.6.2 Online Mean Score on Unlabeled Set (Online-MSU) (p. 309)
    11.6.3 Cost-Sensitive Online Active Learning Under a Query Budget (CSOAL) (p. 309)
    11.6.4 Online Active Learning with the Asymmetric Query Model (p. 309)
    11.6.5 Genetic Programming Active Learning Framework (Stream-GP) (p. 309)
  11.7 Summarizing Comments (p. 310)
  References (p. 311)
12 Non-classical Imbalanced Classification Problems (p. 315)
  12.1 Introduction (p. 315)
  12.2 Semi-supervised Learning (p. 316)
    12.2.1 Inductive Semi-supervised Learning (p. 316)
    12.2.2 Transductive Learning (p. 317)
    12.2.3 PU-Learning (p. 318)
    12.2.4 Active Learning (p. 318)
  12.3 Multilabel Learning (p. 319)
    12.3.1 Imbalance Quantification (p. 320)
    12.3.2 Methods for Dealing with Imbalance in MLL (p. 321)
      12.3.2.1 Resampling (p. 321)
      12.3.2.2 Algorithm Adaptation (p. 322)
      12.3.2.3 Ensemble Learning (p. 323)
  12.4 Multi-instance Learning (p. 324)
    12.4.1 Methods for Dealing with Imbalance in MIL (p. 325)
      12.4.1.1 Resampling (p. 325)
      12.4.1.2 Problem Adaptation (p. 326)
      12.4.1.3 Ensembles (p. 326)
  12.5 Ordinal Classification and Regression (p. 327)
    12.5.1 Imbalanced Regression (p. 328)
      12.5.1.1 Under-sampling for Regression (p. 329)
      12.5.1.2 SMOTE for Regression (p. 330)
    12.5.2 Ordinal Classification of Imbalanced Data (p. 330)
      12.5.2.1 Graph-Based Over-sampling (p. 331)
      12.5.2.2 Cluster-Based Weighted Over-sampling (p. 331)
  12.6 Summarizing Comments (p. 331)
  References (p. 332)
13 Imbalanced Classification for Big Data (p. 336)
  13.1 Introduction (p. 336)
  13.2 Big Data: MapReduce Programming Model, Spark Framework and Machine Learning Libraries (p. 338)
    13.2.1 Introduction to Big Data and MapReduce (p. 338)
    13.2.2 Spark: A Novel Technological Approach for Iterative Processing in Big Data (p. 340)
    13.2.3 Machine Learning Libraries for Big Data (p. 342)
      13.2.3.1 Hadoop: Apache Mahout (p. 342)
      13.2.3.2 Spark: MLlib and SparkPackages (p. 342)
  13.3 Addressing Imbalanced Classification in Big Data Problems: Current State (p. 343)
    13.3.1 Data Pre-processing Studies (p. 344)
      13.3.1.1 Traditional Data Based Solutions for Big Data (p. 344)
      13.3.1.2 Random OverSampling with Evolutionary Feature Weighting and Random Forest (ROSEFW-RF) (p. 345)
      13.3.1.3 Evolutionary Undersampling (p. 346)
      13.3.1.4 Data Cleaning (p. 346)
      13.3.1.5 NRSBoundary-SMOTE (p. 346)
      13.3.1.6 Extreme Learning Machine with Resampling (p. 347)
      13.3.1.7 Multi-class Imbalance (p. 347)
      13.3.1.8 Summary (p. 348)
    13.3.2 Cost-Sensitive Learning Studies (p. 348)
      13.3.2.1 Cost-Sensitive SVM (p. 348)
      13.3.2.2 Instance Weighting SVM (p. 348)
      13.3.2.3 Cost-Sensitive Random Forest (p. 349)
      13.3.2.4 Cost-Sensitive Fuzzy Rule Based Classification System (FRBCS) (p. 349)
      13.3.2.5 Summary (p. 350)
    13.3.3 Applications on Imbalanced Big Data (p. 350)
      13.3.3.1 Pairwise Ortholog Detection (p. 350)
      13.3.3.2 Traffic Accidents Prediction (p. 351)
      13.3.3.3 Biomedical Data (p. 351)
      13.3.3.4 Human Activity Recognition (p. 352)
      13.3.3.5 Fraud Detection (p. 352)
      13.3.3.6 Summary (p. 352)
  13.4 Challenges for Imbalanced Big Data Classification (p. 353)
  13.5 Summarizing Comments (p. 354)
  References (p. 355)
14 Software and Libraries for Imbalanced Classification (p. 359)
  14.1 Introduction (p. 359)
  14.2 Java Tools (p. 360)
    14.2.1 KEEL Software Suite (p. 361)
    14.2.2 Weka (p. 363)
  14.3 R Packages (p. 366)
    14.3.1 Package Unbalanced (p. 366)
    14.3.2 Package Smotefamily (p. 368)
    14.3.3 Package ROSE (p. 369)
    14.3.4 Package DMwR (p. 370)
    14.3.5 Package Imbalance (p. 371)
    14.3.6 Package mlr: Cost-Sensitive Classification (p. 375)
  14.4 Python Libraries (p. 377)
  14.5 Big Data Software: Spark Packages (p. 379)
  14.6 Summarizing Comments (p. 382)
  References (p. 383)