Machine Learning: A Concise Introduction
Book, English, 432 pages, format (W × H): 180 mm × 257 mm, weight: 975 g
ISBN: 978-1-394-32525-2
Publisher: Wiley
New edition of a PROSE Award finalist title on the core concepts of machine learning, updated with the latest developments in the field and now featuring side-by-side Python and R source code.
Machine Learning is a comprehensive text on the core concepts, approaches, and applications of machine learning. It presents fundamental ideas, terminology, and techniques for solving applied problems in classification, regression, clustering, density estimation, and dimension reduction. New content in this edition includes expanded chapters that provide further computational and algorithmic insight, and several chapters have been revised to account for developments since the prior edition.
The book emphasizes the design principles behind the techniques, including the bias-variance trade-off and its influence on the design of ensemble methods, enabling readers to solve applied problems more efficiently and effectively. It also covers methods for optimization, risk estimation, and model selection, as well as ways to deal with biased data samples and software limitations, all essential elements of most applied projects.
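For orientation, the standard decomposition of expected squared-error prediction loss (a textbook identity stated here for context, not quoted from the book) separates error into bias, variance, and irreducible noise. Writing Y = f(x) + ε with E[ε] = 0 and Var(ε) = σ², the decomposition in LaTeX notation is

\mathbb{E}\big[(Y - \hat{f}(x))^2\big] = \underbrace{\big(f(x) - \mathbb{E}[\hat{f}(x)]\big)^2}_{\text{bias}^2} + \underbrace{\operatorname{Var}\big(\hat{f}(x)\big)}_{\text{variance}} + \underbrace{\sigma^{2}}_{\text{irreducible noise}}

Flexible models tend to shrink the bias term while inflating the variance term; ensemble methods such as bagging aim to reduce variance without greatly increasing bias, which is the influence on ensemble design referred to above.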
Written by an expert in the field, this important resource:
- Illustrates many classification methods with a single running example, highlighting similarities and differences between the methods
- Presents side-by-side Python and R source code that shows how to apply and interpret many of the techniques covered (an illustrative Python sketch follows this list)
- Includes many thoughtful exercises as an integral part of the text, with an appendix of selected solutions
- Contains useful information for effectively communicating with clients on both technical and ethical topics
- Details classification techniques including likelihood methods, prototype methods, neural networks, classification trees, and support vector machines
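The book presents its own side-by-side Python and R listings, which are not reproduced here. As a purely illustrative sketch of the kind of classification workflow covered, the following Python snippet fits a k-nearest-neighbor classifier to toy two-class data using scikit-learn; the package choice and the synthetic data are assumptions made for illustration only.

# Illustrative sketch only; not the book's code. Assumes NumPy and scikit-learn are installed.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Two Gaussian classes in the plane, a toy stand-in for a running classification example.
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(2.0, 1.0, size=(100, 2))])
y = np.repeat([0, 1], 100)

# Hold out a test set, fit a 5-nearest-neighbor classifier, and report test accuracy.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))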
A volume in the popular Wiley Series in Probability and Statistics, Machine Learning offers the practical information needed to understand the methods and applications of machine learning. It is written for advanced undergraduate and beginning graduate students, data science and machine learning practitioners, and other technical professionals in adjacent fields.
Table of Contents
Preface xi
Organization — How to Use This Book xii
Acknowledgments xiv
About the Companion Website xiv
1 Introduction – Examples from Real Life 1
2 The Problem of Learning 3
2.1 Domain 3
2.2 Range 4
2.3 Data 4
2.4 Loss 5
2.5 Risk 8
2.6 The Reality of the Unknown Function 12
2.7 Training and Selection of Models 12
2.8 Purposes of Learning 14
2.9 Notation 14
3 Regression 15
3.1 General Framework 16
3.2 Loss 17
3.3 Estimating the Model Parameters 17
3.4 Properties of Fitted Values 19
3.5 Estimating the Variance 22
3.6 A Normality Assumption 23
3.7 Computation 25
3.8 Categorical Features 26
3.9 Feature Expansions, Interactions, and Transformations 28
3.10 Penalized Regression: Model Transformation for Risk Reduction 31
3.11 Variations in Linear Regression 37
3.12 Nonlinear Regression 39
3.13 Nonparametric Regression 42
4 Classification 45
4.1 The Bayes Classifier 46
4.2 Introduction to Classifiers 47
4.3 Mitigating Biases in Software, Biases in Data, and Zero Probabilities 49
4.3.1 Mitigating Biases in Software by Adjusting Loss and Prior Probabilities 50
4.3.2 Mitigating Biases in Data by Adjusting Loss or Prior Probabilities 51
4.3.3 Mitigating Effects of Zero-One Probability Estimates 52
4.4 Class Boundaries 53
4.5 A Running Example 54
4.6 Likelihood Methods 55
4.6.1 Quadratic Discriminant Analysis 56
4.6.2 Linear Discriminant Analysis 58
4.6.3 Gaussian Mixture Models 60
4.6.4 Kernel Density Estimation 61
4.6.5 Histograms 65
4.6.6 The Naive Bayes Classifier 68
4.7 Prototype Methods 69
4.7.1 k-Nearest-Neighbor 69
4.7.2 Condensed k-Nearest-Neighbor 72
4.7.3 Nearest-Cluster 73
4.7.4 Learning Vector Quantization 73
4.8 Logistic Regression 76
4.8.1 The Logistic Regression Model 76
4.8.2 Adjusting the Marginal or Prior Distribution of Classes 79
4.8.3 Class Boundaries, Hyperplanes, and Geometry 81
4.9 Neural Networks 81
4.9.1 Activation Functions 81
4.9.2 Neurons 82
4.9.3 Single-Hidden-Layer Neural Networks 84
4.9.4 Multi-Hidden-Layer Neural Networks 90
4.9.5 Adjusting the Marginal or Prior Distribution of Classes 91
4.9.6 Logistic Regression and Zero-Hidden-Layer Neural Networks 91
4.10 Classification Trees 93
4.10.1 Classification of Data by Leaves (Terminal Nodes) 93
4.10.2 Impurity of Nodes and Trees 94
4.10.3 Growing Trees 95
4.10.4 Pruning Trees 98
4.10.5 Regression Trees 99
4.11 Support Vector Machines 100
4.11.1 A Geometric Definition of “Good” 100
4.11.2 Support Vector Machine Classifiers for Linearly Separable Data 101
4.11.3 The Central Role of Inner Products 103
4.11.4 Support Vector Machine Classifiers for Data Not Linearly Separable 104
4.11.5 Slack Variables as Hinge Loss 105
4.11.6 Multiple Classes, General Loss, and Non-uniform Class Prior 107
4.11.7 Approximation of the Bayes Classifier 109
4.11.8 Inner Products via Kernel Functions 110
4.12 Postscript: Example Problem Revisited 119
5 Bias-Variance Trade-Off 121
5.1 Squared-Error Loss 121
5.2 General Loss 125
6 Combining Classifiers 131
6.1 Ensembles 131
6.2 Ensemble Design 136
6.3 Bootstrap Aggregation (Bagging) 138
6.4 Random Forests 141
6.5 Boosting and Arcing 142
6.6 Classification by Regression Ensemble 147
6.7 Gradient Boosting 151
6.8 Stacking and Mixture of Experts 156
6.9 Postscript: Example Problem Revisited 160
7 Risk Estimation and Model Selection 163
7.1 Risk Estimation via Training Data 164
7.2 Risk Estimation via Validation or Test Data 164
7.2.1 Training, Validation, and Test Data Sets 164
7.2.2 Training, Validation, and Test Estimates of Risk 165
7.2.3 Application – Precision of Validation and Test Estimates of Risk 166
7.2.4 Application – Comparing a Model’s Risk to a Target Value 166
7.2.5 Application – Comparing the Difference of Models’ Risks to a Target Value 168
7.3 Cross-Validation 169
7.4 Improvements on Cross-Validation 171
7.5 Out-of-Bag Risk Estimation 172
7.6 Akaike’s Information Criterion 173
7.7 Schwarz's Bayesian Information Criterion 174
7.8 Rissanen’s Minimum Description Length Criterion 175
7.9 R² and Adjusted R² 175
7.10 Stepwise Model Selection 177
7.11 Occam’s Razor 177
7.12 Size of Validation and Test Data Sets 178
7.12.1 Measures of Performance for Hypothesis Tests About Risk 178
7.12.2 Size of Training, Validation, and Test Data Sets 178
7.12.3 Example Construction of Training, Validation, and Test Data Sets 180
7.12.4 Example Use of Training and Validation Data Sets 183
7.12.5 Example Use of Test Data Sets 186
8 Consistency 187
8.1 Convergence of Sequences of Random Variables 187
8.2 Consistency for Parameter Estimation 188
8.3 Consistency for Prediction 188
8.4 There Are Consistent and Universally Consistent Classifiers 189
8.5 Convergence to Asymptopia Is Not Uniform and May Be Slow 191
9 Clustering 193
9.1 Gaussian Mixture Models 194
9.2 k-Means 194
9.3 Clustering by Mode-Hunting in a Density Estimate 195
9.4 Using Classifiers to Cluster 196
9.5 Dissimilarity 196
9.6 k-Medoids 197
9.7 k-Modes and k-Prototypes 197
9.8 Agglomerative Hierarchical Clustering 198
9.9 Divisive Hierarchical Clustering 199
9.10 How Many Clusters Are There? Interpretation of Clustering 200
9.11 An Impossibility Theorem 201
10 Optimization 203
10.1 Quasi-Newton Methods 204
10.1.1 The Newton–Raphson Method for Finding Zeros 204
10.1.2 The Newton–Raphson Method for Optimization 205
10.1.3 Gradient Descent 205
10.1.4 The Broyden–Fletcher–Goldfarb–Shanno Algorithm 205
10.1.5 Modifications to Quasi-Newton Methods 206
10.2 The Nelder–Mead Algorithm 207
10.3 Simulated Annealing 207
10.4 Genetic Algorithms 209
10.5 Particle Swarm Optimization 210
10.6 General Remarks on Optimization 211
10.6.1 Imperfectly Known Objective Functions 211
10.6.2 Objective Functions That Are Sums 212
10.6.3 Optimization from Multiple Starting Points 212
10.7 Solving Least-Squares Problems via Quasi-Newton Methods 213
10.8 Gradient Computation for Neural Networks via Backpropagation 214
10.9 Handling Missing Data via the Expectation-Maximization Algorithm 219
10.9.1 The General Algorithm 219
10.9.2 EM Climbs the Marginal Likelihood of the Observations 220
10.9.3 Example – Fitting a Gaussian Mixture Model via EM 222
10.9.4 Example – The Expectation Step 223
10.9.5 Example – The Maximization Step 224
10.10 Fitting Support Vector Machines via Sequential Minimal Optimization 224
10.10.1 Primal and Dual Forms of the Linear SVM Optimization Problem 225
10.10.2 Slater’s Condition and the Karush–Kuhn–Tucker Conditions 226
10.10.3 Generalization to Kernel Support Vector Machines 228
10.10.4 Computation of the Intercept 229
10.10.5 Solving the Dual Problem via Sequential Minimal Optimization 229
10.10.6 Step 1 – Choosing a Pair of Coordinates to Optimize 230
10.10.7 Step 2 – Constrained Optimization of a Pair of Coordinates 231
11 High-Dimensional Data 235
11.1 The Curse of Dimensionality 236
11.2 Two Running Examples 242
11.2.1 Example 1: Equilateral Simplex 242
11.2.2 Example 2: Text 242
11.3 Reducing Dimension While Preserving Information 243
11.3.1 The Geometry of Means and Covariances of Real Features 245
11.3.2 Principal Component Analysis 246
11.3.3 Working in “Dissimilarity Space” 248
11.3.4 Linear Multidimensional Scaling 248
11.3.5 The Singular Value Decomposition and Low-Rank Approximation 250
11.3.6 Stress-Minimizing Multidimensional Scaling 252
11.3.7 Projection Pursuit 252
11.3.8 Feature Selection 253
11.3.9 Clustering 254
11.3.10 Manifold Learning 254
11.3.11 Autoencoders 257
11.4 Model Regularization 261
11.4.1 Duality and the Geometry of Parameter Penalization 262
11.4.2 Parameter Penalization as Prior Information 263
12 Communication with Clients 267
12.1 Binary Classification and Hypothesis Testing 267
12.2 Terminology for Binary Decisions 269
12.3 Receiver Operating Characteristic (ROC) Curves 271
12.4 One-Dimensional Measures of Performance 273
12.5 Confusion Matrices 276
12.6 Pairwise Model Comparison 277
12.7 Multiple Testing 277
12.7.1 Control the Familywise Error 278
12.7.2 Control the False Discovery Rate 278
12.8 Expert Systems 279
12.9 Ethics in Machine Learning 280
12.9.1 Philosophical Foundations 280
12.9.2 Clear Goals 281
12.9.3 Good Practice 281
12.9.4 Machine Learning Might Not Be the Answer, and That’s OK 282
12.9.5 Documentation 282
13 Current Challenges in Machine Learning 283
13.1 Streaming Data 283
13.2 Distributed Data 283
13.3 Semi-Supervised Learning 283
13.4 Active Learning 284
13.5 Feature Construction via Deep Neural Networks 284
13.6 Transfer Learning 284
13.7 Interpretability and Protection of Complex Models 285
14 R and Python Source Code 287
14.1 Author’s Biases 288
14.2 Packages and Code 288
14.3 The Running Example (Section 4.5) 289
14.4 The Bayes Classifier (Section 4.1) 292
14.5 Quadratic Discriminant Analysis (Section 4.6.1) 294
14.6 Linear Discriminant Analysis (Section 4.6.2) 296
14.7 Gaussian Mixture Models (Section 4.6.3) 297
14.8 Kernel Density Estimation (Section 4.6.4) 300
14.9 Histograms (Section 4.6.5) 304
14.10 The Naive Bayes Classifier (Section 4.6.6) 309
14.11 k-Nearest-Neighbor (Section 4.7.1) 312
14.12 Learning Vector Quantization (Section 4.7.4) 314
14.13 Logistic Regression (Section 4.8) 317
14.14 Neural Networks (Section 4.9) 319
14.15 Classification Trees (Section 4.10) 324
14.16 Support Vector Machines (Section 4.11) 332
14.17 Bootstrap Aggregation (Bagging) (Section 6.3) 341
14.18 Random Forests (Section 6.4) 343
14.19 Boosting by Reweighting (Section 6.5) 345
14.20 Boosting by Sampling (Arcing) (Section 6.5) 346
14.21 Gradient Boosted Trees (Section 6.7) 347
Appendix A: List of Symbols 351
Appendix B: The Condition Number of a Matrix with Respect to a Norm 353
Appendix C: Converting Between Normal Parameters and Level-Curve Ellipsoids 357
Appendix D: The Geometry of Linear Functions and Linear Classifiers 359
Appendix E: Training Data and Fitted Parameters 367
Appendix F: Solutions to Selected Exercises 371
Bibliography 399
Index 413