E-book, English, 369 pages
Mailund, Beginning Data Science in R
1st ed.
Data Analysis, Visualization, and Modelling for the Data Scientist
ISBN: 978-1-4842-2671-1
Publisher: Apress
Format: PDF
Copy protection: 1 - PDF Watermark
Discover best practices for data analysis and software development in R and start on the path to becoming a fully fledged data scientist. This book teaches you techniques for both data manipulation and visualization and shows you the best way to develop new software packages for R.
Beginning Data Science in R details how data science is a combination of statistics, computational science, and machine learning. You'll see how to efficiently structure and mine data to extract useful patterns and build mathematical models. This requires computational methods and programming, and R is an ideal programming language for the task.
This book is based on lecture notes for classes the author has taught on data science and statistical programming using the R programming language. Modern data analysis requires computational skills and usually at least a minimum of programming.
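Much of the material revolves around pipeline-style data manipulation with dplyr and plotting with ggplot2 (see Chapters 3 and 4 in the table of contents below). As a purely illustrative taste of that style, here is a minimal sketch, assuming the dplyr and ggplot2 packages are installed and using R's built-in iris data set; it is not code taken from the book:

library(dplyr)
library(ggplot2)

# Summarize mean petal length per species, then plot the result;
# the %>% pipe passes each intermediate data frame to the next step.
iris %>%
  group_by(Species) %>%
  summarise(mean_petal_length = mean(Petal.Length)) %>%
  ggplot(aes(x = Species, y = mean_petal_length)) +
    geom_col()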
What You Will Learn
Perform data science and analytics using statistics and the R programming language
Visualize and explore data, including working with large data sets found in big data
Build an R package
Test and check your code (see the unit-test sketch after this list)
Practice version control
Profile and optimize your code
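The testing item above corresponds to Chapter 12, which covers unit testing with the testthat package. A minimal sketch of such a test, assuming testthat is installed and using a hypothetical rmse() helper (echoing the Root Mean Square Error exercise listed under Chapter 1), could look like this:

library(testthat)

# Hypothetical helper: root mean square error of predictions.
rmse <- function(predicted, observed) {
  sqrt(mean((predicted - observed)^2))
}

test_that("rmse is zero for perfect predictions", {
  expect_equal(rmse(c(1, 2, 3), c(1, 2, 3)), 0)
})

test_that("rmse matches a hand-computed value", {
  expect_equal(rmse(c(0, 0), c(3, 4)), sqrt(12.5))
})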
Who This Book Is For
Those with some data science or analytics background, but not necessarily experience with the R programming language.
Thomas Mailund is an associate professor in bioinformatics at Aarhus University, Denmark. His background is in math and computer science, but for the last decade his main focus has been on genetics and evolutionary studies, particularly comparative genomics, speciation, and gene flow between emerging species.
Further Information & Material
1;Contents at a Glance;4
2;Contents;5
3;About the Author;16
4;About the Technical Reviewer;17
5;Acknowledgments;18
6;Introduction;19
7;Chapter 1: Introduction to R Programming;24
7.1;Basic Interaction with R;24
7.2;Using R as a Calculator;26
7.2.1;Simple Expressions;26
7.2.2;Assignments;28
7.2.3;Actually, All of the Above Are Vectors of Values…;28
7.2.4;Indexing Vectors;29
7.2.5;Vectorized Expressions;30
7.3;Comments;31
7.4;Functions;31
7.4.1;Getting Documentation for Functions;32
7.4.2;Writing Your Own Functions;33
7.4.3;Vectorized Expressions and Functions;35
7.5;A Quick Look at Control Structures;35
7.6;Factors;39
7.7;Data Frames;41
7.8;Dealing with Missing Values;43
7.9;Using R Packages;44
7.10;Data Pipelines (or Pointless Programming);45
7.10.1;Writing Pipelines of Function Calls;46
7.10.2;Writing Functions that Work with Pipelines;46
7.10.3;The magical “.” argument;47
7.10.4;Defining Functions Using .;48
7.10.5;Anonymous Functions;49
7.10.6;Other Pipeline Operations;50
7.11;Coding and Naming Conventions;51
7.12;Exercises;51
7.12.1;Mean of Positive Values;51
7.12.2;Root Mean Square Error;51
8;Chapter 2: Reproducible Analysis;52
8.1;Literate Programming and Integration of Workflow and Documentation;53
8.2;Creating an R Markdown/knitr Document in RStudio;53
8.3;The YAML Language;56
8.4;The Markdown Language;57
8.4.1;Formatting Text;58
8.4.2;Cross-Referencing;61
8.4.3;Bibliographies;62
8.4.4;Controlling the Output (Templates/Stylesheets);62
8.5;Running R Code in Markdown Documents;63
8.5.1;Using Chunks when Analyzing Data (Without Compiling Documents);65
8.5.2;Caching Results;66
8.5.3;Displaying Data;66
8.6;Exercises;67
8.6.1;Create an R Markdown Document;67
8.6.2;Produce Different Output;67
8.6.3;Add Caching;67
9;Chapter 3: Data Manipulation;68
9.1;Data Already in R;68
9.2;Quickly Reviewing Data;70
9.3;Reading Data;71
9.4;Examples of Reading and Formatting Datasets;72
9.4.1;Breast Cancer Dataset;72
9.4.2;Boston Housing Dataset;78
9.4.3;The readr Package;79
9.5;Manipulating Data with dplyr;81
9.5.1;Some Useful dplyr Functions;82
9.5.1.1;select(): Pick Selected Columns and Get Rid of the Rest;82
9.5.1.2;mutate(): Add Computed Values to Your Data Frame;84
9.5.1.3;transmute(): Add Computed Values to Your Data Frame and Get Rid of All Other Columns;85
9.5.1.4;arrange(): Reorder Your Data Frame by Sorting Columns;85
9.5.1.5;filter(): Pick Selected Rows and Get Rid of the Rest;86
9.5.1.6;group_by(): Split Your Data Into Subtables Based on Column Values;87
9.5.1.7;summarise/summarize(): Calculate Summary Statistics;87
9.5.2;Breast Cancer Data Manipulation;88
9.6;Tidying Data with tidyr;92
9.7;Exercises;95
9.7.1;Importing Data;96
9.7.2;Using dplyr;96
9.7.3;Using tidyr;96
10;Chapter 4: Visualizing Data;97
10.1;Basic Graphics;97
10.2;The Grammar of Graphics and the ggplot2 Package;105
10.2.1;Using qplot();106
10.2.2;Using Geometries;110
10.2.3;Facets;119
10.2.4;Scaling;122
10.2.5;Themes and Other Graphics Transformations;127
10.3;Figures with Multiple Plots;131
10.4;Exercises;133
11;Chapter 5: Working with Large Datasets;134
11.1;Subsample Your Data Before You Analyze the Full Dataset;134
11.2;Running Out of Memory During Analysis;136
11.3;Too Large to Plot;137
11.4;Too Slow to Analyze;141
11.5;Too Large to Load;142
11.6;Exercises;145
11.6.1;Subsampling;145
11.6.2;Hex and 2D Density Plots;145
12;Chapter 6: Supervised Learning;146
12.1;Machine Learning;146
12.2;Supervised Learning;146
12.2.1;Regression versus Classification;147
12.2.2;Inference versus Prediction;148
12.3;Specifying Models;149
12.3.1;Linear Regression;149
12.3.2;Logistic Regression (Classification, Really);154
12.3.3;Model Matrices and Formula;157
12.4;Validating Models;166
12.4.1;Evaluating Regression Models;166
12.4.2;Evaluating Classification Models;168
12.4.2.1;Confusion Matrix;169
12.4.2.2;Accuracy;170
12.4.2.3;Sensitivity and Specificity;172
12.4.2.4;Other Measures;173
12.4.2.5;More Than Two Classes;174
12.4.3;Random Permutations of Your Data;174
12.4.4;Cross-Validation;178
12.4.5;Selecting Random Training and Testing Data;180
12.5;Examples of Supervised Learning Packages;182
12.5.1;Decision Trees;182
12.5.2;Random Forests;184
12.5.3;Neural Networks;185
12.5.4;Support Vector Machines;186
12.6;Naive Bayes;186
12.7;Exercises;187
12.7.1;Fitting Polynomials;187
12.7.2;Evaluating Different Classification Measures;187
12.7.3;Breast Cancer Classification;187
12.7.4;Leave-One-Out Cross-Validation (Slightly More Difficult);188
12.7.5;Decision Trees;188
12.7.6;Random Forests;188
12.7.7;Neural Networks;188
12.7.8;Support Vector Machines;188
12.7.9;Compare Classification Algorithms;188
13;Chapter 7: Unsupervised Learning;189
13.1;Dimensionality Reduction;189
13.1.1;Principal Component Analysis;189
13.1.2;Multidimensional Scaling;197
13.2;Clustering;201
13.2.1;k-Means Clustering;202
13.2.2;Hierarchical Clustering;208
13.3;Association Rules;212
13.4;Exercises;216
13.4.1;Dealing with Missing Data in the HouseVotes84 Data;216
13.4.2;Rescaling for k-Means Clustering;216
13.4.3;Varying k;216
13.5;Project 1;216
13.5.1;Importing Data;217
13.5.2;Exploring the Data;218
13.5.2.1;Distribution of Quality Scores;218
13.5.2.2;Is This Wine Red or White?;219
13.6;Fitting Models;223
13.7;Exercises;224
13.7.1;Exploring Other Formulas;224
13.7.2;Exploring Different Models;224
13.7.3;Analyzing Your Own Dataset;224
14;Chapter 8: More R Programming;225
14.1;Expressions;225
14.1.1;Arithmetic Expressions;225
14.1.2;Boolean Expressions;226
14.2;Basic Data Types;227
14.2.1;The Numeric Type;227
14.2.2;The Integer Type;228
14.2.3;The Complex Type;228
14.2.4;The Logical Type;228
14.2.5;The Character Type;229
14.3;Data Structures;229
14.3.1;Vectors;229
14.3.2;Matrix;230
14.3.3;Lists;232
14.3.4;Indexing;233
14.3.5;Named Values;235
14.3.6;Factors;236
14.3.7;Formulas;236
14.4;Control Structures;236
14.4.1;Selection Statements;236
14.4.2;Loops;238
14.4.3;A Word of Warning About Looping;239
14.5;Functions;240
14.5.1;Named Arguments;241
14.5.2;Default Parameters;242
14.5.3;Return Values;242
14.5.4;Lazy Evaluation;243
14.5.5;Scoping;244
14.5.6;Function Names Are Different from Variable Names;247
14.6;Recursive Functions;247
14.7;Exercises;249
14.7.1;Fibonacci Numbers;249
14.7.2;Outer Product;249
14.7.3;Linear Time Merge;249
14.7.4;Binary Search;250
14.7.5;More Sorting;250
14.7.6;Selecting the k Smallest Element;251
15;Chapter 9: Advanced R Programming;252
15.1;Working with Vectors and Vectorizing Functions;252
15.1.1;ifelse;254
15.1.2;Vectorizing Functions;254
15.1.3;The apply Family;256
15.1.3.1;apply;257
15.1.3.2;lapply;259
15.1.3.3;sapply and vapply;260
15.2;Advanced Functions;261
15.2.1;Special Names;261
15.2.2;Infix Operators;261
15.2.3;Replacement Functions;262
15.3;How Mutable Is Data Anyway?;264
15.4;Functional Programming;265
15.4.1;Anonymous Functions;265
15.4.2;Functions Taking Functions as Arguments;266
15.4.3;Functions Returning Functions (and Closures);266
15.4.4;Filter, Map, and Reduce;267
15.5;Function Operations: Functions as Input and Output;269
15.5.1;Ellipsis Parameters;272
15.6;Exercises;274
15.6.1;between;274
15.6.2;apply_if;274
15.6.3;power;274
15.6.4;Row and Column Sums;274
15.6.5;Factorial Again;274
15.6.6;Function Composition;275
16;Chapter 10: Object Oriented Programming;276
16.1;Immutable Objects and Polymorphic Functions;276
16.2;Data Structures;276
16.2.1;Example: Bayesian Linear Model Fitting;277
16.3;Classes;278
16.4;Polymorphic Functions;280
16.4.1;Defining Your Own Polymorphic Functions;281
16.5;Class Hierarchies;282
16.5.1;Specialization as Interface;282
16.5.2;Specialization in Implementations;283
16.6;Exercises;286
16.6.1;Shapes;286
16.6.2;Polynomials;286
17;Chapter 11: Building an R Package;287
17.1;Creating an R Package;287
17.1.1;Package Names;287
17.1.2;The Structure of an R Package;288
17.1.3;.Rbuildignore;288
17.1.4;Description;289
17.1.4.1;Title;289
17.1.4.2;Version;289
17.1.4.3;Description;290
17.1.4.4;Author and Maintainer;290
17.1.4.5;License;290
17.1.4.6;Type, Date, LazyData;290
17.1.4.7;URL and BugReports;290
17.1.4.8;Dependencies;291
17.1.4.9;Using an Imported Package;291
17.1.4.10;Using a Suggested Package;292
17.1.5;NAMESPACE;292
17.1.6;R/ and man/;293
17.2;Roxygen;293
17.2.1;Documenting Functions;293
17.2.2;Import and Export;294
17.2.3;Package Scope Versus Global Scope;295
17.2.4;Internal Functions;295
17.2.5;File Load Order;295
17.3;Adding Data to Your Package;296
17.4;Building an R Package;297
17.5;Exercises;298
18;Chapter 12: Testing and Package Checking;299
18.1;Unit Testing;299
18.1.1;Automating Testing;300
18.1.2;Using testthat;301
18.1.3;Writing Good Tests;302
18.1.4;Using Random Numbers in Tests;303
18.1.5;Testing Random Results;303
18.2;Checking a Package for Consistency;304
18.3;Exercise;304
19;Chapter 13: Version Control;305
19.1;Version Control and Repositories;305
19.2;Using git in RStudio;306
19.2.1;Installing git;306
19.2.2;Making Changes to Files, Staging Files, and Committing Changes;307
19.2.3;Adding git to an Existing Project;309
19.2.4;Bare Repositories and Cloning Repositories;309
19.2.5;Pushing Local Changes and Fetching and Pulling Remote Changes;310
19.2.6;Handling Conflicts;312
19.2.7;Working with Branches;312
19.2.8;Typical Workflows Involve Lots of Branches;315
19.2.9;Pushing Branches to the Global Repository;315
19.3;GitHub;315
19.3.1;Moving an Existing Repository to GitHub;317
19.3.2;Installing Packages from GitHub;318
19.4;Collaborating on GitHub;318
19.4.1;Pull Requests;318
19.4.2;Forking Repositories Instead of Cloning;319
19.5;Exercises;319
20;Chapter 14: Profiling and Optimizing;320
20.1;Profiling;320
20.1.1;A Graph-Flow Algorithm;321
20.2;Speeding Up Your Code;332
20.3;Parallel Execution;334
20.4;Switching to C++;337
20.5;Exercises;339
20.6;Project 2;339
20.7;Bayesian Linear Regression;340
20.7.1;Exercises: Priors and Posteriors;341
20.7.1.1;Sample from a Multivariate Normal Distribution;341
20.7.1.2;Computing the Posterior Distribution;343
20.7.2;Predicting Target Variables for New Predictor Values;345
20.8;Formulas and Their Model Matrix;347
20.8.1;Working with Model Matrices in R;348
20.8.2;Exercises;351
20.8.2.1;Building Model Matrices;351
20.8.2.2;Fitting General Models;351
20.8.3;Model Matrices Without Response Variables;351
20.8.4;Exercises;352
20.8.4.1;Model Matrices for New Data;352
20.8.4.2;Predicting New Targets;352
20.9;Interface to a blm Class;353
20.9.1;Constructor;353
20.9.2;Updating Distributions: An Example Interface;354
20.9.3;Designing Your blm Class;357
20.9.4;Model Methods;357
20.9.4.1;coefficients;357
20.9.4.2;confint;358
20.9.4.3;deviance;358
20.9.4.4;fitted;358
20.9.4.5;plot;358
20.9.4.6;predict;358
20.9.4.7;print;358
20.9.4.8;residuals;359
20.9.4.9;summary;359
20.10;Building an R Package for blm;359
20.10.1;Deciding on the Package Interface;359
20.10.2;Organization of Source Files;359
20.10.3;Document Your Package Interface Well;360
20.10.4;Adding README and NEWS Files to Your Package;360
20.10.4.1;README;361
20.10.4.2;NEWS;361
20.11;Testing;361
20.12;GitHub;361
20.13;Conclusions;361
20.13.1;Data Science;362
20.13.2;Machine Learning;362
20.13.3;Data Analysis;362
20.13.4;R Programming;362
20.14;The End;363
20.15;Acknowledgements;363
21;Index;364