A Soft Introduction to Machine Learning

Important machine learning libraries

I created this document to serve as an easy introduction to basic machine learning methods. It is aimed at individuals who have a basic understanding of R and of mathematics.

To run through this course yourself you will need to clone the repository:
https://github.com/lurd4862/Machine_learning_course_work.git

You will also need these packages installed:
- tidyr
- caTools
- ggplot2
- purrr
- dplyr
- data.table
- cluster
- e1071
- rpart
- caret
- xgboost

Future additions

  • Will add more on:
    • Thompson Sampling
    • Natural language processing
    • Deep learning

Part 1 - Data Preprocessing

Summary of things to do when processing the data for machine learning purposes:
- Import data
- Deal with NAs
- Encode categorical/text variables into numbers
- Split into test and train
- Apply feature scaling

For part 1 I will import some data first

#Import the data from 'data.csv'
dataset <-
  read.csv("../../static/data/intro_ml/Part_1/7_Data.csv", header = TRUE)

#method for dealing with NA using mean values
dataset$Age <- ifelse(is.na(dataset$Age),mean(dataset$Age,na.rm = T),dataset$Age)
dataset$Salary <- ifelse(is.na(dataset$Salary),mean(dataset$Salary,na.rm = T),dataset$Salary)

X <-
  dataset[,-4]
Y <-
  dataset[, 4]

#Encode the categorical variables into numbers

#Purchased variable
dataset$Purchased <- ifelse(dataset$Purchased == "Yes", 1, 0)

dataset$Country <- dataset$Country %>% factor(levels = c("France", "Spain", "Germany"),
                                              labels = c(1, 2, 3))

# #If we want to further spread the Country into classification columns (as with logistic regression) I would use this:
# dataset %>%
#   spread(Country, Country) %>%
#   rename(France = 1, Spain = 2, Germany = 3)

Splitting the dataset into a training and test set

#We use caTools
set.seed(123)
split <-
  sample.split(dataset$Purchased, SplitRatio = 0.8)
training_set <- subset(dataset, split == T)
test_set <- subset(dataset, split == F)

##Or the shorter way
# Train <- dataset[split,]
# Test <- dataset[!split,]

Feature scaling

Why feature scale? Because the different features (variables of the training data) are not on the same scale, and because machine learning models use methods such as Euclidean distance, the squaring of parameters may make some features more significant than others, which we do not want.

  • Avoid squared interactions overpowering small bias/observations

How do we feature scale?
- Standardisation: x_stand <- (x - mean(x)) / sd(x)
- Normalisation: x_norm <- (x - min(x)) / (max(x) - min(x))

Scale our features for the data

Using built in tools:

training_set[,2:3] <- scale(training_set[,2:3] )
test_set[,2:3]  <- scale(test_set[,2:3] )

Here is how I decided to achieve scaling before seeing the results:
- My application of the formulas above for normalisation and standardisation does not equal the scale() results, implying that the scale method is somehow optimised.
- Upon reading the help text for scale it appears to centre the data and to use a root mean square rather than a simple standard deviation when standardising.
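
A quick way to check the behaviour described in the help text (a sketch on a small made-up numeric vector, not the course data): with center = FALSE, scale() divides by the root mean square sqrt(sum(x^2)/(n-1)), while with the default center = TRUE the divisor works out to the usual standard deviation.

x <- c(44, 27, 30, 38, 40, 35, 48, 50, 37)
# default: centre by the mean, then divide by sd(x)
all.equal(as.numeric(scale(x)), (x - mean(x)) / sd(x))
# no centring: divide by the root mean square instead of sd(x)
all.equal(as.numeric(scale(x, center = FALSE)),
          x / sqrt(sum(x^2) / (length(x) - 1)))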

Manually Scale Age

# #Standardised
# dataset$Age <-
#   sapply(dataset$Age, function(x) (x-mean(dataset$Age))/sd(dataset$Age))
#
# #Normalised
# dataset$Age <-
#   sapply(dataset$Age, function(x) (x-min(dataset$Age))/(max(dataset$Age)-min(dataset$Age)))

Manually Scale Salary

#
# #Standardised
# dataset$Salary <-
#   sapply(dataset$Salary, function(x) (x-mean(dataset$Salary))/sd(dataset$Salary))
#
# #Normalised
# dataset$Salary <-
#   sapply(dataset$Salary, function(x) (x-min(dataset$Salary))/(max(dataset$Salary)-min(dataset$Salary)))

Part 2 - Regression

set.seed(123)

Simple Linear Regression

First we fit a linear model to the data

dataset <-
read.csv(file = "../../static/data/intro_ml/Part2_Regression/Simple_Linear_Regression/Salary_Data.csv")

dataset %>% tbl_df()
## # A tibble: 30 x 2
##    YearsExperience Salary
##              <dbl>  <dbl>
##  1            1.10  39343
##  2            1.30  46205
##  3            1.50  37731
##  4            2.00  43525
##  5            2.20  39891
##  6            2.90  56642
##  7            3.00  60150
##  8            3.20  54445
##  9            3.20  64445
## 10            3.70  57189
## # ... with 20 more rows
split <-
  sample.split(dataset$Salary, SplitRatio = 2/3)

training_set <-
  dataset[split,]

test_set <-
  dataset[!split,]

regressor <-
  lm(formula = Salary~., training_set)

Notice that we always test using the smaller split of the data

Now we Predict our values using the test data

y_pred <-
  predict(regressor, newdata = test_set) %>%
  data.frame

#Plot the predictions over the test
ggplot() +
  geom_point(data = test_set, aes(x = test_set$YearsExperience,
                                  y = test_set$Salary)) +
  geom_line(data = y_pred, aes(x = test_set$YearsExperience,
                               y = y_pred$.))+
  stat_smooth()

The prediction is obviously pretty good since the data was mostly linear already.
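
To put a rough number on "pretty good" (a quick sketch using the objects created above), we can compute the root-mean-square error and squared correlation on the test set:

rmse <- sqrt(mean((test_set$Salary - y_pred$.)^2))
r_squared <- cor(test_set$Salary, y_pred$.)^2
c(RMSE = rmse, R_squared = r_squared)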

Multiple linear regression

Read in the data:

dataset <-
read.csv("../../static/data/intro_ml/Part2_Regression/Multiple-Linear-Regression/Multiple_Linear_Regression/50_Startups.csv")
dataset %>% tbl_df()
## # A tibble: 50 x 5
##    R.D.Spend Administration Marketing.Spend State      Profit
##        <dbl>          <dbl>           <dbl> <fct>       <dbl>
##  1    165349         136898          471784 New York   192262
##  2    162598         151378          443899 California 191792
##  3    153442         101146          407935 Florida    191050
##  4    144372         118672          383200 New York   182902
##  5    142107          91392          366168 Florida    166188
##  6    131877          99815          362861 New York   156991
##  7    134615         147199          127717 California 156123
##  8    130298         145530          323877 Florida    155753
##  9    120543         148719          311613 New York   152212
## 10    123335         108679          304982 California 149760
## # ... with 40 more rows

Pre-process the data:

#State is already read in as a factor; just make sure it is encoded as one
dataset$State <- factor(dataset$State)

Split the data:

split <-
  sample.split(dataset$Profit, SplitRatio = 8/10)

training_set <-
  dataset[split,]


test_set <-
  dataset[!split,]

Run the multiple linear regression model and compare the results

regressor <-
  lm(formula = Profit~., training_set)

summary(regressor)
## 
## Call:
## lm(formula = Profit ~ ., data = training_set)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -32392  -3950   -106   5102  19622 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      5.055e+04  8.802e+03   5.744 1.85e-06 ***
## R.D.Spend        8.223e-01  5.412e-02  15.194  < 2e-16 ***
## Administration  -3.815e-02  6.814e-02  -0.560    0.579    
## Marketing.Spend  2.176e-02  2.092e-02   1.040    0.306    
## StateFlorida     5.770e+01  4.013e+03   0.014    0.989    
## StateNew York   -1.377e+03  3.780e+03  -0.364    0.718    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9753 on 34 degrees of freedom
## Multiple R-squared:  0.9451, Adjusted R-squared:  0.9371 
## F-statistic: 117.2 on 5 and 34 DF,  p-value: < 2.2e-16
Profit_pred <-
  predict(regressor, test_set)

Compare_df <-
  cbind(test_set, Profit_pred)

Compare_df %>% mutate(Accuracy = abs(Profit_pred-Profit)/Profit)
##    R.D.Spend Administration Marketing.Spend      State    Profit
## 1  165349.20      136897.80        471784.1   New York 192261.83
## 2  162597.70      151377.59        443898.5 California 191792.06
## 3  144372.41      118671.85        383199.6   New York 182901.99
## 4   86419.70      153514.11             0.0   New York 122776.86
## 5   73994.56      122782.75        303319.3    Florida 110352.25
## 6   66051.52      182645.56        118148.2    Florida 103282.38
## 7   46426.07      157693.92        210797.7 California  96712.80
## 8   28663.76      127056.21        201126.8    Florida  90708.19
## 9   44069.95       51283.14        197029.4 California  89949.14
## 10  20229.59       65947.93        185265.1   New York  81229.06
##    Profit_pred    Accuracy
## 1    190190.15 0.010775293
## 2    188145.49 0.019013156
## 3    171708.30 0.061200461
## 4    114385.46 0.068346784
## 5    113374.25 0.027385052
## 6    100529.86 0.026650416
## 7     87301.17 0.097315271
## 8     73710.80 0.187385350
## 9     89123.92 0.009174326
## 10    67326.62 0.171151123

We can tell already that R & D spend is our most important variable, but does this mean no other variable is useful?

Now proceed with backward elimination

p_value <- 0.05

#Remove the State variable
regressor <-
  lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend, dataset)

summary(regressor)
## 
## Call:
## lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend, 
##     data = dataset)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -33534  -4795     63   6606  17275 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      5.012e+04  6.572e+03   7.626 1.06e-09 ***
## R.D.Spend        8.057e-01  4.515e-02  17.846  < 2e-16 ***
## Administration  -2.682e-02  5.103e-02  -0.526    0.602    
## Marketing.Spend  2.723e-02  1.645e-02   1.655    0.105    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9232 on 46 degrees of freedom
## Multiple R-squared:  0.9507, Adjusted R-squared:  0.9475 
## F-statistic:   296 on 3 and 46 DF,  p-value: < 2.2e-16
#Remove the Administration variable
regressor <-
  lm(formula = Profit ~ R.D.Spend + Marketing.Spend, dataset)

summary(regressor)
## 
## Call:
## lm(formula = Profit ~ R.D.Spend + Marketing.Spend, data = dataset)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -33645  -4632   -414   6484  17097 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     4.698e+04  2.690e+03  17.464   <2e-16 ***
## R.D.Spend       7.966e-01  4.135e-02  19.266   <2e-16 ***
## Marketing.Spend 2.991e-02  1.552e-02   1.927     0.06 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9161 on 47 degrees of freedom
## Multiple R-squared:  0.9505, Adjusted R-squared:  0.9483 
## F-statistic: 450.8 on 2 and 47 DF,  p-value: < 2.2e-16
#Remove the State & Marketing spend variable
regressor <-
  lm(formula = Profit ~ R.D.Spend + Administration, dataset)

summary(regressor)
## 
## Call:
## lm(formula = Profit ~ R.D.Spend + Administration, data = dataset)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -34006  -4801   -303   6034  17843 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     5.489e+04  6.017e+03   9.122  5.7e-12 ***
## R.D.Spend       8.621e-01  3.016e-02  28.589  < 2e-16 ***
## Administration -5.300e-02  4.940e-02  -1.073    0.289    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9402 on 47 degrees of freedom
## Multiple R-squared:  0.9478, Adjusted R-squared:  0.9456 
## F-statistic: 426.8 on 2 and 47 DF,  p-value: < 2.2e-16

After trying different combinations of variables we can see that the marketing spend is also an important variable but some of its effect is already captured by other variables.
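
As an aside (a sketch, not part of the original walk-through), base R's step() automates this backward elimination using AIC instead of manual p-value checks:

full_model <- lm(formula = Profit ~ ., dataset)
step(full_model, direction = "backward")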

Polynomial Regression

Read in the data

dataset <-
read.csv("../../static/data/intro_ml/Part3_Polynomial_Regression/Polynomial_Regression/Position_Salaries.csv")

dataset %>% tbl_df()
## # A tibble: 10 x 3
##    Position          Level  Salary
##    <fct>             <int>   <int>
##  1 Business Analyst      1   45000
##  2 Junior Consultant     2   50000
##  3 Senior Consultant     3   60000
##  4 Manager               4   80000
##  5 Country Manager       5  110000
##  6 Region Manager        6  150000
##  7 Partner               7  200000
##  8 Senior Partner        8  300000
##  9 C-level               9  500000
## 10 CEO                  10 1000000

Create a linear regression model to compare (data is exponential)

regressor <- lm(formula = Salary~Level, dataset)

Linear_regressor <-
  predict(regressor, dataset) %>% data.frame()

ggplot() +
  geom_point(data = dataset, aes(x = dataset$Level,
                                  y = dataset$Salary)) +
  geom_line(data = dataset, aes(x = dataset$Level,
                               y = Linear_regressor$.))

Value_Prediction <-
  predict(regressor, data.frame(Level = 6.5))

We can tell that a linear model won’t do here, no QQ plot needed. So if we were to predict the salary between level 6 and 7 our estimate would be off:

Value_Prediction
##        1 
## 330378.8

Now we create the polynomial regressor…

Specifying every polynomial term by hand is actually very rigid and arduous, so I used the poly function, which enables a more accurate and also easier polynomial regression (pretty awesome function; it uses orthogonal polynomials).
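
For comparison, this is roughly what the rigid manual approach looks like (a sketch; each power has to be wrapped in I() by hand):

Manual_Poly_Regressor <-
  lm(formula = Salary~Level + I(Level^2) + I(Level^3) + I(Level^4), dataset)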

Poly_Regressor <-
  lm(formula = Salary~poly(Level,9), dataset)
Poly_predict <-
  predict(Poly_Regressor, dataset) %>% data.frame()

ggplot() +
  geom_point(data = dataset, aes(x = dataset$Level,
                                  y = dataset$Salary)) +
  geom_line(data = Poly_predict, aes(x = dataset$Level,
                               y = Poly_predict$.))

Value_Prediction <-
  predict(Poly_Regressor, data.frame(Level = 6.5))

Value_Prediction
##        1 
## 172235.3

This fits pretty well…

But what about a log transform model?

Log_Regressor <-
  lm(formula = log(Salary)~Level, dataset)

Log_prediction <-
  exp(predict(Log_Regressor, dataset)) %>% data.frame()

ggplot() +
  geom_point(data = dataset, aes(x = dataset$Level,
                                  y = dataset$Salary)) +
  geom_line(data = Log_prediction, aes(x = dataset$Level,
                               y = Log_prediction$.))

# TruthOrBluff <-
#   Log_prediction[which(Log_prediction$. <= 160000),1]

Actually, the idea of a log regression is probably not a bad approach if you require a smooth line without a high-order polynomial, which could artificially inflate R-squared.

Support vector machine regression

SVR is very straightforward to implement. We call the svm() function from the e1071 package much like lm(). The function will automatically use classification algorithms if we are regressing a factor variable in the formula. It is still a good idea to specify the method outright.

Use the following main types:
- type = C-classification for classification
- type = eps-regression for regression of numeric variables.

Remember to do the usual data preparation: encode factors and scale variables. This will give you the best predictions.

Load the data

data <- read.csv(file = "../../static/data/intro_ml/Part2_Regression/SVR/SVR/Position_Salaries.csv")
head(data)
##            Position Level Salary
## 1  Business Analyst     1  45000
## 2 Junior Consultant     2  50000
## 3 Senior Consultant     3  60000
## 4           Manager     4  80000
## 5   Country Manager     5 110000
## 6    Region Manager     6 150000

Create SVR regressor and prediction

regressor <- svm(formula = Salary ~ .,data = data, type = "eps-regression")
y_pred <- predict(regressor, data) %>% data.frame()

Plot the regression

# Value_Prediction <-
#   predict(regressor, data.frame(Level = 6.5, Position = modelr::typical(data$Position)))

ggplot() +
  geom_point(data = data, aes(x = data$Level,
                                  y = data$Salary)) +
  geom_line(data = y_pred, aes(x = data$Level,
                               y = y_pred$.))

  # geom_point(data = y_pred, aes(x = 6.5,
  #                              y = Value_Prediction))

Not bad but seems to struggle with the exponential tail here, needs more data to be accurate.

Regression Trees

Intuition

CART stands for classification and regression trees.
There are, therefore, 2 types of trees (as usual).

Watch the intuition video here: https://www.udemy.com/machinelearning/learn/v4/t/lecture/5732730?start=0

Decision trees are based on information entropy. Entropy is a measure of uncertainty, or the value of an additional data point in terms of information. (Classification trees split on entropy or similar impurity measures; regression trees typically split on variance reduction, but the intuition is the same.)

Entropy is defined as \(E(-\ln(P(X)))\). It is often taken with base 2, in which case it is measured in bits. Base 2 is computationally more efficient and maps naturally onto binary outcomes. In the case of a Bernoulli process with \(p = 1/2\) the entropy would be \(\sum_i -P(x_i)\ln(P(x_i)) = \sum_i -\tfrac{1}{2}\ln(\tfrac{1}{2}) = \ln 2\). The entropy is maximised when the odds are 1/2 because then we are the least certain.

The algorithm will divide the data into branches wherever a split reduces uncertainty by more than a certain threshold. Once it has created enough branches it will make the regression prediction by taking the average of the branch into which the values you are predicting for fall.
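
As a small numeric illustration of the uncertainty point (a sketch, not part of rpart itself), the Bernoulli entropy peaks at p = 0.5, where we are least certain:

p <- seq(0.01, 0.99, by = 0.01)
entropy <- -p * log(p) - (1 - p) * log(1 - p)
p[which.max(entropy)] # maximised at p = 0.5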

Load the data

dataset <-
read.csv(file = "../../static/data/intro_ml/Part2_Regression/Decision_Tree_Regression/Decision_Tree_Regression/Position_Salaries.csv")

Run the regression

regressor <- rpart(formula = Salary ~ ., data = dataset)

y_predict <- predict(regressor, newdata = dataset) %>% data.frame

regressor %>% summary
## Call:
## rpart(formula = Salary ~ ., data = dataset)
##   n= 10 
## 
##     CP nsplit rel error xerror xstd
## 1 0.01      0         1      0    0
## 
## Node number 1: 10 observations
##   mean=249500, MSE=8.066225e+10

Plot results

ggplot() +
  geom_point(data = dataset, aes(x = dataset$Level,
                                  y = dataset$Salary)) +
  geom_line(data = y_predict, aes(x = dataset$Level,
                               y = y_predict$.))

What happened!

No clear splits were formed, so the model just took an average in 1 branch.

Let’s create more branches:

regressor <- rpart(formula = Salary ~ ., data = dataset, control = rpart.control(minsplit = 1))

y_predict <- predict(regressor, newdata = dataset) %>% data.frame

ggplot() +
  geom_point(data = dataset, aes(x = dataset$Level,
                                  y = dataset$Salary)) +
  geom_line(data = y_predict, aes(x = dataset$Level,
                               y = y_predict$.))

In reality the decision tree just returns an average for each branch, so it doesn't produce a smooth graph. It returns the average of whichever branch/decision segment your independent predictor variables fall into.
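
To make the step shape explicit (a sketch; I refit on Level only here so we can predict over a fine grid without needing the Position column):

step_tree <- rpart(formula = Salary ~ Level, data = dataset,
                   control = rpart.control(minsplit = 1))
level_grid <- data.frame(Level = seq(1, 10, by = 0.01))

ggplot() +
  geom_point(data = dataset, aes(x = Level, y = Salary)) +
  geom_line(aes(x = level_grid$Level,
                y = predict(step_tree, newdata = level_grid)))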

Random forest regression

Random forest is one method of ensemble learning. Generally speaking, ensemble learning means fitting a model repeatedly and collating the results to create a more powerful and robust model.

Random forest intuition:
1. Pick n random points from the training set.
2. Build a decision tree on these n points.
3. Combine the trees to build a forest by predicting Y in each terminal node and taking the average (either by counting out of the total or by predicting a probability in each tree and taking the mean).

Load data

salary_data <- data.table::fread("../../static/data/intro_ml/Part2_Regression/Random_Forest_Regression/Position_Salaries.csv")
head(salary_data)
##             Position Level Salary
## 1:  Business Analyst     1  45000
## 2: Junior Consultant     2  50000
## 3: Senior Consultant     3  60000
## 4:           Manager     4  80000
## 5:   Country Manager     5 110000
## 6:    Region Manager     6 150000

This is again the salary data

Fit model

library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
predictor <-
  randomForest::randomForest(data = salary_data,Salary~Level,ntree=10)

and now we predict an arbitrary value:

y_predict <- predict(predictor, data.frame(Level = 6.5))

Let’s plot this 2d relationship to understand the naive model:

x_range <- seq(min(salary_data$Level),max(salary_data$Level),0.01)

ggplot(data = salary_data, aes(x = Level,y = Salary), color = "red")+
  geom_point()+
  geom_smooth()+
  geom_line(data = data.frame(Level = x_range),aes(x=x_range,y=predict(predictor,newdata=data.frame(Level = Level))), color = "blue")
## `geom_smooth()` using method = 'loess'

So in this simplified example we can nicely show how the random forest attempts to predict the outcome variable in 2 dimensions. Obviously the model is much more powerful when working with larger and higher-dimensional data. The randomForest package is particularly useful when explanatory variables are binary classes or multi-level factor variables, since it has internal ways to dummify these cases and deal with them.

Pay particular attention to overfitting your data (test out-of-sample accuracy) and also to class imbalances when your variables are intricate.

A more robust application of machine learning regressions (random forest)

With real data the machine learning model will have to be trained and tested using a bit more rigor. In that case I suggest that the practitioner use the caret workflow.

Using the Iris dataset we illustrate:

classification_data = iris

cvCtrl = caret::trainControl(method = "LGOCV",
                             p = 0.8,
                             number = 10,
                             savePredictions = T)

caret_forest_model <-
      caret::train(data = classification_data, Species~., method = "rf", keep.forest=TRUE, trControl = cvCtrl
                   # , sampsize=10000,ntree=40
                   )

caret_forest_model %>% plot

varImp(caret_forest_model) %>% plot

caret_forest_model$results
##   mtry  Accuracy Kappa AccuracySD    KappaSD
## 1    2 0.9466667 0.920 0.04766136 0.07149204
## 2    3 0.9466667 0.920 0.04766136 0.07149204
## 3    4 0.9433333 0.915 0.04458312 0.06687468

So we got roughly 95% accuracy when classifying the species using 10 leave-group-out cross-validation resamples with an 80% training split

We can clearly see from the accuracy graph that we did not have enough data to improve our model by using more complicated predictors (a larger mtry) in our trees

In this short pipeline you can split your data into training and testing sets, k-fold cross-validate, use up- or down-sampling methods to deal with class imbalances (a small sketch follows below), run on multiple cores and tune your parameters automatically or manually for a very large number of models. The tuning of parameters can be done manually by setting:

grid <- expand.grid(size=c(5,10,20,50), k=c(1,2,3,4,5))
model <- train(Species~., data=iris, method="lvq", trControl=cvCtrl, tuneGrid=grid)

To run on multiple cores use the following workflow before training the model:

  library(doParallel)
  cores <- parallel::detectCores() - 1 # e.g. leave one core free
  cl <- makeCluster(cores)
  registerDoParallel(cl)

  # ---> train

  stopCluster(cl)
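
The up/down-sampling mentioned above can also be requested straight from trainControl; a minimal sketch (here "down" randomly down-samples the majority class within each resample):

cvCtrl_downsampled <- caret::trainControl(method = "cv",
                                          number = 10,
                                          sampling = "down")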

Part 3 - Clustering

K-means

Load the data

dataset <-
  data.table::fread("../../static/data/intro_ml/Part_4_Clustering/K_Means/K_Means/Mall_Customers.csv")
  # read.csv(file = "Part_4_Clustering/K_Means/K_Means/Mall_Customers.csv")

head(dataset)
##    CustomerID  Genre Age Annual Income (k$) Spending Score (1-100)
## 1:          1   Male  19                 15                     39
## 2:          2   Male  21                 15                     81
## 3:          3 Female  20                 16                      6
## 4:          4 Female  23                 16                     77
## 5:          5 Female  31                 17                     40
## 6:          6 Female  22                 17                     76

Find the best number of clusters using the elbow method. Generally there is a threshold where creating more groups will not be meaningful.

subset <-
  dataset %>% select(`Annual Income (k$)`, `Spending Score (1-100)`)

wcss <- vector()

for(i in 1:50){
  wcss[i] <- sum(kmeans(subset,i)$withinss)
}

wcss %>% plot

It looks like ~5 clusters will give us a meaningful grouping.

Create the clusters

On the annual income and spending score variables:

cluster_1 <- kmeans(dataset %>% select(4:5), 5, iter.max = 1000, nstart = 10)

Plot the clusters

clusplot(
  dataset %>% select(4:5),
  cluster_1$cluster,
  lines = 0,
  shade = FALSE,
  color = TRUE,
  labels = 1,
  plotchar = FALSE,
  span = TRUE
  # main = "main title",
  # xlab = "x title",
  # ylab = "y title"
)

Apparently the k-means algorithm thinks these 2 columns describe all the variability. So a regression with only these factors should produce a good R^2.

Let’s try adding some more columns and see how the clustering algorithm behaves:

To include all the other variables we will have to encode the Genre column as a 1/0 indicator so that the algorithm can calculate Euclidean distances; one might also want to normalise the variables.

subset_2 <-
  dataset %>%
  select(Genre:`Spending Score (1-100)`) %>%
  mutate(Genre = factor(Genre, labels = c(0,1),levels = c("Female","Male")))
  # mutate(Genre = factor(Genre, levels = c(0,1),labels = c("Female","Male")))


wcss_2 <- vector()

for(i in 1:50){
  wcss_2[i] <- sum(kmeans(subset_2,i)$withinss)
}

wcss_2 %>% plot

# Quickly look at points and gradients/skewness
wcss_2 %>%
  tbl_df() %>%
  arrange(-value) %>%
  mutate(clusters = seq_along(value),
         gradient = lag(value)-value,
         gradient_rate_change = lag(gradient)-gradient) %>%
  mutate_all(round)
## # A tibble: 50 x 4
##     value clusters gradient gradient_rate_change
##     <dbl>    <dbl>    <dbl>                <dbl>
##  1 308862     1.00       NA                   NA
##  2 219909     2.00    88953                   NA
##  3 157201     3.00    62708                26245
##  4 104415     4.00    52786                 9922
##  5  82657     5.00    21758                31028
##  6  58349     6.00    24308               - 2551
##  7  53238     7.00     5111                19197
##  8  51131     8.00     2107                 3004
##  9  45110     9.00     6021               - 3914
## 10  43507    10.0      1603                 4418
## # ... with 40 more rows
# Using the 2nd derivative as an approximation we use 7 clusters as the optimal number

cluster_2 <- kmeans(subset_2, 7, iter.max = 1000, nstart = 10)

When we look at the kmeans output it has clearly used all the new columns. But the clusplot function can only plot 2 variables, so we don't plot anything here.

Hierarchical clustering

There are two broad approaches to H-clustering:
- Agglomerative
- Divisive

Agglomerative starts from the bottom and builds everything up (using a dendrogram), and Divisive is the opposite.

HC uses distance between clusters.
This can be defined in different ways since a cluster is a group of points!

Steps:
1. Make a cluster for each point
2. Make n-1 clusters by grouping together the closest clusters
3. Repeat step 2 until you have 1 cluster left

Dendrograms (Agglomerative)

A dendrogram is a plot of the distances at which clusters were connected to one another, showing how the clusters were merged given their distances.

For Agglomerative clustering the dendrogram builds larger clusters from smaller clusters, like building a tower.

Choosing the number of clusters using the dendrogram

By looking at the dendrogram we can determine the number of clusters we want to use by identifying some threshold on vertical distance that we do not want to cross.

Here vertical distance refers to the distance between the clusters being joined on the current iteration

Rule of thumb: place the threshold somewhere within the height of the merge with the largest vertical distance. Repeat this process until the next vertical distance is not significantly larger than the previous one.

Read in the data

dataset <-
  fread("../../static/data/intro_ml/Part_4_Clustering/Hierarchical-Clustering/Hierarchical_Clustering/Mall_Customers.csv")

head(dataset)
##    CustomerID  Genre Age Annual Income (k$) Spending Score (1-100)
## 1:          1   Male  19                 15                     39
## 2:          2   Male  21                 15                     81
## 3:          3 Female  20                 16                      6
## 4:          4 Female  23                 16                     77
## 5:          5 Female  31                 17                     40
## 6:          6 Female  22                 17                     76
subset <-
  dataset %>% select(`Annual Income (k$)`,`Spending Score (1-100)`)

Plot the dendrogram

method = "ward.D" will minimise each cluster’s within_variance

HC <- hclust(d = dist(subset, method = "euclidean") ,
                    method = "ward.D")

HC %>% plot

# HC %>% plot(ylim=c(0, 200) )

We can clearly see from this dendrogram that the threshold will have to be pretty low to achieve a reasonable number of groups.

Choose number of clusters:

We now cut the dendrogram tree and choose our number of clusters.

HC_clusters <-
HC %>% cutree(k = 5) # 5 clusters
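
Equivalently (a sketch), the vertical-distance threshold discussed above can be applied directly by cutting at a height read off the plotted dendrogram rather than fixing k; the height of 150 here is purely illustrative:

HC_clusters_by_height <-
HC %>% cutree(h = 150) # undo every merge above this height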

Plot the HC results

clusplot(
  subset,
  HC_clusters,
  lines = 0,
  shade = FALSE,
  color = TRUE,
  labels = 1,
  plotchar = FALSE,
  span = TRUE,
  main = "main title",
  xlab = "x title",
  ylab = "y title"
)

Part 4 - Dimensionality Reduction

Load data

dataset <-
  read.csv(file = "../../static/data/intro_ml/Part_9_Dimensionality_Reduction/PCA/Wine.csv")

dataset %>% head
##   Alcohol Malic_Acid  Ash Ash_Alcanity Magnesium Total_Phenols Flavanoids
## 1   14.23       1.71 2.43         15.6       127          2.80       3.06
## 2   13.20       1.78 2.14         11.2       100          2.65       2.76
## 3   13.16       2.36 2.67         18.6       101          2.80       3.24
## 4   14.37       1.95 2.50         16.8       113          3.85       3.49
## 5   13.24       2.59 2.87         21.0       118          2.80       2.69
## 6   14.20       1.76 2.45         15.2       112          3.27       3.39
##   Nonflavanoid_Phenols Proanthocyanins Color_Intensity  Hue OD280 Proline
## 1                 0.28            2.29            5.64 1.04  3.92    1065
## 2                 0.26            1.28            4.38 1.05  3.40    1050
## 3                 0.30            2.81            5.68 1.03  3.17    1185
## 4                 0.24            2.18            7.80 0.86  3.45    1480
## 5                 0.39            1.82            4.32 1.04  2.93     735
## 6                 0.34            1.97            6.75 1.05  2.85    1450
##   Customer_Segment
## 1                1
## 2                1
## 3                1
## 4                1
## 5                1
## 6                1

We will try out some dimensionality reduction techniques on wine data.

Overview

There are 2 ways of reducing the dimensions of your dataset before fitting a model:
1. Feature selection. Like the backward or forward elimination we did in the Part 2 regression, or using scores. This aims at removing features that do not significantly improve prediction of our dependent variable.
2. Feature extraction. This aims at combining features into fewer variables (dimensions). This is the focus of this chapter.

PCA

Assume customer segment is the result of clustering customers on the independent variables… Using this data we could fit a classification model to better predict which customer segment will like a new wine that has certain values for these variables.

But in order to plot the link between the independent variables and the segment classification we would have to visualize too many variables.

Also, inspecting this using a table is very difficult when you have many variables. Instead we use PCA to reduce the number of dimensions to 2 independent variables so that we can more easily tell a story of how customers are segmented (describe relationships that add customer insight).

Split into test and train sets

set.seed(123)
split <-  sample.split(dataset$Customer_Segment, SplitRatio = 0.8)
training_set <- dataset %>% subset(split)
test_set <- dataset %>% subset(!split)

Feature scaling

training_set[,1:13] <- scale(training_set[,1:13])
test_set[,1:13] <- scale(test_set[,1:13])

Create PCA

There are 2 popular packages for dimensionality reduction:
- caret
- FactoMineR

FactoMineR has some really neat and cool functions for data mining and visualization.
On the other hand, caret fits really well into a machine learning model workflow.

Using Caret

If you want to reduce the dimensionality while explaining at least a certain threshold of the variance in the predictors, you can set this threshold as:
- thresh = 0.7

If however you know how many features you want to extract use:
- pcaComp = n (this will overwrite thresh)

pca_caret <-
  preProcess(x = training_set[, -14],
             method = "pca",
             pcaComp = 2)

prediction_caret <-
  pca_caret %>% predict(newdata = test_set[, -14])

PCA_caret <- prediction_caret %>% cbind(test_set[,14])

prediction_caret %>% plot()
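
Alternatively, a sketch of the thresh option mentioned above: let caret decide how many components are needed to capture, say, 80% of the variance in the predictors. Printing the object reports how many components were retained.

pca_caret_thresh <-
  preProcess(x = training_set[, -14],
             method = "pca",
             thresh = 0.8)

pca_caret_thresh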

This plot is pretty plain looking and it's hard to tell a story from just this… So if you need to explain what your dimensions are made up of, this is not the way to go.

Using FactoMineR

To set number of features here you use
- ncp = n

pca_Facto <-
  FactoMineR::PCA(X = training_set[,-14],
                  ncp = 2)

predict_Facto <-
pca_Facto %>% predict(newdata = test_set[,-14])

pca_Facto %>% plot

PCA_factominer <- predict_Facto$coord %>% data.frame() %>% cbind(test_set[,14])
PCA_factominer %>% head
##       Dim.1     Dim.2 test_set[, 14]
## 4  3.494230 2.7730715              1
## 5  1.040112 0.9872953              1
## 8  1.993220 1.5575554              1
## 11 3.341683 1.2490538              1
## 16 2.244393 1.6405095              1
## 20 2.119558 1.0043728              1
pca_Facto %>% summary()
## 
## Call:
## FactoMineR::PCA(X = training_set[, -14], ncp = 2) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               4.673   2.399   1.537   0.959   0.834   0.676
## % of var.             35.947  18.451  11.823   7.373   6.419   5.203
## Cumulative % of var.  35.947  54.398  66.221  73.594  80.013  85.216
##                        Dim.7   Dim.8   Dim.9  Dim.10  Dim.11  Dim.12
## Variance               0.564   0.363   0.298   0.229   0.216   0.144
## % of var.              4.337   2.796   2.296   1.758   1.660   1.106
## Cumulative % of var.  89.553  92.349  94.645  96.404  98.064  99.170
##                       Dim.13
## Variance               0.108
## % of var.              0.830
## Cumulative % of var. 100.000
## 
## Individuals (the 10 first)
##                          Dist    Dim.1    ctr   cos2    Dim.2    ctr
## 1                    |  3.998 |  3.261  1.603  0.665 |  1.572  0.725
## 2                    |  3.315 |  2.174  0.712  0.430 | -0.320  0.030
## 3                    |  3.348 |  2.510  0.949  0.562 |  1.240  0.451
## 6                    |  3.988 |  2.951  1.313  0.548 |  2.308  1.564
## 7                    |  3.402 |  2.402  0.869  0.498 |  1.327  0.517
## 9                    |  3.509 |  2.427  0.888  0.479 |  1.040  0.318
## 10                   |  3.286 |  2.684  1.085  0.667 |  0.908  0.242
## 12                   |  2.874 |  1.724  0.448  0.360 |  0.690  0.140
## 13                   |  2.924 |  2.080  0.652  0.506 |  0.820  0.197
## 14                   |  4.846 |  3.366  1.707  0.482 |  1.429  0.600
##                        cos2  
## 1                     0.155 |
## 2                     0.009 |
## 3                     0.137 |
## 6                     0.335 |
## 7                     0.152 |
## 9                     0.088 |
## 10                    0.076 |
## 12                    0.058 |
## 13                    0.079 |
## 14                    0.087 |
## 
## Variables (the 10 first)
##                         Dim.1    ctr   cos2    Dim.2    ctr   cos2  
## Alcohol              |  0.264  1.489  0.070 |  0.762 24.176  0.580 |
## Malic_Acid           | -0.509  5.542  0.259 |  0.312  4.056  0.097 |
## Ash                  |  0.009  0.002  0.000 |  0.519 11.243  0.270 |
## Ash_Alcanity         | -0.494  5.215  0.244 |  0.010  0.004  0.000 |
## Magnesium            |  0.294  1.852  0.087 |  0.424  7.500  0.180 |
## Total_Phenols        |  0.842 15.153  0.708 |  0.153  0.972  0.023 |
## Flavanoids           |  0.908 17.627  0.824 |  0.038  0.061  0.001 |
## Nonflavanoid_Phenols | -0.650  9.048  0.423 |  0.118  0.582  0.014 |
## Proanthocyanins      |  0.674  9.729  0.455 |  0.121  0.609  0.015 |
## Color_Intensity      | -0.282  1.704  0.080 |  0.815 27.710  0.665 |

Wow! That's great: we can clearly see how our dimensions are constructed. Once we have fit a model to predict something like, say, a wine rating, we will be able to make some conjectures about which variables tend to give us good scores.

In this case we already have segments and we can now figure out how these segments were created in the first place.

Fit classification model

We can quickly fit a classification model to the feature-scaled data. For simplicity we use basic classification models.

First we test using the full scaled dataset:

Classifier_glm <-
  glm(data = test_set,
  formula = Customer_Segment ~ .)

Classifier_glm
## 
## Call:  glm(formula = Customer_Segment ~ ., data = test_set)
## 
## Coefficients:
##          (Intercept)               Alcohol            Malic_Acid  
##              1.94444              -0.01752               0.07639  
##                  Ash          Ash_Alcanity             Magnesium  
##             -0.14726               0.08934              -0.06844  
##        Total_Phenols            Flavanoids  Nonflavanoid_Phenols  
##             -0.03352              -0.26106              -0.05597  
##      Proanthocyanins       Color_Intensity                   Hue  
##             -0.04627               0.16147              -0.06577  
##                OD280               Proline  
##             -0.22850              -0.19197  
## 
## Degrees of Freedom: 35 Total (i.e. Null);  22 Residual
## Null Deviance:       21.89 
## Residual Deviance: 0.5739    AIC: -16.84
confusion_matrix <- table(test_set$Customer_Segment, Classifier_glm %>% predict %>% round)

confusion_matrix
##    
##      1  2  3
##   1 12  0  0
##   2  0 14  0
##   3  0  0 10
accuracy <- confusion_matrix %>% diag() %>% sum / confusion_matrix %>% sum
data.frame(accuracy = accuracy)
##   accuracy
## 1        1
Classifier_svm <-
  svm(x = test_set,
  type = "C-classification",
  y = test_set$Customer_Segment)

Classifier_svm
## 
## Call:
## svm.default(x = test_set, y = test_set$Customer_Segment, type = "C-classification")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  0.07142857 
## 
## Number of Support Vectors:  28
confusion_matrix <- table(test_set$Customer_Segment, Classifier_svm %>% predict)
confusion_matrix
##    
##      1  2  3
##   1 12  0  0
##   2  0 14  0
##   3  0  0 10
accuracy <- confusion_matrix %>% diag() %>% sum / confusion_matrix %>% sum
data.frame(accuracy = accuracy)
##   accuracy
## 1        1

So without even using any dimensionality reduction it is possible for us to create a model with 100% accuracy at predicting these segments using both a glm and a svm model. That means we will be able to classify any new wines very well using our data and these models.

But can we make this prediction using only 2 variables/components?

Now we repeat testing using only the PCA datasets and the SVM method:

Classifier_svm_facto <-
  svm(x = PCA_factominer[,-3],
  type = "C-classification",
  y = PCA_factominer[,3])

Classifier_svm_facto
## 
## Call:
## svm.default(x = PCA_factominer[, -3], y = PCA_factominer[, 3], 
##     type = "C-classification")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  0.5 
## 
## Number of Support Vectors:  14
confusion_matrix <- table(test_set$Customer_Segment, Classifier_svm_facto %>% predict)
confusion_matrix
##    
##      1  2  3
##   1 11  1  0
##   2  0 14  0
##   3  0  0 10
accuracy <- confusion_matrix %>% diag() %>% sum / confusion_matrix %>% sum
data.frame(accuracy = accuracy)
##    accuracy
## 1 0.9722222
Classifier_svm_caret <-
  svm(x = PCA_caret[,-3],
  type = "C-classification",
  y = PCA_caret[,3])

Classifier_svm_caret
## 
## Call:
## svm.default(x = PCA_caret[, -3], y = PCA_caret[, 3], type = "C-classification")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  0.5 
## 
## Number of Support Vectors:  14
confusion_matrix <- table(test_set$Customer_Segment, Classifier_svm_caret %>% predict)
confusion_matrix
##    
##      1  2  3
##   1 11  1  0
##   2  0 14  0
##   3  0  0 10
accuracy <- confusion_matrix %>% diag() %>% sum / confusion_matrix %>% sum
data.frame(accuracy = accuracy)
##    accuracy
## 1 0.9722222

We can see that reducing the entire dataset to only 2 principal components barely reduced our prediction accuracy (about 97% versus 100%). In other words, we lost almost no predictive power by reducing the number of factors in the dataset using the 2 different PCA methods.

Generally we will have data with multi-level factor variables and continuous variables all mixed together. When we have to reduce the dimensions of these mixed-type tables we should use more advanced methods such as FactoMineR::MFA()

Part 5 - Reinforcement Learning

Reinforcement learning, also referred to as online learning, is a branch of machine learning where we use the observed history (filtration) up to time t to make a decision regarding time t+1. The AI system is rewarded when a desired outcome is reached and penalized otherwise.

We will solve the multi-armed bandit problem to illustrate the use of reinforcement learning; however, there are various other possible applications.

Multi-Armed Bandit Problem

This model assumes that you have many different projects/nodes where you could spend your resources (like a row of slot machines at a casino, hence the name). From the shopping centre's point of view, they want to maximise profits by buying the most profitable set of items. But before they know which items may be the most profitable (which slot machine has the highest returns) they must deal with the tradeoff of exploring new products vs. stocking the products they currently know to be the most profitable.

Traditionally one would use A-B tests if you were purely interested in exploring the different projects/options. By measuring the case vs. control you can say with a certain level of confidence whether one is in fact better than the other. But this does not allow you to maximise your profits while you are exploring the different options. In essence: we want to test the different options whilst simultaneously exploiting the ones we believe to be the best given the information gathered thus far.

Upper Confidence Bound (UCB) method

To employ the UCB method we look at the following metrics:
- Average reward \(\bar{r}_i(n)=\frac{R_i(n)}{N_i(n)}\)
- Confidence interval \(\pm\Delta_i(n)=\sqrt{\frac{3\log(n)}{2N_i(n)}}\)

By using the upper bound of the confidence interval of the mean reward we can prioritise the ads that currently have the most possible headroom for reward while still placing emphasis on the expected reward.

e.g. Imagine we have 2 ads, A and B, that currently have the same average return. Ad A has only 2 trials where we tried using it; ad B has 100 trials where we used it. We are much less sure about how volatile ad A could be, so we will want to prioritise A over B: ad A merits exploration while still offering good exploitation. This way we can balance these opposing aims while still maximising expected reward.
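
The bound itself is simple to compute; a minimal sketch for a single ad (the numbers are made up) shows how few trials widen the interval and so encourage exploration:

ucb_bound <- function(sum_of_rewards_i, n_selections_i, n) {
  average_reward <- sum_of_rewards_i / n_selections_i
  delta_i <- sqrt(3/2 * log(n) / n_selections_i)
  average_reward + delta_i
}

ucb_bound(1, 2, 100)    # few trials: wide bound, worth exploring
ucb_bound(50, 100, 100) # many trials: tight bound around the average reward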

Get the data

CTR stands for click-through rate, which is the behaviour we want to optimize since we want to choose the ad which maximises the number of people clicking on it.

data_raw <-
  data.table::fread("../../static/data/intro_ml/Part6_Reinforcement_learning/UCB/Ads_CTR_Optimisation.csv") %>%
  tbl_df()

data_raw
## # A tibble: 10,000 x 10
##    `Ad 1` `Ad 2` `Ad 3` `Ad 4` `Ad 5` `Ad 6` `Ad 7` `Ad 8` `Ad 9` `Ad 10`
##     <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>   <int>
##  1      1      0      0      0      1      0      0      0      1       0
##  2      0      0      0      0      0      0      0      0      1       0
##  3      0      0      0      0      0      0      0      0      0       0
##  4      0      1      0      0      0      0      0      1      0       0
##  5      0      0      0      0      0      0      0      0      0       0
##  6      1      1      0      0      0      0      0      0      0       0
##  7      0      0      0      1      0      0      0      0      0       0
##  8      1      1      0      0      1      0      0      0      0       0
##  9      0      0      0      0      0      0      0      0      0       0
## 10      0      0      1      0      0      0      0      0      0       0
## # ... with 9,990 more rows

How would we score if we chose an ad at random?

I decided to write a quick closure function that does this:

Random_strategy_fn <- function(data){
  ads_selected <-
    integer(0)
  total_clicks <- 0
  i <- 1

function(data=NULL){
  if (!is.null(data)){
  N <-
    dim(data)[1]
  M <-
    dim(data)[2]

    ad <- sample(1:M,1)
    ads_selected <<- append(ads_selected, ad)
    total_clicks <<- total_clicks + flatten_dbl(data[i,ad])
    i <<- i+1
  } else {
    return(list(
                ads_selected = ads_selected,
                total_clicks = total_clicks
                )
    )
  }
}  

}

Now we call the function to choose random ads and compare them to the simulated data:

This will call our function that tries ads at random and scores them based on how well they did.

Score_random_add_selection <- Random_strategy_fn(data_raw)

for(i in 1:nrow(data_raw)){

Score_random_add_selection(data_raw)
}

# Score_random_add_selection() %>% head

So when we simply pick ads at random we get around ~1300 hits (with some random variation).

Confirm that the ads were selected at random:

# Score_random_add_selection()$ads_selected %>% hist()
Score_random_add_selection()$ads_selected %>% qplot()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Can we improve our ad campaign by learning which ads are getting hits while the ads are being served live?

Improve results using UCB

Again, instead of simply executing a for loop, I am going to construct another lexical closure function so that the function itself can be reused in the future on any similar dataset without having to re-learn the process.

The main benefit I see in doing it like this is that the function can be continuously called with live data and its results will keep updating, something that would otherwise be impossible if the loops were called explicitly. To do this one would simply have to adapt the way the function is called (each time new data is received).

Create a little closure function:

# dataset <- data_raw %>% data.frame
# dataset <- data_raw

Create_UCB_Model_fn <- function(data){

  data <- data %>% data.frame

# State captured by the closure: 10 000 rounds, 10 ads, and the running tallies
N <- 10000
d <- 10
ads_selected <- integer(0)
numbers_of_selections <- integer(d)
sums_of_rewards <- integer(d)
total_reward <- 0

    function(data=NULL){
    if(!is.null(data)){
      for (n in 1:N) {
        ad <- 0
        max_upper_bound <- 0
        # Compute the upper confidence bound for each ad and keep the largest
        for (i in 1:d) {
          if (numbers_of_selections[i] > 0) {
            average_reward <- sums_of_rewards[i] / numbers_of_selections[i]
            delta_i <- sqrt(3/2 * log(n) / numbers_of_selections[i])
            upper_bound <- average_reward + delta_i
          } else {
              # An ad that has never been shown gets an effectively infinite bound
              upper_bound <- 1e400
          }
          if (upper_bound > max_upper_bound) {
            max_upper_bound <- upper_bound
            ad <- i
          }
        }
        # Show the chosen ad to user n and record the observed reward (click)
        ads_selected <<- append(ads_selected, ad)
        numbers_of_selections[ad] <<- numbers_of_selections[ad] + 1
        reward <- data[n, ad]
        sums_of_rewards[ad] <<- sums_of_rewards[ad] + reward
        total_reward <<- total_reward + reward
      }
    } else {
      # Called without data: return the accumulated results
      return(list(
        ads_selected = ads_selected,
        numbers_of_selections = numbers_of_selections,
        sums_of_rewards = sums_of_rewards,
        total_reward = total_reward
      ))
    }
  }
}

Now we initialize and call this function on all the new data:

data_raw_df <- data_raw %>% data.frame
# Score_UCB_add_selection <- UCB_strategy_fn(data_raw)
Score_UCB_add_selection <- Create_UCB_Model_fn(data_raw_df)

# for(i in 1:(dim(data_raw)[1]*dim(data_raw)[2])){
Score_UCB_add_selection(data.frame(data_raw))
# }

# Score_UCB_add_selection() %>% head
# Score_UCB_add_selection()$numbers_of_selections

Explaining the output:

*numbers_of_selections* refers to the number of times the model decided to display each ad.
*sums_of_rewards* refers to the total number of clicks each ad received when it was displayed.

Visualize the model add selection

Score_UCB_add_selection()$ads_selected %>% qplot()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Here we can see the massive difference between simply selecting an ad at random and using a reinforcement learning algorithm. Clearly the algorithm very quickly converged to the 5th ad out of the 10 ads. This means that on average the people in this test group clearly preferred ad 5 over the other ads. This is further supported by the fact that the UCB method gave us almost twice as many hits.

Of course, to reiterate, we could've come to the same conclusion using an A-B test and then simply used ad 5 only. But using this method we can now safely decide on using ad 5 whilst also having optimized the return during the trial phase. This is even more handy when we have many new users that all need to be tested (since this can be unsupervised and successive A-B tests would be too expensive in time/resources).

Part 6 - Parameter Grid Search, Cross-validation and Boosting

Load data:

social_network_ads_data <-
  data.table::fread("../../static/data/intro_ml/Part_10_Boosting/Model_Selection/Social_Network_Ads.csv")

social_network_ads_data %>% head
##     User ID Gender Age EstimatedSalary Purchased
## 1: 15624510   Male  19           19000         0
## 2: 15810944   Male  35           20000         0
## 3: 15668575 Female  26           43000         0
## 4: 15603246 Female  27           57000         0
## 5: 15804002   Male  19           76000         0
## 6: 15728773   Male  27           58000         0

Cross validation

The purpose of cross-validation is to see the average performance of the model across different folds of the data.

Intuition

Cross-validation is essential when the practitioner believes the modelling algorithm being employed is at risk of overfitting the data. In these cases cross-validation splits the data into different folds. By holding each fold out as the testing set, the model's out-of-sample accuracy can be tested on every fold while still leveraging the rest of the data, and the performance is then averaged over these non-independent folds.

In practice it would be better to run multiple independent experiments, but this would be time-consuming and we don't have the time or money to run multiple experiments.

How to

This can be interpreted via the confusion matrix. A perfect accuracy score would yield a confusion matrix where all entries are on the diagonal (i.e. whenever we predicted x_hat it actually was x).

The confusion matrix can be derived manually using the table function. When used on a single vector the table function behaves like a group-by count(*). When used on two vectors it will count the overlaps between the unique items in both sets.
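
A tiny illustration of that behaviour (a sketch using iris; the "predictions" here are just a random shuffle of the labels):

table(iris$Species)
table(actual = iris$Species, predicted = sample(iris$Species))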

library(caret)
library(e1071)

# a bottom up approach
folds = createFolds(iris$Species, k = 10)
cv = lapply(folds, function(x){
  training_fold = iris[-x,]
  test_fold = iris[x,]
  classifier = svm(formula = Species ~ .,
                   data = training_fold,
                   type = 'C-classification',
                   kernel = 'radial'
                   )
  y_pred = predict(classifier,newdata = test_fold[,-5])
  outp = table(y_pred,test_fold$Species)
  # confusionMatrix(classifier)
})

This kind of bottom-up approach illustrates how one might consider doing this. BUT this is arduous at best, so let's use more advanced methods:

# a more correct approach using built methods
  classifier = svm(formula = Species ~ .,
                   data = iris,
                   type = 'C-classification',
                   kernel = 'radial',
                   cross=10
                   )
  y_pred = predict(classifier,newdata = iris[,-5])
  outp = table(y_pred,iris$Species)
  outp
##             
## y_pred       setosa versicolor virginica
##   setosa         50          0         0
##   versicolor      0         48         2
##   virginica       0          2        48

The svm function has a built-in cross-validation option; that's great! But what if I want a workflow where I can use CV with any model? That's why we use caret:

# A much better approach yet using the caret workbench
cvCtrl = caret::trainControl(method = "cv",
                             number = 10,
                             classProbs =  TRUE
                             )

caret_svm_model <-
      caret::train(Species~.,
                   data = iris,
                   method = "svmLinear3",
                   trControl = cvCtrl
                   )
## Warning in train.default(x, y, weights = w, ...): Class probabilities were
## requested for a model that does not implement them
caret_svm_model %>% plot

varImp(caret_svm_model) %>% plot

caret_svm_model$results
##   cost Loss  Accuracy Kappa AccuracySD    KappaSD
## 1 0.25   L1 0.9600000  0.94 0.04661373 0.06992059
## 2 0.25   L2 0.9000000  0.85 0.06478835 0.09718253
## 3 0.50   L1 0.9600000  0.94 0.04661373 0.06992059
## 4 0.50   L2 0.9466667  0.92 0.06126244 0.09189366
## 5 1.00   L1 0.9533333  0.93 0.04499657 0.06749486
## 6 1.00   L2 0.9400000  0.91 0.05837300 0.08755950
confusionMatrix(caret_svm_model)
## Cross-Validated (10 fold) Confusion Matrix 
## 
## (entries are percentual average cell counts across resamples)
##  
##             Reference
## Prediction   setosa versicolor virginica
##   setosa       33.3        0.0       0.0
##   versicolor    0.0       30.7       1.3
##   virginica     0.0        2.7      32.0
##                           
##  Accuracy (average) : 0.96

Using the caret workbench we can leverage many powerful functions and a more robust backend. Thus we get variable importance, properly executed cross-validation in the presence of other pre-processing operations, and performance options such as parallel computing.

Grid Search and Parameter Tuning

In machine learning applications there are 2 types of parameters:
- Learned parameters that the machine learning algorithm solves for itself (e.g. coefficients in a regression model)
- Hyper-parameters that are chosen by the ML practitioner, such as learning rate, tree size, test-train split etc.

One can improve the performance of a model by finding the optimal combination of hyper-parameters by scanning through the parameter space. Generally speaking there are methods designed to improve the efficiency of finding these optima, such as stochastic gradient descent. These and more advanced methods are especially employed in deep-learning models.

As a basic starting point we can do a simple grid search to scan the hyper-parameter space. The idea here is to take every possible combination of the parameter values in question and run the model for each one until the optimal parameters are found according to a scoring metric such as accuracy.

Let’s repeat the classification using svm but now we try to find the optimal hyper-parameters

library(caret)

grid <- expand.grid(cost=seq(0.1,0.9,length.out = 10), Loss=seq(0.1,1,length.out = 10))

cvCtrl = caret::trainControl(method = "cv",
                             number = 10,
                             savePredictions = T)

caret_svm_model <-
      caret::train(data = iris, Species~.,
                   method = "svmLinear3",
                   trControl = cvCtrl,
                   tuneGrid=grid
                   )

caret_svm_model %>% plot()

varImp(caret_svm_model) %>% plot

OK, let's see how all these parameter values were tested and the results (large table, all combinations):

caret_svm_model$results
##          cost Loss  Accuracy Kappa AccuracySD    KappaSD
## 1   0.1000000  0.1 0.9600000  0.94 0.05621827 0.08432740
## 2   0.1000000  0.2 0.9600000  0.94 0.05621827 0.08432740
## 3   0.1000000  0.3 0.9600000  0.94 0.05621827 0.08432740
## 4   0.1000000  0.4 0.9600000  0.94 0.05621827 0.08432740
## 5   0.1000000  0.5 0.9600000  0.94 0.05621827 0.08432740
## 6   0.1000000  0.6 0.9600000  0.94 0.05621827 0.08432740
## 7   0.1000000  0.7 0.9600000  0.94 0.05621827 0.08432740
## 8   0.1000000  0.8 0.9600000  0.94 0.05621827 0.08432740
## 9   0.1000000  0.9 0.9600000  0.94 0.05621827 0.08432740
## 10  0.1000000  1.0 0.9600000  0.94 0.05621827 0.08432740
## 11  0.1888889  0.1 0.9600000  0.94 0.05621827 0.08432740
## 12  0.1888889  0.2 0.9600000  0.94 0.05621827 0.08432740
## 13  0.1888889  0.3 0.9600000  0.94 0.05621827 0.08432740
## 14  0.1888889  0.4 0.9600000  0.94 0.05621827 0.08432740
## 15  0.1888889  0.5 0.9600000  0.94 0.05621827 0.08432740
## 16  0.1888889  0.6 0.9600000  0.94 0.05621827 0.08432740
## 17  0.1888889  0.7 0.9600000  0.94 0.05621827 0.08432740
## 18  0.1888889  0.8 0.9600000  0.94 0.05621827 0.08432740
## 19  0.1888889  0.9 0.9600000  0.94 0.05621827 0.08432740
## 20  0.1888889  1.0 0.9600000  0.94 0.05621827 0.08432740
## 21  0.2777778  0.1 0.9533333  0.93 0.06324555 0.09486833
## 22  0.2777778  0.2 0.9533333  0.93 0.06324555 0.09486833
## 23  0.2777778  0.3 0.9533333  0.93 0.06324555 0.09486833
## 24  0.2777778  0.4 0.9533333  0.93 0.06324555 0.09486833
## 25  0.2777778  0.5 0.9533333  0.93 0.06324555 0.09486833
## 26  0.2777778  0.6 0.9533333  0.93 0.06324555 0.09486833
## 27  0.2777778  0.7 0.9533333  0.93 0.06324555 0.09486833
## 28  0.2777778  0.8 0.9533333  0.93 0.06324555 0.09486833
## 29  0.2777778  0.9 0.9533333  0.93 0.06324555 0.09486833
## 30  0.2777778  1.0 0.9533333  0.93 0.06324555 0.09486833
## 31  0.3666667  0.1 0.9533333  0.93 0.05488484 0.08232726
## 32  0.3666667  0.2 0.9533333  0.93 0.05488484 0.08232726
## 33  0.3666667  0.3 0.9533333  0.93 0.05488484 0.08232726
## 34  0.3666667  0.4 0.9533333  0.93 0.05488484 0.08232726
## 35  0.3666667  0.5 0.9533333  0.93 0.05488484 0.08232726
## 36  0.3666667  0.6 0.9533333  0.93 0.05488484 0.08232726
## 37  0.3666667  0.7 0.9533333  0.93 0.05488484 0.08232726
## 38  0.3666667  0.8 0.9533333  0.93 0.05488484 0.08232726
## 39  0.3666667  0.9 0.9533333  0.93 0.05488484 0.08232726
## 40  0.3666667  1.0 0.9533333  0.93 0.05488484 0.08232726
## 41  0.4555556  0.1 0.9533333  0.93 0.05488484 0.08232726
## 42  0.4555556  0.2 0.9533333  0.93 0.05488484 0.08232726
## 43  0.4555556  0.3 0.9533333  0.93 0.05488484 0.08232726
## 44  0.4555556  0.4 0.9533333  0.93 0.05488484 0.08232726
## 45  0.4555556  0.5 0.9533333  0.93 0.05488484 0.08232726
## 46  0.4555556  0.6 0.9533333  0.93 0.05488484 0.08232726
## 47  0.4555556  0.7 0.9533333  0.93 0.05488484 0.08232726
## 48  0.4555556  0.8 0.9533333  0.93 0.05488484 0.08232726
## 49  0.4555556  0.9 0.9533333  0.93 0.05488484 0.08232726
## 50  0.4555556  1.0 0.9533333  0.93 0.05488484 0.08232726
## 51  0.5444444  0.1 0.9466667  0.92 0.06126244 0.09189366
## 52  0.5444444  0.2 0.9466667  0.92 0.06126244 0.09189366
## 53  0.5444444  0.3 0.9466667  0.92 0.06126244 0.09189366
## 54  0.5444444  0.4 0.9466667  0.92 0.06126244 0.09189366
## 55  0.5444444  0.5 0.9466667  0.92 0.06126244 0.09189366
## 56  0.5444444  0.6 0.9466667  0.92 0.06126244 0.09189366
## 57  0.5444444  0.7 0.9466667  0.92 0.06126244 0.09189366
## 58  0.5444444  0.8 0.9466667  0.92 0.06126244 0.09189366
## 59  0.5444444  0.9 0.9466667  0.92 0.06126244 0.09189366
## 60  0.5444444  1.0 0.9466667  0.92 0.06126244 0.09189366
## 61  0.6333333  0.1 0.9466667  0.92 0.06126244 0.09189366
## 62  0.6333333  0.2 0.9466667  0.92 0.06126244 0.09189366
## 63  0.6333333  0.3 0.9466667  0.92 0.06126244 0.09189366
## 64  0.6333333  0.4 0.9466667  0.92 0.06126244 0.09189366
## 65  0.6333333  0.5 0.9466667  0.92 0.06126244 0.09189366
## 66  0.6333333  0.6 0.9466667  0.92 0.06126244 0.09189366
## 67  0.6333333  0.7 0.9466667  0.92 0.06126244 0.09189366
## 68  0.6333333  0.8 0.9466667  0.92 0.06126244 0.09189366
## 69  0.6333333  0.9 0.9466667  0.92 0.06126244 0.09189366
## 70  0.6333333  1.0 0.9466667  0.92 0.06126244 0.09189366
## 71  0.7222222  0.1 0.9466667  0.92 0.06126244 0.09189366
## 72  0.7222222  0.2 0.9466667  0.92 0.06126244 0.09189366
## 73  0.7222222  0.3 0.9466667  0.92 0.06126244 0.09189366
## 74  0.7222222  0.4 0.9466667  0.92 0.06126244 0.09189366
## 75  0.7222222  0.5 0.9466667  0.92 0.06126244 0.09189366
## 76  0.7222222  0.6 0.9466667  0.92 0.06126244 0.09189366
## 77  0.7222222  0.7 0.9466667  0.92 0.06126244 0.09189366
## 78  0.7222222  0.8 0.9466667  0.92 0.06126244 0.09189366
## 79  0.7222222  0.9 0.9466667  0.92 0.06126244 0.09189366
## 80  0.7222222  1.0 0.9466667  0.92 0.06126244 0.09189366
## 81  0.8111111  0.1 0.9466667  0.92 0.06126244 0.09189366
## 82  0.8111111  0.2 0.9466667  0.92 0.06126244 0.09189366
## 83  0.8111111  0.3 0.9466667  0.92 0.06126244 0.09189366
## 84  0.8111111  0.4 0.9466667  0.92 0.06126244 0.09189366
## 85  0.8111111  0.5 0.9466667  0.92 0.06126244 0.09189366
## 86  0.8111111  0.6 0.9466667  0.92 0.06126244 0.09189366
## 87  0.8111111  0.7 0.9466667  0.92 0.06126244 0.09189366
## 88  0.8111111  0.8 0.9466667  0.92 0.06126244 0.09189366
## 89  0.8111111  0.9 0.9466667  0.92 0.06126244 0.09189366
## 90  0.8111111  1.0 0.9466667  0.92 0.06126244 0.09189366
## 91  0.9000000  0.1 0.9466667  0.92 0.06126244 0.09189366
## 92  0.9000000  0.2 0.9466667  0.92 0.06126244 0.09189366
## 93  0.9000000  0.3 0.9466667  0.92 0.06126244 0.09189366
## 94  0.9000000  0.4 0.9466667  0.92 0.06126244 0.09189366
## 95  0.9000000  0.5 0.9466667  0.92 0.06126244 0.09189366
## 96  0.9000000  0.6 0.9466667  0.92 0.06126244 0.09189366
## 97  0.9000000  0.7 0.9466667  0.92 0.06126244 0.09189366
## 98  0.9000000  0.8 0.9466667  0.92 0.06126244 0.09189366
## 99  0.9000000  0.9 0.9466667  0.92 0.06126244 0.09189366
## 100 0.9000000  1.0 0.9466667  0.92 0.06126244 0.09189366
confusionMatrix(caret_svm_model)
## Cross-Validated (10 fold) Confusion Matrix 
## 
## (entries are percentual average cell counts across resamples)
##  
##             Reference
## Prediction   setosa versicolor virginica
##   setosa       33.3        0.0       0.0
##   versicolor    0.0       30.0       0.7
##   virginica     0.0        3.3      32.7
##                           
##  Accuracy (average) : 0.96

XGBoost

XGBoost is an algorithm that has recently been dominating the machine learning literature where tabular or structured data is involved.

It is an implementation of gradient-boosted decision trees designed for speed and performance.

In general, XGBoost can be seen as a stricter (more regularised) and more heavily optimised version of the GBM model in R.

data_XGB =
  # data.table::fread("Part_10_Boosting/XGBoost/Churn_Modelling.csv")
  readr::read_csv("../../static/data/intro_ml/Part_10_Boosting/XGBoost/Churn_Modelling.csv")
## Parsed with column specification:
## cols(
##   RowNumber = col_integer(),
##   CustomerId = col_integer(),
##   Surname = col_character(),
##   CreditScore = col_integer(),
##   Geography = col_character(),
##   Gender = col_character(),
##   Age = col_integer(),
##   Tenure = col_integer(),
##   Balance = col_double(),
##   NumOfProducts = col_integer(),
##   HasCrCard = col_integer(),
##   IsActiveMember = col_integer(),
##   EstimatedSalary = col_double(),
##   Exited = col_integer()
## )
data_XGB %>% head
## # A tibble: 6 x 14
##   RowNumber CustomerId Surname  CreditScore Geography Gender   Age Tenure
##       <int>      <int> <chr>          <int> <chr>     <chr>  <int>  <int>
## 1         1   15634602 Hargrave         619 France    Female    42      2
## 2         2   15647311 Hill             608 Spain     Female    41      1
## 3         3   15619304 Onio             502 France    Female    42      8
## 4         4   15701354 Boni             699 France    Female    39      1
## 5         5   15737888 Mitchell         850 Spain     Female    43      2
## 6         6   15574012 Chu              645 Spain     Male      44      8
## # ... with 6 more variables: Balance <dbl>, NumOfProducts <int>,
## #   HasCrCard <int>, IsActiveMember <int>, EstimatedSalary <dbl>,
## #   Exited <int>

Pre-process the data

data_XGB = data_XGB[,4:14]

#encode categorical variables as factors
data_XGB <-
  data_XGB %>%
  mutate(Geography = factor(Geography,
                            levels = c('France','Spain','Germany'),
                            labels = c(1,2,3)) %>% as.numeric(),
         Gender = factor(Gender,
                         levels = c('Female','Male'),
                         labels = c(1,2)) %>% as.numeric()
         )

library(caTools)
set.seed(123)

split = sample.split(data_XGB$Exited, SplitRatio = 0.8)
train = data_XGB[split,]
test = data_XGB[!split,]

No feature scaling is required here; tree-based models such as XGBoost are not sensitive to the scale of the features.

Train model:
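The training call itself is not echoed here. A minimal sketch of how classifier_XGB could be fitted with the xgboost package (nrounds = 10 is an assumed value) looks like this; note that xgboost’s default objective is squared-error regression, which is why it prints a training RMSE even for this classification problem.

library(xgboost)

# fit the boosted-tree model on the numeric training matrix
classifier_XGB <-
  xgboost(data = as.matrix(train[,-11]),  # all predictors (the target is column 11)
          label = train$Exited,           # the target vector
          nrounds = 10)                   # number of boosting rounds (assumed value)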

Manually calculate some accuracy statistics, since this library does not seem to ship the relevant summary functions:

y_pred = predict(classifier_XGB, newdata = as.matrix(test[,-11])) >=0.5
y_actual = test[,11] %>% flatten_dbl()

cm = table(y_pred,y_actual)
cm
##        y_actual
## y_pred     0    1
##   FALSE 1524  210
##   TRUE    69  197
accuracy =  (cm[1,1] + cm[2,2]) / (cm[1,1] + cm[2,2] + cm[1,2] + cm[2,1]) #sum of the diagonal over the sum of all predictions

# in other words we could've used:
accuracy_ = sum(diag(cm)) / (test %>% nrow)
#since this can be used for any arbitrary number of categories
accuracy_
## [1] 0.8605

So, interestingly: if I increase the nrounds parameter I can drive the reported training error (the RMSE that xgboost prints) down by a lot, especially since the algorithm is so fast. However, this actually decreases the measured accuracy of the model on the test set. That’s interesting; also, why does the model report an RMSE at all for a classification problem, if it is not a relevant measure of accuracy?

Let’s apply our grid-search approach to find a better value for this parameter:

Manual grid-search approach (1 parameter)

For multiple parameters, use expand.grid together with pmap (this becomes exponentially more taxing as parameters are added); a single-parameter sketch follows.
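A sketch of what such a manual search over one parameter could look like (the nrounds values below are arbitrary illustrations; accuracy is computed the same way as above):

# candidate values for the nrounds hyper-parameter (arbitrary illustration)
nrounds_grid <- c(10, 50, 100, 500, 1000)

grid_accuracy <-
  purrr::map_dbl(nrounds_grid, function(n) {
    model <- xgboost::xgboost(data = as.matrix(train[,-11]),
                              label = train$Exited,
                              nrounds = n,
                              verbose = 0)
    y_hat <- predict(model, newdata = as.matrix(test[,-11])) >= 0.5
    cm <- table(y_hat, test$Exited)
    sum(diag(cm)) / nrow(test)            # test-set accuracy for this nrounds value
  })

data.frame(nrounds = nrounds_grid, accuracy = grid_accuracy)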

Fascinating: the actual accuracy goes down the more iterations you use… I expected some sort of cut-off where overfitting kicks in, but it is entirely a downward trend. This casts a lot of doubt on the validity of how the model is being used; not because the model is bad, but because it may be being used incorrectly by the people in this course. Alas, I don’t know enough about this model to really say what’s wrong here, but it’s suspicious.

The model does, however, converge at about 1300 iterations with an accuracy of roughly ~0.85 and a reported RMSE of 0.004094. Interestingly, the maximum accuracy of ~0.87 comes with a model-reported RMSE of ~0.5, which could indicate a few outliers skewing the fit when it is optimised on RMSE.

If this is true, the boosting procedure performs extremely well after only a single round. That implies either that the default starting parameters for the tree-based search happen to be pretty good, or that the optimiser arrived at a good solution after only a single iteration.