- understanding the nature decision tree algorithm
- estimating decision tree using
`rpart`

function - making predictions based on a decision tree
- over-fitting
`randomForest`

Let's first apply decision trees to a toy data set. We have two variables: `Insp`

which can be either `ok`

or `fraud`

, and `rel_uprice`

which we will use to classify `Insp`

as either `ok`

or `fraud`

.

```
library(tidyverse)
library(caret)
toy <- data.frame(Insp=c("ok","ok","ok", "fraud","ok", "fraud"),
rel_uprice=c(1,2,2,90,100,110))
toy
```

```
## Insp rel_uprice
## 1 ok 1
## 2 ok 2
## 3 ok 2
## 4 fraud 90
## 5 ok 100
## 6 fraud 110
```

Decision tree algorithm uses the predictor variable, `rel_uprice`

to split data set into parts such that each part is relatively homogeneous in terms of the classes it contains. Our data is already sorted by relative unit price. Notice that if we split the data into observations where unit price is greater than 2 we get a split where one part is completely homogeneous (all `ok`

transactions) and the other part is two thirds fraudulent. So, a reasonable decision tree would be to classify all transactions that have price less than, say, 46 as `ok`

and others as `fraud`

.

Let's see what the `rpart`

algorithm thinks. `rpart`

is one of a number of decision tree algorithms available for R. (Lantz discusses `C5.0`

). I like `rpart`

because it has nice visualization feature, `rpart.plot`

described here. Plus, its syntax is consistent with other methods such as `randomForest`

. The first argument to `rpart`

is a formula in the familiar format of `Y ~ X1 + X2`

where `Y`

is the variable we are trying to predict, `X1`

, `X2`

are predictor variables. The second argument is data, and third contains options that let us control the complexity of the tree. For example, `minsplit`

specifies the minimum number of observations for the algorithm to attempt a split. The lower the number, the more splits will be attempted, and the larger, more complex the tree.

```
library(rpart)
library(rpart.plot)
tree <- rpart(Insp ~ rel_uprice, data=toy, minsplit=3)
summary(tree)
```

```
## Call:
## rpart(formula = Insp ~ rel_uprice, data = toy, minsplit = 3)
## n= 6
##
## CP nsplit rel error xerror xstd
## 1 0.50 0 1.0 1.0 0.5773503
## 2 0.01 1 0.5 1.5 0.6123724
##
## Variable importance
## rel_uprice
## 100
##
## Node number 1: 6 observations, complexity param=0.5
## predicted class=ok expected loss=0.3333333 P(node) =1
## class counts: 2 4
## probabilities: 0.333 0.667
## left son=2 (3 obs) right son=3 (3 obs)
## Primary splits:
## rel_uprice < 46 to the right, improve=1.333333, (0 missing)
##
## Node number 2: 3 observations
## predicted class=fraud expected loss=0.3333333 P(node) =0.5
## class counts: 2 1
## probabilities: 0.667 0.333
##
## Node number 3: 3 observations
## predicted class=ok expected loss=0 P(node) =0.5
## class counts: 0 3
## probabilities: 0.000 1.000
```

`rpart.plot(tree)`

Well, we were right. The algorithm says that if price is greater than or equal to 46, the transaction is classified as `fraud`

, if it is less than 46, the transaction is classified as `ok`

. Thus, we have a tree with two branches. In the left branch observations are classified as `fraud`

.

In each node we have three pieces of information. First, we see the class label telling us how the observation in that node are classified (`ok`

vs `fraud`

). Second, we see the probability of the class being predicted (in our case `ok`

) in that branch of the tree - it is the fraction of `ok`

observations in that branch. For example, in the left branch (labeled `fraud`

) the probability of `ok`

transaction is 33%. In the right branch (labeled `ok`

) the probability of `ok`

transaction is 100%. The last number is the percentage of observations in that node. In our case the observations are split 50/50 between the two nodes.

IN-CLASS EXERCISE 1: Take a look at the data below. What do you think the decision tree should look like if we use `rel_uprice`

and `missing`

as predictors? How did the algorithm handle the missing values of `rel_uprice`

? How would the logistic regression handle it?

IN-CLASS EXERCISE 2: Suppose `rel_uprice`

was constant. What would the algorithm do?

Suppose we have three new transactions where we don't know whether they are `ok`

or `fraud`

, but we know the `rel_uprice`

associated with each transaction. We can have function `predict()`

plug these values into our original decision tree, `tree`

and predict will tell us how the tree classified these transactions:

```
test <- data.frame(rel_uprice=c(200,20,NA))
test$Insp_predicted <- predict(tree, test, type="class")
test
```

```
## rel_uprice Insp_predicted
## 1 200 fraud
## 2 20 ok
## 3 NA ok
```

Let's load data on sales reports from lab 11. Here we will use only transactions that have been inspected. Let's turn `Insp`

into a factor because that is what `rpart`

will be looking for. Let's also make `missing`

a factor since that is what `randomForest`

will be looking for in section 7.

```
inspected <- read_csv("inspected.csv")
inspected$Insp <- as.factor(inspected$Insp)
inspected$missing <- as.factor(inspected$missing)
```

Let's now create a training and testing data sets. We will use function `sample`

as we did in the past. We will use at 80/20 split.

```
set.seed(364) #set the seed for reproducibility
sample <- sample(nrow(inspected), round(nrow(inspected)*0.8))
train <- inspected[sample,]
test <- inspected[-sample,]
```

Since fraudulent transaction are pretty infrequent, and our inspected data set is not particularly large, we should check that we have roughly the same proportion of fraudulent transactions in both test and train data sets.

`prop.table(table(train$Insp))`

```
##
## fraud ok
## 0.08072461 0.91927539
```

`prop.table(table(test$Insp))`

```
##
## fraud ok
## 0.08073744 0.91926256
```

Let's use `rel_uprice`

and `missing`

as predictors of `Insp`

. We will *not* include the `minsplit`

option as we did above because we are no longer working with a "toy" data set. This leaves it at the default value of 20.

```
tree <- rpart(Insp ~ rel_uprice + missing, data=train)
summary(tree)
```

```
## Call:
## rpart(formula = Insp ~ rel_uprice + missing, data = train)
## n= 12586
##
## CP nsplit rel error xerror xstd
## 1 0.29232283 0 1.0000000 1.0000000 0.03007987
## 2 0.23425197 1 0.7076772 0.7096457 0.02566043
## 3 0.02854331 2 0.4734252 0.4881890 0.02148406
## 4 0.01000000 3 0.4448819 0.4596457 0.02087153
##
## Variable importance
## rel_uprice missing
## 95 5
##
## Node number 1: 12586 observations, complexity param=0.2923228
## predicted class=ok expected loss=0.08072461 P(node) =1
## class counts: 1016 11570
## probabilities: 0.081 0.919
## left son=2 (459 obs) right son=3 (12127 obs)
## Primary splits:
## rel_uprice < -154.9215 to the left, improve=531.17180, (149 missing)
## missing splits as RRRL, improve= 48.77332, (0 missing)
##
## Node number 2: 459 observations
## predicted class=fraud expected loss=0.1764706 P(node) =0.03646909
## class counts: 378 81
## probabilities: 0.824 0.176
##
## Node number 3: 12127 observations, complexity param=0.234252
## predicted class=ok expected loss=0.05260988 P(node) =0.9635309
## class counts: 638 11489
## probabilities: 0.053 0.947
## left son=6 (342 obs) right son=7 (11785 obs)
## Primary splits:
## rel_uprice < 180.2213 to the right, improve=450.0984, (149 missing)
## missing splits as RRRL, improve= 52.1244, (0 missing)
##
## Node number 6: 342 observations
## predicted class=fraud expected loss=0.1520468 P(node) =0.02717305
## class counts: 290 52
## probabilities: 0.848 0.152
##
## Node number 7: 11785 observations, complexity param=0.02854331
## predicted class=ok expected loss=0.02952906 P(node) =0.9363579
## class counts: 348 11437
## probabilities: 0.030 0.970
## left son=14 (35 obs) right son=15 (11750 obs)
## Primary splits:
## missing splits as RRRL, improve=54.95882, (0 missing)
## rel_uprice < 156.4715 to the right, improve=39.57389, (149 missing)
##
## Node number 14: 35 observations
## predicted class=fraud expected loss=0.08571429 P(node) =0.002780868
## class counts: 32 3
## probabilities: 0.914 0.086
##
## Node number 15: 11750 observations
## predicted class=ok expected loss=0.02689362 P(node) =0.933577
## class counts: 316 11434
## probabilities: 0.027 0.973
```

`rpart.plot(tree)`

We grew a tree with 4 branches. It tells us that if price is less than negative 155, classify transaction as `fraud`

. If not, ask if price is greater than or equal to 180. If yes, classify a transaction as `fraud`

. If not, ask if `missing`

is equal to "Val missing". If yes, classify transaction as `fraud`

, or else classify transaction as `ok`

. In short, if the price is really high or really low, transaction is a `fraud`

. Otherwise transaction is `fraud`

only if `Val`

is missing. That is a pretty reasonable decision tree.

Let's examine the information in the output. In the leftmost bottom bucket we see that 4% of transactions have price below negative 155. They are all classified as `fraud`

even though 17% of these transactions are actually `ok`

. In the rightmost bottom bucket we see that about 94% of transactions in the `train`

data are classified as `ok`

. Of those transactions, 97% are actually `ok`

.

Let's see how it actually does in classifying transactions in the `test`

data set.

```
test$Insp_pred <- predict(tree, test, type="class")
confusionMatrix(test$Insp_pred, test$Insp)
```

```
## Confusion Matrix and Statistics
##
## Reference
## Prediction fraud ok
## fraud 166 31
## ok 88 2861
##
## Accuracy : 0.9622
## 95% CI : (0.9549, 0.9686)
## No Information Rate : 0.9193
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7161
##
## Mcnemar's Test P-Value : 2.844e-07
##
## Sensitivity : 0.65354
## Specificity : 0.98928
## Pos Pred Value : 0.84264
## Neg Pred Value : 0.97016
## Prevalence : 0.08074
## Detection Rate : 0.05277
## Detection Prevalence : 0.06262
## Balanced Accuracy : 0.82141
##
## 'Positive' Class : fraud
##
```

We achieved 96% accuracy which is better than 92% which we would get if declared every transaction as `ok`

. Amazingly, Kappa is 0.72 suggesting that our predictors make a lot of difference.

IN-CLASS EXERCISE 3: Estimate decision tree using only `rel_uprice`

as the sole predictor. Does the tree do better than our model above?

IN-CLASS EXERCISE 4: Estimate decision tree using only `Val`

and `Quant`

as predictors. Does the tree do better than the our model above?

IN-CLASS EXERCISE 5: Estimate a logistic regression using `rel_uprice`

as a predictor. Use 0.93 probablity as a cutoff for classifying transaction as `ok`

.How does this mode do?

```
logit <- glm(Insp ~ rel_uprice, family="binomial", data=train)
summary(logit)
```

```
##
## Call:
## glm(formula = Insp ~ rel_uprice, family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4281 0.3680 0.4001 0.4168 0.4927
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.4686713 0.0336694 73.321 < 2e-16 ***
## rel_uprice 0.0021298 0.0003983 5.347 8.95e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6752.1 on 12436 degrees of freedom
## Residual deviance: 6723.3 on 12435 degrees of freedom
## (149 observations deleted due to missingness)
## AIC: 6727.3
##
## Number of Fisher Scoring iterations: 5
```

```
test$Insp_pred_logit_prob <- predict(logit, test, type="response")
test$Insp_pred_logit_class <- as.factor(ifelse(test$Insp_pred_logit_prob>0.93,"ok","fraud"))
summary(test)
```

```
## ID Prod Quant Val
## Length:3146 Length:3146 Min. : 100 Min. : 1005
## Class :character Class :character 1st Qu.: 111 1st Qu.: 1251
## Mode :character Mode :character Median : 434 Median : 12698
## Mean : 42093 Mean : 63518
## 3rd Qu.: 5766 3rd Qu.: 54239
## Max. :14645544 Max. :4616735
## NA's :26 NA's :12
## Insp av_uprice med_uprice rel_uprice
## fraud: 254 Min. : 0.0616 Min. : 0.0174 Min. :-199.34926
## ok :2892 1st Qu.: 9.6124 1st Qu.: 8.4807 1st Qu.: -41.62321
## Median : 16.0065 Median : 12.1429 Median : 0.00506
## Mean : 22.9048 Mean : 15.3369 Mean : 13.79070
## 3rd Qu.: 21.5857 3rd Qu.: 16.3366 3rd Qu.: 75.65502
## Max. :684.3182 Max. :329.3137 Max. : 199.54365
## NA's :37
## missing Insp_pred Insp_pred_logit_prob Insp_pred_logit_class
## both missing : 1 fraud: 197 Min. :0.8853 fraud:2166
## no missing :3109 ok :2949 1st Qu.:0.9153 ok : 943
## Quant missing: 25 Median :0.9219 NA's : 37
## Val missing : 11 Mean :0.9231
## 3rd Qu.:0.9327
## Max. :0.9475
## NA's :37
```

`confusionMatrix(test$Insp_pred_logit_class, test$Insp)`

```
## Confusion Matrix and Statistics
##
## Reference
## Prediction fraud ok
## fraud 126 2040
## ok 115 828
##
## Accuracy : 0.3069
## 95% CI : (0.2907, 0.3234)
## No Information Rate : 0.9225
## P-Value [Acc > NIR] : 1
##
## Kappa : -0.0405
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.52282
## Specificity : 0.28870
## Pos Pred Value : 0.05817
## Neg Pred Value : 0.87805
## Prevalence : 0.07752
## Detection Rate : 0.04053
## Detection Prevalence : 0.69669
## Balanced Accuracy : 0.40576
##
## 'Positive' Class : fraud
##
```

IN-CLASS EXERCISE 6: Estimate a logistic regression using `rel_uprice`

and `rel_uprice`

squared as a predictor. Use 0.93 probablity as a cutoff for classifying transaction as `ok`

. How does this mode do?

```
train <- train %>%
mutate(rel_uprice_sq = rel_uprice*rel_uprice)
test <- test %>%
mutate(rel_uprice_sq = rel_uprice*rel_uprice)
logit <- glm(Insp ~ rel_uprice + rel_uprice_sq, family="binomial", data=train)
summary(logit)
```

```
##
## Call:
## glm(formula = Insp ~ rel_uprice + rel_uprice_sq, family = "binomial",
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.4027 0.0809 0.0970 0.1549 2.5341
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.760e+00 1.214e-01 47.44 <2e-16 ***
## rel_uprice 4.789e-03 3.693e-04 12.97 <2e-16 ***
## rel_uprice_sq -2.043e-04 4.894e-06 -41.74 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6752.1 on 12436 degrees of freedom
## Residual deviance: 2838.2 on 12434 degrees of freedom
## (149 observations deleted due to missingness)
## AIC: 2844.2
##
## Number of Fisher Scoring iterations: 7
```

```
test$Insp_pred_logit_prob <- predict(logit, test, type="response")
test$Insp_pred_logit_class <- as.factor(ifelse(test$Insp_pred_logit_prob>0.93,"ok","fraud"))
summary(test)
```

```
## ID Prod Quant Val
## Length:3146 Length:3146 Min. : 100 Min. : 1005
## Class :character Class :character 1st Qu.: 111 1st Qu.: 1251
## Mode :character Mode :character Median : 434 Median : 12698
## Mean : 42093 Mean : 63518
## 3rd Qu.: 5766 3rd Qu.: 54239
## Max. :14645544 Max. :4616735
## NA's :26 NA's :12
## Insp av_uprice med_uprice rel_uprice
## fraud: 254 Min. : 0.0616 Min. : 0.0174 Min. :-199.34926
## ok :2892 1st Qu.: 9.6124 1st Qu.: 8.4807 1st Qu.: -41.62321
## Median : 16.0065 Median : 12.1429 Median : 0.00506
## Mean : 22.9048 Mean : 15.3369 Mean : 13.79070
## 3rd Qu.: 21.5857 3rd Qu.: 16.3366 3rd Qu.: 75.65502
## Max. :684.3182 Max. :329.3137 Max. : 199.54365
## NA's :37
## missing Insp_pred Insp_pred_logit_prob Insp_pred_logit_class
## both missing : 1 fraud: 197 Min. :0.03513 fraud: 446
## no missing :3109 ok :2949 1st Qu.:0.98125 ok :2663
## Quant missing: 25 Median :0.99419 NA's : 37
## Val missing : 11 Mean :0.92845
## 3rd Qu.:0.99650
## Max. :0.99695
## NA's :37
## rel_uprice_sq
## Min. : 0.0
## 1st Qu.: 505.4
## Median : 2748.8
## Mean : 7216.5
## 3rd Qu.: 9779.2
## Max. :39817.7
## NA's :37
```

`confusionMatrix(test$Insp_pred_logit_class, test$Insp)`

```
## Confusion Matrix and Statistics
##
## Reference
## Prediction fraud ok
## fraud 209 237
## ok 32 2631
##
## Accuracy : 0.9135
## 95% CI : (0.903, 0.9231)
## No Information Rate : 0.9225
## P-Value [Acc > NIR] : 0.9705
##
## Kappa : 0.5646
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.86722
## Specificity : 0.91736
## Pos Pred Value : 0.46861
## Neg Pred Value : 0.98798
## Prevalence : 0.07752
## Detection Rate : 0.06722
## Detection Prevalence : 0.14345
## Balanced Accuracy : 0.89229
##
## 'Positive' Class : fraud
##
```

Over-fitting is the tendency for algorithms to fit patterns that exist in the `train`

data but not in the `test`

data. This is particularly relevant when using decision trees. In principle, we could grow a really big tree that accurately classifies every single observation in the `train`

data. However, it is likely that some of these patterns captured in that tree would be specific to only the `train`

data and would do a poor job predicting classes in the `test`

data. The fundamental challenge in estimating a model is to find patterns that are enduring and hold true in any data. This is the reason why we evaluate our model predictions on a `test`

data, i.e. data that our algorithm did not see. If our model does poorly on `test`

data, it means that we likely over-fit our model, i.e. we fit patterns that exist in the `train`

data but not in the `test`

data.

One method that tries to avoid over-fitting is *random forest*. It is a popular variation on decision trees. It falls into the category of *ensemble* methods because it utilizes results from multiple algorithms. Random forest utilizes results from an estimation of multiple decision trees. These decision trees are estimated using different samples of the training data. This repeated sampling from the `train`

data means that random forest picks up patterns that exist in lots of different samples of the training data. This reduces the chances of over-fitting. When making a prediction, the values of the predictor variables are fed to these trees. Different trees may classify an observation differently but the final `class`

is the one that is the most common.

We will estimate random forest using `randomForest`

function from the package of the same name. The function requires that all categorical variables are specified as factors. We also need to make sure that the factor levels are the same in `test`

and `train`

data. Random forest method is particularly effective when we have a large number of predictors (because it tries different random set of predictors at each split and each tree). Therefore, we will include `Val`

and `Quant`

in addition to `rel_uprice`

and `missing`

.

One drawback of `randomForest`

is that it does not handle missing values. Therefore, before we can run it, we need to impute values to any variables and any observations that have `NA`

s. Below I choose to replace `NA`

s in `rel_uprice`

with zero, and `NA`

s in `Val`

and `Quant`

with their respective means in `train`

and `test`

data.

```
library(randomForest)
train <- train %>%
mutate(rel_uprice = ifelse(is.na(rel_uprice),0,rel_uprice),
Val = ifelse(is.na(Val),mean(Val, na.rm = TRUE),Val),
Quant = ifelse(is.na(Quant),mean(Quant, na.rm = TRUE),Quant))
test <- test %>%
mutate(rel_uprice = ifelse(is.na(rel_uprice),0,rel_uprice),
Val = ifelse(is.na(Val),mean(Val, na.rm = TRUE),Val),
Quant = ifelse(is.na(Quant),mean(Quant, na.rm = TRUE),Quant))
rf <- randomForest(Insp ~ rel_uprice + missing + Val + Quant, data = train)
rf
```

```
##
## Call:
## randomForest(formula = Insp ~ rel_uprice + missing + Val + Quant, data = train)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 2.44%
## Confusion matrix:
## fraud ok class.error
## fraud 795 221 0.217519685
## ok 86 11484 0.007433016
```

```
test$Insp_pred_rf <- predict(rf, test, type="class")
confusionMatrix(test$Insp_pred_rf, test$Insp)
```

```
## Confusion Matrix and Statistics
##
## Reference
## Prediction fraud ok
## fraud 195 19
## ok 59 2873
##
## Accuracy : 0.9752
## 95% CI : (0.9692, 0.9804)
## No Information Rate : 0.9193
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.82
##
## Mcnemar's Test P-Value : 1.006e-05
##
## Sensitivity : 0.76772
## Specificity : 0.99343
## Pos Pred Value : 0.91121
## Neg Pred Value : 0.97988
## Prevalence : 0.08074
## Detection Rate : 0.06198
## Detection Prevalence : 0.06802
## Balanced Accuracy : 0.88057
##
## 'Positive' Class : fraud
##
```

We have Kappa of 0.86 and super-high specificities and sensitivities. Random forest is a substantial improvement over single decision tree algorithms we used above.

- Load the
`titanic_train.csv`

data again. It has the following variables:

`Survived`

Survival (0 = No; 1 = Yes)`Pclass`

Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)`Name`

Name`Sex`

Sex`Age`

Age (in years)`SibSp`

Number of Siblings/Spouses Aboard`Parch`

Number of Parents/Children Aboard`Ticket`

Ticket Number`Fare`

Passenger Fare`Cabin`

Cabin`Embarked`

Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

Which variables do you think may be good predictors of the survival on the Titanic? Create two or three graphs similar to those in lab 11 (the data exploration lab). (Hint1: You may want to turn the

`Survived`

variable into a factor using the`factor()`

function. Hint2: You may want to add`+ geom_jitter()`

to plots involving qualitative variables.)Estimate a decision tree predicting survival using age and sex as predictors. Describe your results.

Estimate a decision tree using age, sex and passenger class. Describe your results. Use option

`minsplit=50`

to prune the tree.Estimate your own decision tree with your own set of predictors (you are, of course, free to include the predictors we used above).

Download

`titanic_test.csv`

from Nexus. This is a`test`

data from Kaggle. It has information on 418 passengers except we actually don't know whether they survived or not. Use your favorite model from above to predict which of these passengers survived. Show your predictions for the first 6 passengers.Even though

*we*don't know the fate of the passengers in the test data set, Kaggle does. In fact, Kaggle will evaluate our predictions and tell us how many we got right. Moreover, Kaggle will compare the accuracy of our predictions to those of other participants in the competition. All we have to do is register with Kaggle and create a .csv file that contains two columns:`PassengerId`

and`Survived`

, where`Survived`

contains our predictions: 0 for did not survive, and 1 for survived. You can use`write_csv()`

to make that`.csv`

file (make sure to`select()`

only`PassengerId`

and`Survived`

. Submit your predictions to Kaggle and report on your accuracy and rank compared to other participants. You can make up to 5 submissions per day if you would like to experiment with different models, and improve your submission. Feel free to try`randomForest`

. Take a screenshot of your submission result, and attach it to your lab. I will pay five dollars to the team with the highest rank. Good luck!