Let's first apply decision trees to a toy data set. We have two variables: Insp, which can be either ok or fraud, and rel_uprice, which we will use to classify Insp as either ok or fraud.
library(tidyverse)
library(caret)
toy <- data.frame(Insp=c("ok","ok","ok", "fraud","ok", "fraud"),
rel_uprice=c(1,2,2,90,100,110))
toy
## Insp rel_uprice
## 1 ok 1
## 2 ok 2
## 3 ok 2
## 4 fraud 90
## 5 ok 100
## 6 fraud 110
The decision tree algorithm uses the predictor variable, rel_uprice, to split the data set into parts such that each part is relatively homogeneous in terms of the classes it contains. Our data is already sorted by relative unit price. Notice that if we split the data into observations where unit price is greater than 2, we get a split where one part is completely homogeneous (all ok transactions) and the other part is two-thirds fraudulent. So, a reasonable decision tree would classify all transactions with a price less than, say, 46 as ok and the rest as fraud.
Let's see what the rpart algorithm thinks. rpart is one of a number of decision tree algorithms available for R (Lantz discusses C5.0). I like rpart because it has a nice visualization function, rpart.plot, described here. Plus, its syntax is consistent with other methods such as randomForest. The first argument to rpart is a formula in the familiar format Y ~ X1 + X2, where Y is the variable we are trying to predict and X1, X2 are predictor variables. The second argument is the data, and the third contains options that let us control the complexity of the tree. For example, minsplit specifies the minimum number of observations required for the algorithm to attempt a split. The lower the number, the more splits will be attempted, and the larger and more complex the tree.
library(rpart)
library(rpart.plot)
tree <- rpart(Insp ~ rel_uprice, data=toy, minsplit=3)
summary(tree)
## Call:
## rpart(formula = Insp ~ rel_uprice, data = toy, minsplit = 3)
## n= 6
##
## CP nsplit rel error xerror xstd
## 1 0.50 0 1.0 1.0 0.5773503
## 2 0.01 1 0.5 1.5 0.6123724
##
## Variable importance
## rel_uprice
## 100
##
## Node number 1: 6 observations, complexity param=0.5
## predicted class=ok expected loss=0.3333333 P(node) =1
## class counts: 2 4
## probabilities: 0.333 0.667
## left son=2 (3 obs) right son=3 (3 obs)
## Primary splits:
## rel_uprice < 46 to the right, improve=1.333333, (0 missing)
##
## Node number 2: 3 observations
## predicted class=fraud expected loss=0.3333333 P(node) =0.5
## class counts: 2 1
## probabilities: 0.667 0.333
##
## Node number 3: 3 observations
## predicted class=ok expected loss=0 P(node) =0.5
## class counts: 0 3
## probabilities: 0.000 1.000
rpart.plot(tree)
Well, we were right. The algorithm says that if the price is greater than or equal to 46, the transaction is classified as fraud; if it is less than 46, the transaction is classified as ok. Thus, we have a tree with two branches. In the left branch observations are classified as fraud, and in the right branch as ok.
In each node we have three pieces of information. First, we see the class label telling us how the observations in that node are classified (ok vs fraud). Second, we see the probability of the reference class (in our case ok) in that branch of the tree; it is the fraction of ok observations in that branch. For example, in the left branch (labeled fraud) the probability of an ok transaction is 33%. In the right branch (labeled ok) the probability of an ok transaction is 100%. The last number is the percentage of observations in that node. In our case the observations are split 50/50 between the two nodes.
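The same per-node probabilities can also be read off numerically. As a quick check, asking rpart's predict method for type="prob" returns each observation's class probabilities, i.e. the probabilities of the node it falls into:
predict(tree, toy, type = "prob") # one row per observation: P(fraud), P(ok)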
IN-CLASS EXERCISE 1: Take a look at the data below. What do you think the decision tree should look like if we use rel_uprice and missing as predictors? How did the algorithm handle the missing values of rel_uprice? How would logistic regression handle them?
IN-CLASS EXERCISE 2: Suppose rel_uprice was constant. What would the algorithm do?
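You can check your intuition for Exercise 2 empirically. A quick sketch, using a hypothetical toy2 data set in which the predictor never varies:
toy2 <- data.frame(Insp = c("ok","ok","ok","fraud","ok","fraud"),
rel_uprice = rep(1, 6)) # rel_uprice is now constant
rpart(Insp ~ rel_uprice, data = toy2, minsplit = 3) # what does the printed tree look like?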
Suppose we have three new transactions where we don't know whether they are ok or fraud, but we know the rel_uprice associated with each transaction. We can have the function predict() plug these values into our original decision tree, tree, and it will tell us how the tree classified these transactions:
test <- data.frame(rel_uprice=c(200,20,NA))
test$Insp_predicted <- predict(tree, test, type="class")
test
## rel_uprice Insp_predicted
## 1 200 fraud
## 2 20 ok
## 3 NA ok
Let's load the data on sales reports from lab 11. Here we will use only transactions that have been inspected. Let's turn Insp into a factor because that is what rpart will be looking for. Let's also make missing a factor since that is what randomForest will be looking for in section 7.
inspected <- read_csv("inspected.csv")
inspected$Insp <- as.factor(inspected$Insp)
inspected$missing <- as.factor(inspected$missing)
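A quick sanity check that the conversion worked; levels() lists a factor's categories:
levels(inspected$Insp) # should list "fraud" and "ok"
levels(inspected$missing)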
Let's now create training and testing data sets. We will use the function sample as we did in the past, with an 80/20 split.
set.seed(364) #set the seed for reproducibility
sample <- sample(nrow(inspected), round(nrow(inspected)*0.8))
train <- inspected[sample,]
test <- inspected[-sample,]
Since fraudulent transactions are pretty infrequent, and our inspected data set is not particularly large, we should check that we have roughly the same proportion of fraudulent transactions in both the train and test data sets.
prop.table(table(train$Insp))
##
## fraud ok
## 0.08072461 0.91927539
prop.table(table(test$Insp))
##
## fraud ok
## 0.08073744 0.91926256
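The proportions happen to match closely here. If you want a split that is guaranteed to preserve class proportions, caret (which we loaded above) provides createDataPartition(), which samples within each class. A sketch of that alternative (not what we used above):
sample_strat <- createDataPartition(inspected$Insp, p = 0.8, list = FALSE)[, 1]
train_strat <- inspected[sample_strat, ] # stratified 80%
test_strat <- inspected[-sample_strat, ] # stratified 20%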
Let's use rel_uprice and missing as predictors of Insp. We will not include the minsplit option as we did above because we are no longer working with a "toy" data set. This leaves it at the default value of 20.
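If you are curious about the other defaults, rpart.control() returns all of them:
rpart.control()$minsplit # confirms the default of 20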
tree <- rpart(Insp ~ rel_uprice + missing, data=train)
summary(tree)
## Call:
## rpart(formula = Insp ~ rel_uprice + missing, data = train)
## n= 12586
##
## CP nsplit rel error xerror xstd
## 1 0.29232283 0 1.0000000 1.0000000 0.03007987
## 2 0.23425197 1 0.7076772 0.7096457 0.02566043
## 3 0.02854331 2 0.4734252 0.4881890 0.02148406
## 4 0.01000000 3 0.4448819 0.4596457 0.02087153
##
## Variable importance
## rel_uprice missing
## 95 5
##
## Node number 1: 12586 observations, complexity param=0.2923228
## predicted class=ok expected loss=0.08072461 P(node) =1
## class counts: 1016 11570
## probabilities: 0.081 0.919
## left son=2 (459 obs) right son=3 (12127 obs)
## Primary splits:
## rel_uprice < -154.9215 to the left, improve=531.17180, (149 missing)
## missing splits as RRRL, improve= 48.77332, (0 missing)
##
## Node number 2: 459 observations
## predicted class=fraud expected loss=0.1764706 P(node) =0.03646909
## class counts: 378 81
## probabilities: 0.824 0.176
##
## Node number 3: 12127 observations, complexity param=0.234252
## predicted class=ok expected loss=0.05260988 P(node) =0.9635309
## class counts: 638 11489
## probabilities: 0.053 0.947
## left son=6 (342 obs) right son=7 (11785 obs)
## Primary splits:
## rel_uprice < 180.2213 to the right, improve=450.0984, (149 missing)
## missing splits as RRRL, improve= 52.1244, (0 missing)
##
## Node number 6: 342 observations
## predicted class=fraud expected loss=0.1520468 P(node) =0.02717305
## class counts: 290 52
## probabilities: 0.848 0.152
##
## Node number 7: 11785 observations, complexity param=0.02854331
## predicted class=ok expected loss=0.02952906 P(node) =0.9363579
## class counts: 348 11437
## probabilities: 0.030 0.970
## left son=14 (35 obs) right son=15 (11750 obs)
## Primary splits:
## missing splits as RRRL, improve=54.95882, (0 missing)
## rel_uprice < 156.4715 to the right, improve=39.57389, (149 missing)
##
## Node number 14: 35 observations
## predicted class=fraud expected loss=0.08571429 P(node) =0.002780868
## class counts: 32 3
## probabilities: 0.914 0.086
##
## Node number 15: 11750 observations
## predicted class=ok expected loss=0.02689362 P(node) =0.933577
## class counts: 316 11434
## probabilities: 0.027 0.973
rpart.plot(tree)
We grew a tree with four branches. It tells us: if the price is less than negative 155, classify the transaction as fraud. If not, ask whether the price is greater than or equal to 180. If yes, classify the transaction as fraud. If not, ask whether missing is equal to "Val missing". If yes, classify the transaction as fraud; otherwise classify it as ok. In short, if the price is really high or really low, the transaction is a fraud. Otherwise the transaction is a fraud only if Val is missing. That is a pretty reasonable decision tree.
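If you prefer to read the tree as a set of text rules rather than a diagram, newer versions of the rpart.plot package include rpart.rules() (assuming your installed version has it):
rpart.rules(tree) # one row per leaf, with the conditions that lead there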
Let's examine the information in the output. In the leftmost bottom bucket we see that 4% of transactions have a price below negative 155. They are all classified as fraud even though about 18% of these transactions are actually ok. In the rightmost bottom bucket we see that about 93% of transactions in the train data are classified as ok. Of those transactions, 97% are actually ok.
Let's see how it actually does in classifying transactions in the test data set.
test$Insp_pred <- predict(tree, test, type="class")
confusionMatrix(test$Insp_pred, test$Insp)
## Confusion Matrix and Statistics
##
## Reference
## Prediction fraud ok
## fraud 166 31
## ok 88 2861
##
## Accuracy : 0.9622
## 95% CI : (0.9549, 0.9686)
## No Information Rate : 0.9193
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7161
##
## Mcnemar's Test P-Value : 2.844e-07
##
## Sensitivity : 0.65354
## Specificity : 0.98928
## Pos Pred Value : 0.84264
## Neg Pred Value : 0.97016
## Prevalence : 0.08074
## Detection Rate : 0.05277
## Detection Prevalence : 0.06262
## Balanced Accuracy : 0.82141
##
## 'Positive' Class : fraud
##
We achieved 96% accuracy, which is better than the 92% we would get if we declared every transaction ok. Amazingly, Kappa is 0.72, suggesting that our predictors make a lot of difference.
IN-CLASS EXERCISE 3: Estimate a decision tree using rel_uprice as the sole predictor. Does the tree do better than our model above?
IN-CLASS EXERCISE 4: Estimate a decision tree using only Val and Quant as predictors. Does the tree do better than our model above?
IN-CLASS EXERCISE 5: Estimate a logistic regression using rel_uprice as a predictor. Use a 0.93 probability cutoff for classifying a transaction as ok. How does this model do?
logit <- glm(Insp ~ rel_uprice, family="binomial", data=train)
summary(logit)
##
## Call:
## glm(formula = Insp ~ rel_uprice, family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4281 0.3680 0.4001 0.4168 0.4927
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.4686713 0.0336694 73.321 < 2e-16 ***
## rel_uprice 0.0021298 0.0003983 5.347 8.95e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6752.1 on 12436 degrees of freedom
## Residual deviance: 6723.3 on 12435 degrees of freedom
## (149 observations deleted due to missingness)
## AIC: 6727.3
##
## Number of Fisher Scoring iterations: 5
test$Insp_pred_logit_prob <- predict(logit, test, type="response")
test$Insp_pred_logit_class <- as.factor(ifelse(test$Insp_pred_logit_prob>0.93,"ok","fraud"))
summary(test)
## ID Prod Quant Val
## Length:3146 Length:3146 Min. : 100 Min. : 1005
## Class :character Class :character 1st Qu.: 111 1st Qu.: 1251
## Mode :character Mode :character Median : 434 Median : 12698
## Mean : 42093 Mean : 63518
## 3rd Qu.: 5766 3rd Qu.: 54239
## Max. :14645544 Max. :4616735
## NA's :26 NA's :12
## Insp av_uprice med_uprice rel_uprice
## fraud: 254 Min. : 0.0616 Min. : 0.0174 Min. :-199.34926
## ok :2892 1st Qu.: 9.6124 1st Qu.: 8.4807 1st Qu.: -41.62321
## Median : 16.0065 Median : 12.1429 Median : 0.00506
## Mean : 22.9048 Mean : 15.3369 Mean : 13.79070
## 3rd Qu.: 21.5857 3rd Qu.: 16.3366 3rd Qu.: 75.65502
## Max. :684.3182 Max. :329.3137 Max. : 199.54365
## NA's :37
## missing Insp_pred Insp_pred_logit_prob Insp_pred_logit_class
## both missing : 1 fraud: 197 Min. :0.8853 fraud:2166
## no missing :3109 ok :2949 1st Qu.:0.9153 ok : 943
## Quant missing: 25 Median :0.9219 NA's : 37
## Val missing : 11 Mean :0.9231
## 3rd Qu.:0.9327
## Max. :0.9475
## NA's :37
confusionMatrix(test$Insp_pred_logit_class, test$Insp)
## Confusion Matrix and Statistics
##
## Reference
## Prediction fraud ok
## fraud 126 2040
## ok 115 828
##
## Accuracy : 0.3069
## 95% CI : (0.2907, 0.3234)
## No Information Rate : 0.9225
## P-Value [Acc > NIR] : 1
##
## Kappa : -0.0405
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.52282
## Specificity : 0.28870
## Pos Pred Value : 0.05817
## Neg Pred Value : 0.87805
## Prevalence : 0.07752
## Detection Rate : 0.04053
## Detection Prevalence : 0.69669
## Balanced Accuracy : 0.40576
##
## 'Positive' Class : fraud
##
IN-CLASS EXERCISE 6: Estimate a logistic regression using rel_uprice and rel_uprice squared as predictors. Use a 0.93 probability cutoff for classifying a transaction as ok. How does this model do?
train <- train %>%
mutate(rel_uprice_sq = rel_uprice*rel_uprice)
test <- test %>%
mutate(rel_uprice_sq = rel_uprice*rel_uprice)
logit <- glm(Insp ~ rel_uprice + rel_uprice_sq, family="binomial", data=train)
summary(logit)
##
## Call:
## glm(formula = Insp ~ rel_uprice + rel_uprice_sq, family = "binomial",
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.4027 0.0809 0.0970 0.1549 2.5341
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.760e+00 1.214e-01 47.44 <2e-16 ***
## rel_uprice 4.789e-03 3.693e-04 12.97 <2e-16 ***
## rel_uprice_sq -2.043e-04 4.894e-06 -41.74 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6752.1 on 12436 degrees of freedom
## Residual deviance: 2838.2 on 12434 degrees of freedom
## (149 observations deleted due to missingness)
## AIC: 2844.2
##
## Number of Fisher Scoring iterations: 7
test$Insp_pred_logit_prob <- predict(logit, test, type="response")
test$Insp_pred_logit_class <- as.factor(ifelse(test$Insp_pred_logit_prob>0.93,"ok","fraud"))
summary(test)
## ID Prod Quant Val
## Length:3146 Length:3146 Min. : 100 Min. : 1005
## Class :character Class :character 1st Qu.: 111 1st Qu.: 1251
## Mode :character Mode :character Median : 434 Median : 12698
## Mean : 42093 Mean : 63518
## 3rd Qu.: 5766 3rd Qu.: 54239
## Max. :14645544 Max. :4616735
## NA's :26 NA's :12
## Insp av_uprice med_uprice rel_uprice
## fraud: 254 Min. : 0.0616 Min. : 0.0174 Min. :-199.34926
## ok :2892 1st Qu.: 9.6124 1st Qu.: 8.4807 1st Qu.: -41.62321
## Median : 16.0065 Median : 12.1429 Median : 0.00506
## Mean : 22.9048 Mean : 15.3369 Mean : 13.79070
## 3rd Qu.: 21.5857 3rd Qu.: 16.3366 3rd Qu.: 75.65502
## Max. :684.3182 Max. :329.3137 Max. : 199.54365
## NA's :37
## missing Insp_pred Insp_pred_logit_prob Insp_pred_logit_class
## both missing : 1 fraud: 197 Min. :0.03513 fraud: 446
## no missing :3109 ok :2949 1st Qu.:0.98125 ok :2663
## Quant missing: 25 Median :0.99419 NA's : 37
## Val missing : 11 Mean :0.92845
## 3rd Qu.:0.99650
## Max. :0.99695
## NA's :37
## rel_uprice_sq
## Min. : 0.0
## 1st Qu.: 505.4
## Median : 2748.8
## Mean : 7216.5
## 3rd Qu.: 9779.2
## Max. :39817.7
## NA's :37
confusionMatrix(test$Insp_pred_logit_class, test$Insp)
## Confusion Matrix and Statistics
##
## Reference
## Prediction fraud ok
## fraud 209 237
## ok 32 2631
##
## Accuracy : 0.9135
## 95% CI : (0.903, 0.9231)
## No Information Rate : 0.9225
## P-Value [Acc > NIR] : 0.9705
##
## Kappa : 0.5646
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.86722
## Specificity : 0.91736
## Pos Pred Value : 0.46861
## Neg Pred Value : 0.98798
## Prevalence : 0.07752
## Detection Rate : 0.06722
## Detection Prevalence : 0.14345
## Balanced Accuracy : 0.89229
##
## 'Positive' Class : fraud
##
Over-fitting is the tendency for algorithms to fit patterns that exist in the train data but not in the test data. This is particularly relevant when using decision trees. In principle, we could grow a really big tree that accurately classifies every single observation in the train data. However, it is likely that some of the patterns captured in that tree would be specific to the train data and would do a poor job predicting classes in the test data. The fundamental challenge in estimating a model is to find patterns that are enduring and hold true in any data. This is the reason why we evaluate our model predictions on test data, i.e. data that our algorithm did not see. If our model does poorly on the test data, it means that we likely over-fit our model, i.e. we fit patterns that exist in the train data but not in the test data.
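rpart has a built-in defense against over-fitting: the CP table we saw in summary() reports a cross-validated error (xerror) for each subtree size, and prune() cuts the tree back to a chosen complexity. A minimal sketch using our tree:
printcp(tree) # cross-validated error (xerror) by subtree size
best_cp <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]
pruned <- prune(tree, cp = best_cp) # keep only the splits that survive cross-validation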
One method that tries to avoid over-fitting is random forest, a popular variation on decision trees. It falls into the category of ensemble methods because it combines results from multiple algorithms; specifically, random forest aggregates the results of many decision trees, each estimated on a different sample of the training data. This repeated sampling from the train data means that random forest picks up patterns that exist in lots of different samples of the training data, which reduces the chances of over-fitting. When making a prediction, the values of the predictor variables are fed to all of these trees. Different trees may classify an observation differently, but the final class is the one that is the most common.
We will estimate a random forest using the randomForest function from the package of the same name. The function requires that all categorical variables are specified as factors. We also need to make sure that the factor levels are the same in the test and train data. The random forest method is particularly effective when we have a large number of predictors (because it tries a different random subset of predictors at each split of each tree). Therefore, we will include Val and Quant in addition to rel_uprice and missing.
One drawback of randomForest is that it does not handle missing values. Therefore, before we can run it, we need to impute values for any variables and observations that have NAs. Below I choose to replace NAs in rel_uprice with zero, and NAs in Val and Quant with their respective means in the train and test data.
library(randomForest)
train <- train %>%
mutate(rel_uprice = ifelse(is.na(rel_uprice),0,rel_uprice),
Val = ifelse(is.na(Val),mean(Val, na.rm = TRUE),Val),
Quant = ifelse(is.na(Quant),mean(Quant, na.rm = TRUE),Quant))
test <- test %>%
mutate(rel_uprice = ifelse(is.na(rel_uprice),0,rel_uprice),
Val = ifelse(is.na(Val),mean(Val, na.rm = TRUE),Val),
Quant = ifelse(is.na(Quant),mean(Quant, na.rm = TRUE),Quant))
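As an aside, the randomForest package also ships a simple imputer of its own, na.roughfix(), which replaces numeric NAs with column medians and factor NAs with the most frequent category. It can be passed as na.action instead of imputing by hand (a sketch, not what we do below):
rf_alt <- randomForest(Insp ~ rel_uprice + missing + Val + Quant,
data = train, na.action = na.roughfix)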
rf <- randomForest(Insp ~ rel_uprice + missing + Val + Quant, data = train)
rf
##
## Call:
## randomForest(formula = Insp ~ rel_uprice + missing + Val + Quant, data = train)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 2.44%
## Confusion matrix:
## fraud ok class.error
## fraud 795 221 0.217519685
## ok 86 11484 0.007433016
test$Insp_pred_rf <- predict(rf, test, type="class")
confusionMatrix(test$Insp_pred_rf, test$Insp)
## Confusion Matrix and Statistics
##
## Reference
## Prediction fraud ok
## fraud 195 19
## ok 59 2873
##
## Accuracy : 0.9752
## 95% CI : (0.9692, 0.9804)
## No Information Rate : 0.9193
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.82
##
## Mcnemar's Test P-Value : 1.006e-05
##
## Sensitivity : 0.76772
## Specificity : 0.99343
## Pos Pred Value : 0.91121
## Neg Pred Value : 0.97988
## Prevalence : 0.08074
## Detection Rate : 0.06198
## Detection Prevalence : 0.06802
## Balanced Accuracy : 0.88057
##
## 'Positive' Class : fraud
##
We have a Kappa of 0.82, higher sensitivity, and higher specificity than before. Random forest is a substantial improvement over the single decision trees we used above.
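To see which predictors the forest leaned on most, randomForest reports variable importance (the mean decrease in Gini impurity from splits on each variable):
importance(rf) # importance score for each predictor
varImpPlot(rf) # the same information as a dot plot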
Download the titanic_train.csv data again. It has the following variables:
Survived - Survival (0 = No; 1 = Yes)
Pclass - Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
Name - Name
Sex - Sex
Age - Age (in years)
SibSp - Number of Siblings/Spouses Aboard
Parch - Number of Parents/Children Aboard
Ticket - Ticket Number
Fare - Passenger Fare
Cabin - Cabin
Embarked - Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
Which variables do you think may be good predictors of survival on the Titanic? Create two or three graphs similar to those in lab 11 (the data exploration lab), along the lines of the sketch below. (Hint 1: You may want to turn the Survived variable into a factor using the factor() function. Hint 2: You may want to add + geom_jitter() to plots involving qualitative variables.)
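For example, a minimal sketch of one such graph, assuming you read the file into a data frame called titanic (the object name is a placeholder; the column names follow the list above):
titanic <- read_csv("titanic_train.csv")
ggplot(titanic, aes(x = Sex, y = Age, color = factor(Survived))) +
geom_jitter() # jitter spreads out the points over the qualitative Sex categories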
Estimate a decision tree predicting survival using age and sex as predictors. Describe your results.
Estimate a decision tree using age, sex and passenger class. Describe your results. Use the option minsplit=50 to prune the tree.
Estimate your own decision tree with your own set of predictors (you are, of course, free to include the predictors we used above).
Download titanic_test.csv from Nexus. This is test data from Kaggle. It has information on 418 passengers, except we don't actually know whether they survived or not. Use your favorite model from above to predict which of these passengers survived. Show your predictions for the first 6 passengers.
Even though we don't know the fate of the passengers in the test data set, Kaggle does. In fact, Kaggle will evaluate our predictions and tell us how many we got right. Moreover, Kaggle will compare the accuracy of our predictions to those of other participants in the competition. All we have to do is register with Kaggle and create a .csv file that contains two columns: PassengerId and Survived, where Survived contains our predictions: 0 for did not survive, and 1 for survived. You can use write_csv() to make that .csv file (make sure to select() only PassengerId and Survived). Submit your predictions to Kaggle and report on your accuracy and rank compared to other participants. You can make up to 5 submissions per day if you would like to experiment with different models and improve your submission. Feel free to try randomForest. Take a screenshot of your submission result, and attach it to your lab. I will pay five dollars to the team with the highest rank. Good luck!
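A minimal sketch of building the submission file, assuming your Kaggle test data frame is called titanic_test and your class predictions are stored in pred as a factor with levels "0" and "1" (both names are placeholders):
titanic_test$Survived <- as.integer(as.character(pred)) # convert factor predictions to 0/1
submission <- select(titanic_test, PassengerId, Survived)
write_csv(submission, "submission.csv")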