# Lab 12: Decision Trees II: Toy Example and Detecting Fraudulent Transactions

### Learning Objectives

• understanding the nature decision tree algorithm
• estimating decision tree using rpart function
• making predictions based on a decision tree
• over-fitting
• randomForest

### 1. A toy example

Let's first apply decision trees to a toy data set. We have two variables: Insp which can be either ok or fraud, and rel_uprice which we will use to classify Insp as either ok or fraud.

library(tidyverse)
library(caret)
toy <- data.frame(Insp=c("ok","ok","ok", "fraud","ok", "fraud"),
rel_uprice=c(1,2,2,90,100,110))
toy
##    Insp rel_uprice
## 1    ok          1
## 2    ok          2
## 3    ok          2
## 4 fraud         90
## 5    ok        100
## 6 fraud        110

Decision tree algorithm uses the predictor variable, rel_uprice to split data set into parts such that each part is relatively homogeneous in terms of the classes it contains. Our data is already sorted by relative unit price. Notice that if we split the data into observations where unit price is greater than 2 we get a split where one part is completely homogeneous (all ok transactions) and the other part is two thirds fraudulent. So, a reasonable decision tree would be to classify all transactions that have price less than, say, 46 as ok and others as fraud.

Let's see what the rpart algorithm thinks. rpart is one of a number of decision tree algorithms available for R. (Lantz discusses C5.0). I like rpart because it has nice visualization feature, rpart.plot described here. Plus, its syntax is consistent with other methods such as randomForest. The first argument to rpart is a formula in the familiar format of Y ~ X1 + X2 where Y is the variable we are trying to predict, X1, X2 are predictor variables. The second argument is data, and third contains options that let us control the complexity of the tree. For example, minsplit specifies the minimum number of observations for the algorithm to attempt a split. The lower the number, the more splits will be attempted, and the larger, more complex the tree.

library(rpart)
library(rpart.plot)
tree <- rpart(Insp ~ rel_uprice, data=toy, minsplit=3)
summary(tree) 
## Call:
## rpart(formula = Insp ~ rel_uprice, data = toy, minsplit = 3)
##   n= 6
##
##     CP nsplit rel error xerror      xstd
## 1 0.50      0       1.0    1.0 0.5773503
## 2 0.01      1       0.5    1.5 0.6123724
##
## Variable importance
## rel_uprice
##        100
##
## Node number 1: 6 observations,    complexity param=0.5
##   predicted class=ok     expected loss=0.3333333  P(node) =1
##     class counts:     2     4
##    probabilities: 0.333 0.667
##   left son=2 (3 obs) right son=3 (3 obs)
##   Primary splits:
##       rel_uprice < 46 to the right, improve=1.333333, (0 missing)
##
## Node number 2: 3 observations
##   predicted class=fraud  expected loss=0.3333333  P(node) =0.5
##     class counts:     2     1
##    probabilities: 0.667 0.333
##
## Node number 3: 3 observations
##   predicted class=ok     expected loss=0  P(node) =0.5
##     class counts:     0     3
##    probabilities: 0.000 1.000
rpart.plot(tree) Well, we were right. The algorithm says that if price is greater than or equal to 46, the transaction is classified as fraud, if it is less than 46, the transaction is classified as ok. Thus, we have a tree with two branches. In the left branch observations are classified as fraud.

In each node we have three pieces of information. First, we see the class label telling us how the observation in that node are classified (ok vs fraud). Second, we see the probability of the class being predicted (in our case ok) in that branch of the tree - it is the fraction of ok observations in that branch. For example, in the left branch (labeled fraud) the probability of ok transaction is 33%. In the right branch (labeled ok) the probability of ok transaction is 100%. The last number is the percentage of observations in that node. In our case the observations are split 50/50 between the two nodes.

IN-CLASS EXERCISE 1: Take a look at the data below. What do you think the decision tree should look like if we use rel_uprice and missing as predictors? How did the algorithm handle the missing values of rel_uprice? How would the logistic regression handle it?

IN-CLASS EXERCISE 2: Suppose rel_uprice was constant. What would the algorithm do?

### 2. Predicting with decision trees

Suppose we have three new transactions where we don't know whether they are ok or fraud, but we know the rel_uprice associated with each transaction. We can have function predict() plug these values into our original decision tree, tree and predict will tell us how the tree classified these transactions:

test <- data.frame(rel_uprice=c(200,20,NA))

test$Insp_predicted <- predict(tree, test, type="class") test ## rel_uprice Insp_predicted ## 1 200 fraud ## 2 20 ok ## 3 NA ok ### 3. Application to detecting fraudulent transactions Let's load data on sales reports from lab 11. Here we will use only transactions that have been inspected. Let's turn Insp into a factor because that is what rpart will be looking for. Let's also make missing a factor since that is what randomForest will be looking for in section 7. inspected <- read_csv("inspected.csv") inspected$Insp <- as.factor(inspected$Insp) inspected$missing <- as.factor(inspected$missing) Let's now create a training and testing data sets. We will use function sample as we did in the past. We will use at 80/20 split. set.seed(364) #set the seed for reproducibility sample <- sample(nrow(inspected), round(nrow(inspected)*0.8)) train <- inspected[sample,] test <- inspected[-sample,] Since fraudulent transaction are pretty infrequent, and our inspected data set is not particularly large, we should check that we have roughly the same proportion of fraudulent transactions in both test and train data sets. prop.table(table(train$Insp))
##
##      fraud         ok
## 0.08072461 0.91927539
prop.table(table(test$Insp)) ## ## fraud ok ## 0.08073744 0.91926256 ### 4. Estimate a decision tree Let's use rel_uprice and missing as predictors of Insp. We will not include the minsplit option as we did above because we are no longer working with a "toy" data set. This leaves it at the default value of 20. tree <- rpart(Insp ~ rel_uprice + missing, data=train) summary(tree) ## Call: ## rpart(formula = Insp ~ rel_uprice + missing, data = train) ## n= 12586 ## ## CP nsplit rel error xerror xstd ## 1 0.29232283 0 1.0000000 1.0000000 0.03007987 ## 2 0.23425197 1 0.7076772 0.7096457 0.02566043 ## 3 0.02854331 2 0.4734252 0.4881890 0.02148406 ## 4 0.01000000 3 0.4448819 0.4596457 0.02087153 ## ## Variable importance ## rel_uprice missing ## 95 5 ## ## Node number 1: 12586 observations, complexity param=0.2923228 ## predicted class=ok expected loss=0.08072461 P(node) =1 ## class counts: 1016 11570 ## probabilities: 0.081 0.919 ## left son=2 (459 obs) right son=3 (12127 obs) ## Primary splits: ## rel_uprice < -154.9215 to the left, improve=531.17180, (149 missing) ## missing splits as RRRL, improve= 48.77332, (0 missing) ## ## Node number 2: 459 observations ## predicted class=fraud expected loss=0.1764706 P(node) =0.03646909 ## class counts: 378 81 ## probabilities: 0.824 0.176 ## ## Node number 3: 12127 observations, complexity param=0.234252 ## predicted class=ok expected loss=0.05260988 P(node) =0.9635309 ## class counts: 638 11489 ## probabilities: 0.053 0.947 ## left son=6 (342 obs) right son=7 (11785 obs) ## Primary splits: ## rel_uprice < 180.2213 to the right, improve=450.0984, (149 missing) ## missing splits as RRRL, improve= 52.1244, (0 missing) ## ## Node number 6: 342 observations ## predicted class=fraud expected loss=0.1520468 P(node) =0.02717305 ## class counts: 290 52 ## probabilities: 0.848 0.152 ## ## Node number 7: 11785 observations, complexity param=0.02854331 ## predicted class=ok expected loss=0.02952906 P(node) =0.9363579 ## class counts: 348 11437 ## probabilities: 0.030 0.970 ## left son=14 (35 obs) right son=15 (11750 obs) ## Primary splits: ## missing splits as RRRL, improve=54.95882, (0 missing) ## rel_uprice < 156.4715 to the right, improve=39.57389, (149 missing) ## ## Node number 14: 35 observations ## predicted class=fraud expected loss=0.08571429 P(node) =0.002780868 ## class counts: 32 3 ## probabilities: 0.914 0.086 ## ## Node number 15: 11750 observations ## predicted class=ok expected loss=0.02689362 P(node) =0.933577 ## class counts: 316 11434 ## probabilities: 0.027 0.973 rpart.plot(tree) We grew a tree with 4 branches. It tells us that if price is less than negative 155, classify transaction as fraud. If not, ask if price is greater than or equal to 180. If yes, classify a transaction as fraud. If not, ask if missing is equal to "Val missing". If yes, classify transaction as fraud, or else classify transaction as ok. In short, if the price is really high or really low, transaction is a fraud. Otherwise transaction is fraud only if Val is missing. That is a pretty reasonable decision tree. Let's examine the information in the output. In the leftmost bottom bucket we see that 4% of transactions have price below negative 155. They are all classified as fraud even though 17% of these transactions are actually ok. In the rightmost bottom bucket we see that about 94% of transactions in the train data are classified as ok. Of those transactions, 97% are actually ok. Let's see how it actually does in classifying transactions in the test data set. ### 5. Make and evaluate predictions test$Insp_pred <- predict(tree, test, type="class")
confusionMatrix(test$Insp_pred, test$Insp)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction fraud   ok
##      fraud   166   31
##      ok       88 2861
##
##                Accuracy : 0.9622
##                  95% CI : (0.9549, 0.9686)
##     No Information Rate : 0.9193
##     P-Value [Acc > NIR] : < 2.2e-16
##
##                   Kappa : 0.7161
##
##  Mcnemar's Test P-Value : 2.844e-07
##
##             Sensitivity : 0.65354
##             Specificity : 0.98928
##          Pos Pred Value : 0.84264
##          Neg Pred Value : 0.97016
##              Prevalence : 0.08074
##          Detection Rate : 0.05277
##    Detection Prevalence : 0.06262
##       Balanced Accuracy : 0.82141
##
##        'Positive' Class : fraud
## 

We achieved 96% accuracy which is better than 92% which we would get if declared every transaction as ok. Amazingly, Kappa is 0.72 suggesting that our predictors make a lot of difference.

IN-CLASS EXERCISE 3: Estimate decision tree using only rel_uprice as the sole predictor. Does the tree do better than our model above?

IN-CLASS EXERCISE 4: Estimate decision tree using only Val and Quant as predictors. Does the tree do better than the our model above?

IN-CLASS EXERCISE 5: Estimate a logistic regression using rel_uprice as a predictor. Use 0.93 probablity as a cutoff for classifying transaction as ok.How does this mode do?

logit <- glm(Insp ~ rel_uprice, family="binomial", data=train)
summary(logit)
##
## Call:
## glm(formula = Insp ~ rel_uprice, family = "binomial", data = train)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -2.4281   0.3680   0.4001   0.4168   0.4927
##
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.4686713  0.0336694  73.321  < 2e-16 ***
## rel_uprice  0.0021298  0.0003983   5.347 8.95e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 6752.1  on 12436  degrees of freedom
## Residual deviance: 6723.3  on 12435  degrees of freedom
##   (149 observations deleted due to missingness)
## AIC: 6727.3
##
## Number of Fisher Scoring iterations: 5
test$Insp_pred_logit_prob <- predict(logit, test, type="response") test$Insp_pred_logit_class <- as.factor(ifelse(test$Insp_pred_logit_prob>0.93,"ok","fraud")) summary(test) ## ID Prod Quant Val ## Length:3146 Length:3146 Min. : 100 Min. : 1005 ## Class :character Class :character 1st Qu.: 111 1st Qu.: 1251 ## Mode :character Mode :character Median : 434 Median : 12698 ## Mean : 42093 Mean : 63518 ## 3rd Qu.: 5766 3rd Qu.: 54239 ## Max. :14645544 Max. :4616735 ## NA's :26 NA's :12 ## Insp av_uprice med_uprice rel_uprice ## fraud: 254 Min. : 0.0616 Min. : 0.0174 Min. :-199.34926 ## ok :2892 1st Qu.: 9.6124 1st Qu.: 8.4807 1st Qu.: -41.62321 ## Median : 16.0065 Median : 12.1429 Median : 0.00506 ## Mean : 22.9048 Mean : 15.3369 Mean : 13.79070 ## 3rd Qu.: 21.5857 3rd Qu.: 16.3366 3rd Qu.: 75.65502 ## Max. :684.3182 Max. :329.3137 Max. : 199.54365 ## NA's :37 ## missing Insp_pred Insp_pred_logit_prob Insp_pred_logit_class ## both missing : 1 fraud: 197 Min. :0.8853 fraud:2166 ## no missing :3109 ok :2949 1st Qu.:0.9153 ok : 943 ## Quant missing: 25 Median :0.9219 NA's : 37 ## Val missing : 11 Mean :0.9231 ## 3rd Qu.:0.9327 ## Max. :0.9475 ## NA's :37 confusionMatrix(test$Insp_pred_logit_class, test$Insp) ## Confusion Matrix and Statistics ## ## Reference ## Prediction fraud ok ## fraud 126 2040 ## ok 115 828 ## ## Accuracy : 0.3069 ## 95% CI : (0.2907, 0.3234) ## No Information Rate : 0.9225 ## P-Value [Acc > NIR] : 1 ## ## Kappa : -0.0405 ## ## Mcnemar's Test P-Value : <2e-16 ## ## Sensitivity : 0.52282 ## Specificity : 0.28870 ## Pos Pred Value : 0.05817 ## Neg Pred Value : 0.87805 ## Prevalence : 0.07752 ## Detection Rate : 0.04053 ## Detection Prevalence : 0.69669 ## Balanced Accuracy : 0.40576 ## ## 'Positive' Class : fraud ##  IN-CLASS EXERCISE 6: Estimate a logistic regression using rel_uprice and rel_uprice squared as a predictor. Use 0.93 probablity as a cutoff for classifying transaction as ok. How does this mode do? train <- train %>% mutate(rel_uprice_sq = rel_uprice*rel_uprice) test <- test %>% mutate(rel_uprice_sq = rel_uprice*rel_uprice) logit <- glm(Insp ~ rel_uprice + rel_uprice_sq, family="binomial", data=train) summary(logit) ## ## Call: ## glm(formula = Insp ~ rel_uprice + rel_uprice_sq, family = "binomial", ## data = train) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -3.4027 0.0809 0.0970 0.1549 2.5341 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 5.760e+00 1.214e-01 47.44 <2e-16 *** ## rel_uprice 4.789e-03 3.693e-04 12.97 <2e-16 *** ## rel_uprice_sq -2.043e-04 4.894e-06 -41.74 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for binomial family taken to be 1) ## ## Null deviance: 6752.1 on 12436 degrees of freedom ## Residual deviance: 2838.2 on 12434 degrees of freedom ## (149 observations deleted due to missingness) ## AIC: 2844.2 ## ## Number of Fisher Scoring iterations: 7 test$Insp_pred_logit_prob <- predict(logit, test, type="response")
test$Insp_pred_logit_class <- as.factor(ifelse(test$Insp_pred_logit_prob>0.93,"ok","fraud"))
summary(test)
##       ID                Prod               Quant               Val
##  Length:3146        Length:3146        Min.   :     100   Min.   :   1005
##  Class :character   Class :character   1st Qu.:     111   1st Qu.:   1251
##  Mode  :character   Mode  :character   Median :     434   Median :  12698
##                                        Mean   :   42093   Mean   :  63518
##                                        3rd Qu.:    5766   3rd Qu.:  54239
##                                        Max.   :14645544   Max.   :4616735
##                                        NA's   :26         NA's   :12
##     Insp        av_uprice          med_uprice         rel_uprice
##  fraud: 254   Min.   :  0.0616   Min.   :  0.0174   Min.   :-199.34926
##  ok   :2892   1st Qu.:  9.6124   1st Qu.:  8.4807   1st Qu.: -41.62321
##               Median : 16.0065   Median : 12.1429   Median :   0.00506
##               Mean   : 22.9048   Mean   : 15.3369   Mean   :  13.79070
##               3rd Qu.: 21.5857   3rd Qu.: 16.3366   3rd Qu.:  75.65502
##               Max.   :684.3182   Max.   :329.3137   Max.   : 199.54365
##                                                     NA's   :37
##           missing     Insp_pred    Insp_pred_logit_prob Insp_pred_logit_class
##  both missing :   1   fraud: 197   Min.   :0.03513      fraud: 446
##  no missing   :3109   ok   :2949   1st Qu.:0.98125      ok   :2663
##  Quant missing:  25                Median :0.99419      NA's :  37
##  Val missing  :  11                Mean   :0.92845
##                                    3rd Qu.:0.99650
##                                    Max.   :0.99695
##                                    NA's   :37
##  rel_uprice_sq
##  Min.   :    0.0
##  1st Qu.:  505.4
##  Median : 2748.8
##  Mean   : 7216.5
##  3rd Qu.: 9779.2
##  Max.   :39817.7
##  NA's   :37
confusionMatrix(test$Insp_pred_logit_class, test$Insp)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction fraud   ok
##      fraud   209  237
##      ok       32 2631
##
##                Accuracy : 0.9135
##                  95% CI : (0.903, 0.9231)
##     No Information Rate : 0.9225
##     P-Value [Acc > NIR] : 0.9705
##
##                   Kappa : 0.5646
##
##  Mcnemar's Test P-Value : <2e-16
##
##             Sensitivity : 0.86722
##             Specificity : 0.91736
##          Pos Pred Value : 0.46861
##          Neg Pred Value : 0.98798
##              Prevalence : 0.07752
##          Detection Rate : 0.06722
##    Detection Prevalence : 0.14345
##       Balanced Accuracy : 0.89229
##
##        'Positive' Class : fraud
## 

### 6. Over-fitting

Over-fitting is the tendency for algorithms to fit patterns that exist in the train data but not in the test data. This is particularly relevant when using decision trees. In principle, we could grow a really big tree that accurately classifies every single observation in the train data. However, it is likely that some of these patterns captured in that tree would be specific to only the train data and would do a poor job predicting classes in the test data. The fundamental challenge in estimating a model is to find patterns that are enduring and hold true in any data. This is the reason why we evaluate our model predictions on a test data, i.e. data that our algorithm did not see. If our model does poorly on test data, it means that we likely over-fit our model, i.e. we fit patterns that exist in the train data but not in the test data.

### 7. Random Forest

One method that tries to avoid over-fitting is random forest. It is a popular variation on decision trees. It falls into the category of ensemble methods because it utilizes results from multiple algorithms. Random forest utilizes results from an estimation of multiple decision trees. These decision trees are estimated using different samples of the training data. This repeated sampling from the train data means that random forest picks up patterns that exist in lots of different samples of the training data. This reduces the chances of over-fitting. When making a prediction, the values of the predictor variables are fed to these trees. Different trees may classify an observation differently but the final class is the one that is the most common.

We will estimate random forest using randomForest function from the package of the same name. The function requires that all categorical variables are specified as factors. We also need to make sure that the factor levels are the same in test and train data. Random forest method is particularly effective when we have a large number of predictors (because it tries different random set of predictors at each split and each tree). Therefore, we will include Val and Quant in addition to rel_uprice and missing.

One drawback of randomForest is that it does not handle missing values. Therefore, before we can run it, we need to impute values to any variables and any observations that have NAs. Below I choose to replace NAs in rel_uprice with zero, and NAs in Val and Quant with their respective means in train and test data.

library(randomForest)
train <- train %>%
mutate(rel_uprice = ifelse(is.na(rel_uprice),0,rel_uprice),
Val = ifelse(is.na(Val),mean(Val, na.rm = TRUE),Val),
Quant = ifelse(is.na(Quant),mean(Quant, na.rm = TRUE),Quant))
test <- test %>%
mutate(rel_uprice = ifelse(is.na(rel_uprice),0,rel_uprice),
Val = ifelse(is.na(Val),mean(Val, na.rm = TRUE),Val),
Quant = ifelse(is.na(Quant),mean(Quant, na.rm = TRUE),Quant))

rf <- randomForest(Insp ~ rel_uprice + missing + Val + Quant, data = train)
rf
##
## Call:
##  randomForest(formula = Insp ~ rel_uprice + missing + Val + Quant,      data = train)
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
##
##         OOB estimate of  error rate: 2.44%
## Confusion matrix:
##       fraud    ok class.error
## fraud   795   221 0.217519685
## ok       86 11484 0.007433016
test$Insp_pred_rf <- predict(rf, test, type="class") confusionMatrix(test$Insp_pred_rf, test\$Insp)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction fraud   ok
##      fraud   195   19
##      ok       59 2873
##
##                Accuracy : 0.9752
##                  95% CI : (0.9692, 0.9804)
##     No Information Rate : 0.9193
##     P-Value [Acc > NIR] : < 2.2e-16
##
##                   Kappa : 0.82
##
##  Mcnemar's Test P-Value : 1.006e-05
##
##             Sensitivity : 0.76772
##             Specificity : 0.99343
##          Pos Pred Value : 0.91121
##          Neg Pred Value : 0.97988
##              Prevalence : 0.08074
##          Detection Rate : 0.06198
##    Detection Prevalence : 0.06802
##       Balanced Accuracy : 0.88057
##
##        'Positive' Class : fraud
## 

We have Kappa of 0.86 and super-high specificities and sensitivities. Random forest is a substantial improvement over single decision tree algorithms we used above.

### Exercises:

1. Load the titanic_train.csv data again. It has the following variables:
• Survived Survival (0 = No; 1 = Yes)
• Pclass Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
• Name Name
• Sex Sex
• Age Age (in years)
• SibSp Number of Siblings/Spouses Aboard
• Parch Number of Parents/Children Aboard
• Ticket Ticket Number
• Fare Passenger Fare
• Cabin Cabin
• Embarked Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
1. Which variables do you think may be good predictors of the survival on the Titanic? Create two or three graphs similar to those in lab 11 (the data exploration lab). (Hint1: You may want to turn the Survived variable into a factor using the factor() function. Hint2: You may want to add + geom_jitter() to plots involving qualitative variables.)

2. Estimate a decision tree predicting survival using age and sex as predictors. Describe your results.

3. Estimate a decision tree using age, sex and passenger class. Describe your results. Use option minsplit=50 to prune the tree.

4. Estimate your own decision tree with your own set of predictors (you are, of course, free to include the predictors we used above).

5. Download titanic_test.csv from Nexus. This is a test data from Kaggle. It has information on 418 passengers except we actually don't know whether they survived or not. Use your favorite model from above to predict which of these passengers survived. Show your predictions for the first 6 passengers.

6. Even though we don't know the fate of the passengers in the test data set, Kaggle does. In fact, Kaggle will evaluate our predictions and tell us how many we got right. Moreover, Kaggle will compare the accuracy of our predictions to those of other participants in the competition. All we have to do is register with Kaggle and create a .csv file that contains two columns: PassengerId and Survived, where Survived contains our predictions: 0 for did not survive, and 1 for survived. You can use write_csv() to make that .csv file (make sure to select() only PassengerId and Survived. Submit your predictions to Kaggle and report on your accuracy and rank compared to other participants. You can make up to 5 submissions per day if you would like to experiment with different models, and improve your submission. Feel free to try randomForest. Take a screenshot of your submission result, and attach it to your lab. I will pay five dollars to the team with the highest rank. Good luck!