confusionMatrix()
The background reading for this lab is chapter 10 in Lantz.
In lab 9 we estimated a logistic model using all of the observations. This is appropriate in a “confirmatory” analysis where we test a particular hypothesis. In this class, however, we are interested in making predictions - using past data to predict the future. In order to test how well our predictions might work on future data, we always split our data into a train set and a test set. We then estimate our model on the train data and evaluate our predictions on the test data. Let’s use an 80/20 train/test split.
library(tidyverse)
library(descr)
loan <- read_csv("lending_club_cleaned.csv")
loan$status <- as.factor(loan$status)
set.seed(364)
sample <- sample(nrow(loan),floor(nrow(loan)*0.8))
train <- loan[sample,]
test <- loan[-sample,]
Let’s estimate the model with fico, dti, loan_amnt and purpose as predictors. The predict() function returns the probability that a loan is good.
logit4 <- glm(status ~ fico + dti + loan_amnt + purpose, data = train, family = "binomial")
test$pred <- predict(logit4, test, type="response")
head(test$pred)
## [1] 0.9044742 0.8220173 0.8638981 0.8550825 0.8062179 0.8413071
summary(test$pred)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4347 0.8107 0.8563 0.8498 0.8979 0.9719
We can see that the estimated probabilities of a loan being good range from 0.43 to 0.97 in the test data set. What probability cutoff should we use for classifying a loan as “good”? For now, let’s classify any loan that has more than 0.8 probability of being good as “good”; the rest are classified as “bad”.
test$status_pred80 <- ifelse(test$pred > 0.80, "good", "bad")
crosstab(test$status,test$status_pred80, prop.t=TRUE, plot=FALSE)
## Cell Contents
## |-------------------------|
## | Count |
## | Total Percent |
## |-------------------------|
##
## ====================================
## test$status_pred80
## test$status bad good Total
## ------------------------------------
## bad 427 818 1245
## 5.0% 9.6%
## ------------------------------------
## good 1265 5997 7262
## 14.9% 70.5%
## ------------------------------------
## Total 1692 6815 8507
## ====================================
We have an accuracy of 75.5% ((427 + 5997)/8507). We detect bad loans in 34% of cases, and we label good loans as good in 83% of cases.
In any classification problem, we make two types of errors: false positives (e.g. saying a loan is bad when in fact it is good) and false negatives (saying a loan is good when it is bad). Which class we designate as positive is arbitrary. In this case, we designated “bad” as positive. As in medical testing, we are testing whether a loan is bad: if our algorithm says the loan is bad, we say the test came out positive.
Similarly, there are two types of correct classifications: true positives (saying a loan is bad when it is bad) and true negatives (saying a loan is good when it is good). The rate at which we detect true positives is called sensitivity. It is calculated as the share of positive outcomes (bad loans) that were correctly classified as bad. Let’s calculate it using 0.8 as the probability cutoff.
\(sensitivity = \frac{TP}{TP+FN}=\frac{427}{1245}= 0.34\)
The rate at which we detect true negatives is called specificity. It is the share of negative outcomes (good loans) that were correctly classified as good.
\(specificity = \frac{TN}{TN+FP}=\frac{5997}{7262}=0.83\)
We correctly classified 34% of bad loans as bad, and we correctly classified 83% of good loans as good.
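As a quick check, these numbers can be computed directly from the cell counts in the crosstab above (the TP, FN, FP and TN objects below are just labels for this sketch):
# metrics at the 0.80 cutoff, computed from the crosstab cell counts
TP <- 427    # bad loans correctly classified as bad
FN <- 818    # bad loans classified as good
FP <- 1265   # good loans classified as bad
TN <- 5997   # good loans correctly classified as good
(TP + TN) / (TP + TN + FP + FN)   # accuracy, about 0.755
TP / (TP + FN)                    # sensitivity, about 0.34
TN / (TN + FP)                    # specificity, about 0.83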
Suppose we decide to use a stricter criterion for classifying a loan as good, say a 0.90 probability instead of 0.80.
test$status_pred90 <- ifelse(test$pred > 0.90, "good", "bad")
crosstab(test$status,test$status_pred90, prop.t=TRUE, plot=FALSE)
## Cell Contents
## |-------------------------|
## | Count |
## | Total Percent |
## |-------------------------|
##
## ====================================
## test$status_pred90
## test$status bad good Total
## ------------------------------------
## bad 1098 147 1245
## 12.9% 1.7%
## ------------------------------------
## good 5387 1875 7262
## 63.3% 22.0%
## ------------------------------------
## Total 6485 2022 8507
## ====================================
Do you think this is a better prediction? What happened to sensitivity and specificity?
\(sensitivity = \frac{TP}{TP+FN}=\frac{1098}{1245}= 0.88\)
\(specificity = \frac{TN}{TN+FP}=\frac{1875}{7262}=0.26\)
We classified 88% of bad loans correctly as bad – a huge increase in sensitivity, our ability to correctly detect bad loans. However, this came at the cost of specificity, which dropped from 83% to 26%. This is the fundamental trade-off between sensitivity and specificity. As we applied a more stringent criterion for a loan to be classified as good, we inevitably classified some good loans as bad.
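To see this trade-off across a whole range of cutoffs, a short sketch such as the one below computes sensitivity and specificity for several cutoffs (the grid of cutoffs is arbitrary, and the variable names are the ones already defined above):
# sensitivity and specificity at several cutoffs, to illustrate the trade-off
cutoffs <- c(0.70, 0.80, 0.85, 0.90, 0.95)
sapply(cutoffs, function(k) {
  pred_k <- ifelse(test$pred > k, "good", "bad")
  c(cutoff      = k,
    sensitivity = mean(pred_k[test$status == "bad"]  == "bad"),
    specificity = mean(pred_k[test$status == "good"] == "good"))
})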
confusionMatrix()
Where to strike the balance between sensitivity and specificity depends on the context of each problem. How many good loans are we willing to pass up in order to avoid financing one bad one? Is there a statistic that summarizes the quality of our predictions in one number? Yes, it is called kappa. Kappa measures how much better our accuracy is than the accuracy we would get by chance (i.e. assigning classes at random while keeping the predicted and actual class frequencies fixed). Kappa ranges from 0 (no better than chance) to 1 (perfect agreement). We will calculate kappa using the confusionMatrix() function from the caret package. The function takes the vector of class predictions as the first argument and the vector of actual classes as the second argument.
library(caret)
# predictions must be a factor with the same levels as the reference
confusionMatrix(factor(test$status_pred90, levels = levels(test$status)), test$status)
## Confusion Matrix and Statistics
##
## Reference
## Prediction bad good
## bad 1098 5387
## good 147 1875
##
## Accuracy : 0.3495
## 95% CI : (0.3393, 0.3597)
## No Information Rate : 0.8536
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0511
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8819
## Specificity : 0.2582
## Pos Pred Value : 0.1693
## Neg Pred Value : 0.9273
## Prevalence : 0.1464
## Detection Rate : 0.1291
## Detection Prevalence : 0.7623
## Balanced Accuracy : 0.5701
##
## 'Positive' Class : bad
##
We see that confusionMatrix calculates a number of performance metrics, including accuracy (35%), sensitivity, specificity, and kappa.
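To see where the kappa of roughly 0.05 comes from, here is a by-hand check using the standard kappa formula; the result should match the confusionMatrix() output above:
# reproduce kappa by hand from the confusion matrix above
tab <- table(test$status_pred90, test$status)      # predictions in rows, actual classes in columns
n   <- sum(tab)
acc     <- sum(diag(tab)) / n                       # observed accuracy, about 0.35
exp_acc <- sum(rowSums(tab) * colSums(tab)) / n^2   # accuracy expected by chance
(acc - exp_acc) / (1 - exp_acc)                     # kappa, about 0.05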
IN-CLASS EXERCISE:
1. Is kappa higher if we use 0.80 as a cutoff?
2. Is kappa higher if we use a logistic model with just fico and dti and 0.80 as a cutoff?
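A minimal sketch you could start from (it re-uses the objects defined above and coerces the predictions to factors so that confusionMatrix() accepts them; logit_fd and pred_fd are just illustrative names):
# question 1: kappa when we use the 0.80 cutoff
confusionMatrix(factor(test$status_pred80, levels = levels(test$status)), test$status)

# question 2: kappa for a model with only fico and dti, again at the 0.80 cutoff
logit_fd <- glm(status ~ fico + dti, data = train, family = "binomial")
pred_fd  <- ifelse(predict(logit_fd, test, type = "response") > 0.80, "good", "bad")
confusionMatrix(factor(pred_fd, levels = levels(test$status)), test$status)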
One way to visualize the trade-off between specificity and sensitivity is to plot their combinations. We know that as sensitivity rises, specificity will decline. The locus of these combinations is called an ROC curve. Before we plot it using the pROC package, answer these questions:
What would happen to sensitivity and specificity if we classified all loans as bad?
What would happen to sensitivity and specificity if we classified all loans as good?
library(pROC)
#creates an object with all sorts of diagnostics including sensitivities and specificities
roc <- roc(test$status,test$pred)
#we need to create a data frame that contains the sensitivities and specificities
roc_curves <- data.frame(sens=roc$sensitivities,
spec=roc$specificities,
model="logit 4")
ggplot(roc_curves, aes(x=spec, y=sens, color=model)) + geom_line()
Let’s plot the ROC curves for the logit4 and logit5 models in the same graph (logit5 is estimated below). This will let us compare each algorithm’s performance in terms of sensitivity and specificity.
logit5 <- glm(status ~ fico + dti, data = train, family = "binomial")
test$pred2 <- predict(logit5, test, type="response")
roc <- roc(test$status,test$pred2)
roc_curve2 <- data.frame(sens=roc$sensitivities,
spec=roc$specificities,
model="logit 5")
roc_curves <- bind_rows(roc_curves, roc_curve2)
ggplot(roc_curves, aes(x=spec, y=sens, color=model)) + geom_line()
Clearly, the logit4 algorithm performs better, as it has higher sensitivity and specificity across the whole range. This is not surprising given that logit4 includes all of the variables in logit5.
1. Let’s load the NHL data with cumulative wins information (nhl with cum wins.csv). Set the seed to 364. Create a train data set using games before March 15, 2016; the test data set should include games on or after March 15, 2016 (a sketch of this split appears after the questions). Thus, we will try to predict a lot more games than we did in the past, 194 to be precise.
2. Estimate a logit model using cumwins_net as the only explanatory variable. Is the effect of cumwins_net on the odds of the home team winning statistically significant? What is the magnitude of the effect?
3. Calculate the probability of the home team winning for every game on or after March 15, i.e. all the games in the test data set. The outcomes of which two games are you most confident predicting?
4. Make predictions for the outcomes (homewin) of the games in the test data. Use 0.5 as the cutoff probability for the home team winning or losing. Are you happy with the performance of these predictions? What is your kappa?
5. Can you change the probability cutoff and improve on your predictions? Explain your answer.
6. If the positive class is that the home team loses, what is your sensitivity? What probability cutoff would you choose to make sensitivity really high? What will happen to your specificity?
7. Plot the ROC curve for your logit model with cumwins_net as the only predictor.
8. Estimate an alternative logit model with cumwins_home and cumwins_visit as the only predictors. Plot the ROC curves for this model and for the model from question 7. Which model do you prefer, if any?
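For question 1, a minimal sketch of the date-based split, assuming the file has a column named date that read_csv() parses as a Date (check the actual variable names in the file; train_nhl and test_nhl are just illustrative names):
# question 1 sketch: split by date rather than at random
# `date` is an assumed column name -- replace it with the actual one in the file
nhl <- read_csv("nhl with cum wins.csv")
set.seed(364)
train_nhl <- filter(nhl, date <  as.Date("2016-03-15"))   # games before March 15, 2016
test_nhl  <- filter(nhl, date >= as.Date("2016-03-15"))   # games on or after March 15, 2016
nrow(test_nhl)  # should be 194 if the split matches the question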