confusionMatrix()
The background reading for this lab is chapter 10 in Lantz.
In lab 9 we estimated a logistic model using all of the observations. This is appropriate in a “confirmatory” analysis where we test a particular hypothesis. In this class, however, we are interested in making predictions - using past data to predict the future. In order to test how well our predictions might work on future data, we always split our data into a train set and a test set. We then estimate our model on the train data and evaluate our predictions on the test data. Let’s use an 80/20 train/test split.
library(tidyverse)
library(descr)
loan <- read_csv("lending_club_cleaned.csv")
loan$status <- as.factor(loan$status)
set.seed(364)
sample <- sample(nrow(loan),floor(nrow(loan)*0.8))
train <- loan[sample,]
test <- loan[-sample,]
Let’s estimate the model with fico, dti, loan_amnt and purpose as predictors. The predict() function returns the probability that a loan is good.
logit4 <- glm(status ~ fico + dti + loan_amnt + purpose, data = train, family = "binomial")
test$pred <- predict(logit4, test, type="response")
head(test$pred)
## [1] 0.9044742 0.8220173 0.8638981 0.8550825 0.8062179 0.8413071
summary(test$pred)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4347 0.8107 0.8563 0.8498 0.8979 0.9719
We can see that the estimated probabilities of a loan being good range from 0.43 to 0.97 in the test data set. What probability cutoff should we use for classifying a loan as “good”? For now, let’s classify any loan that has more than 0.8 probability of being good as “good”; the rest are classified as “bad”.
test$status_pred80 <- ifelse(test$pred > 0.80, "good", "bad")
crosstab(test$status,test$status_pred80, prop.t=TRUE, plot=FALSE)
## Cell Contents
## |-------------------------|
## | Count |
## | Total Percent |
## |-------------------------|
##
## ====================================
## test$status_pred80
## test$status bad good Total
## ------------------------------------
## bad 427 818 1245
## 5.0% 9.6%
## ------------------------------------
## good 1265 5997 7262
## 14.9% 70.5%
## ------------------------------------
## Total 1692 6815 8507
## ====================================
We have an accuracy of 75.5% ((427 + 5997)/8507). We detect bad loans in 34% of cases, and we label good loans as good in 83% of cases.
In any classification problem, we make two types of errors: false positives (e.g. saying a loan is bad when in fact it is good) and false negatives (saying a loan is good when it is bad). Which class we designate as positive is arbitrary. In this case, we designated “bad” as positive. As in medical testing, we are testing whether a loan is bad: if our algorithm says the loan is bad, we say the test came out positive.
Similarly, there are two types of correct classifications: true positives (saying a loan is bad when it is bad) and true negatives (saying a loan is good when it is good). The rate at which we detect true positives is called sensitivity. It is calculated as the share of positive outcomes (bad loans) that were correctly classified as bad. Let’s calculate it using 0.8 as the probability cutoff.
\(sensitivity = \frac{TP}{TP+FN}=\frac{427}{1245}= 0.34\)
The rate at which we detect true negatives is called specificity. It is the share of negative outcomes (good loans) that were correctly classified as good.
\(specificity = \frac{TN}{TN+FP}=\frac{5997}{7262}=0.83\)
We correctly classified 34% of bad loans as bad, and we correctly classified 83% of good loans as good.
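As a quick check, these numbers can be computed directly from the cell counts in the crosstab above (the TP, FN, FP and TN objects below are just labels for this sketch):
# metrics at the 0.80 cutoff, computed from the crosstab cell counts
TP <- 427    # bad loans correctly classified as bad
FN <- 818    # bad loans classified as good
FP <- 1265   # good loans classified as bad
TN <- 5997   # good loans correctly classified as good
(TP + TN) / (TP + TN + FP + FN)   # accuracy, about 0.755
TP / (TP + FN)                    # sensitivity, about 0.34
TN / (TN + FP)                    # specificity, about 0.83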
Suppose we decide to use a stricter criterion for classifying a loan as good, say a 0.90 probability instead of 0.80.
test$status_pred90 <- ifelse(test$pred > 0.90, "good", "bad")
crosstab(test$status,test$status_pred90, prop.t=TRUE, plot=FALSE)
## Cell Contents
## |-------------------------|
## | Count |
## | Total Percent |
## |-------------------------|
##
## ====================================
## test$status_pred90
## test$status bad good Total
## ------------------------------------
## bad 1098 147 1245
## 12.9% 1.7%
## ------------------------------------
## good 5387 1875 7262
## 63.3% 22.0%
## ------------------------------------
## Total 6485 2022 8507
## ====================================
Do you think this is a better prediction? What happened to sensitivity and specificity?
\(sensitivity = \frac{TP}{TP+FN}=\frac{1098}{1245}= 0.88\)
\(specificity = \frac{TN}{TN+FP}=\frac{1875}{7262}=0.26\)
We classified 88% of bad loans correctly as bad – a huge increase in sensitivity, our ability to correctly detect bad loans. However, this came at the cost of specificity, which dropped from 83% to 26%. This is the fundamental trade-off between sensitivity and specificity. As we applied a more stringent criterion for a loan to be classified as good, we inevitably classified some good loans as bad.
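To see this trade-off across a whole range of cutoffs, a short sketch such as the one below computes sensitivity and specificity for several cutoffs (the grid of cutoffs is arbitrary, and the variable names are the ones already defined above):
# sensitivity and specificity at several cutoffs, to illustrate the trade-off
cutoffs <- c(0.70, 0.80, 0.85, 0.90, 0.95)
sapply(cutoffs, function(k) {
  pred_k <- ifelse(test$pred > k, "good", "bad")
  c(cutoff      = k,
    sensitivity = mean(pred_k[test$status == "bad"]  == "bad"),
    specificity = mean(pred_k[test$status == "good"] == "good"))
})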
confusionMatrix()
Where to strike the balance between sensitivity and specificity depends on the context of each problem. How many good loans are we willing to pass up in order to avoid financing one bad one? Is there a statistic that summarizes the quality of our predictions in one number? Yes, it is called kappa. Kappa measures how much better our accuracy is than the accuracy we would get by chance (i.e. assigning classes at random while keeping the predicted and actual class frequencies fixed). Kappa ranges from 0 (no better than chance) to 1 (perfect agreement). We will calculate kappa using the confusionMatrix() function from the caret package. The function takes the vector of class predictions as the first argument and the vector of actual classes as the second argument.
library(caret)
# predictions must be a factor with the same levels as the reference
confusionMatrix(factor(test$status_pred90, levels = levels(test$status)), test$status)
## Confusion Matrix and Statistics
##
## Reference
## Prediction bad good
## bad 1098 5387
## good 147 1875
##
## Accuracy : 0.3495
## 95% CI : (0.3393, 0.3597)
## No Information Rate : 0.8536
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0511
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8819
## Specificity : 0.2582
## Pos Pred Value : 0.1693
## Neg Pred Value : 0.9273
## Prevalence : 0.1464
## Detection Rate : 0.1291
## Detection Prevalence : 0.7623
## Balanced Accuracy : 0.5701
##
## 'Positive' Class : bad
##
We see that confusionMatrix calculates a number of performance metrics, including accuracy (35%), sensitivity, specificity, and kappa.
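To see where the kappa of roughly 0.05 comes from, here is a by-hand check using the standard kappa formula; the result should match the confusionMatrix() output above:
# reproduce kappa by hand from the confusion matrix above
tab <- table(test$status_pred90, test$status)      # predictions in rows, actual classes in columns
n   <- sum(tab)
acc     <- sum(diag(tab)) / n                       # observed accuracy, about 0.35
exp_acc <- sum(rowSums(tab) * colSums(tab)) / n^2   # accuracy expected by chance
(acc - exp_acc) / (1 - exp_acc)                     # kappa, about 0.05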
IN-CLASS EXERCISE:
1. Is kappa higher if we use 0.80 as a cutoff?
2. Is kappa higher if we use a logistic model with just fico and dti and 0.80 as a cutoff?
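A minimal sketch you could start from (it re-uses the objects defined above and coerces the predictions to factors so that confusionMatrix() accepts them; logit_fd and pred_fd are just illustrative names):
# question 1: kappa when we use the 0.80 cutoff
confusionMatrix(factor(test$status_pred80, levels = levels(test$status)), test$status)

# question 2: kappa for a model with only fico and dti, again at the 0.80 cutoff
logit_fd <- glm(status ~ fico + dti, data = train, family = "binomial")
pred_fd  <- ifelse(predict(logit_fd, test, type = "response") > 0.80, "good", "bad")
confusionMatrix(factor(pred_fd, levels = levels(test$status)), test$status)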
One way to visualize the trade-off between specificity and sensitivity is to plot their combinations. We know that as sensitivity rises, specificity will decline. The locus of these combinations is called an ROC curve. Before we plot it using the pROC package, answer these questions:
What would happen to sensitivity and specificity if we classified all loans as bad?
What would happen to sensitivity and specificity if we classified all loans as good?
library(pROC)
#creates an object with all sorts of diagnostics including sensitivities and specificities
roc <- roc(test$status,test$pred)
#we need to create a data frame that contains the sensitivities and specificities
roc_curves <- data.frame(sens=roc$sensitivities,
spec=roc$specificities,
model="logit 4")
ggplot(roc_curves, aes(x=spec, y=sens, color=model)) + geom_line()
Let’s plot the ROC curves for the logit4 and logit5 models in the same graph (logit5 is estimated below). This will let us compare each algorithm’s performance in terms of sensitivity and specificity.
logit5 <- glm(status ~ fico + dti, data = train, family = "binomial")
test$pred2 <- predict(logit5, test, type="response")
roc <- roc(test$status,test$pred2)
roc_curve2 <- data.frame(sens=roc$sensitivities,
spec=roc$specificities,
model="logit 5")
roc_curves <- bind_rows(roc_curves, roc_curve2)
ggplot(roc_curves, aes(x=spec, y=sens, color=model)) + geom_line()
Clearly, the logit4 algorithm performs better, as it has higher sensitivity and specificity across the whole range. This is not surprising given that logit4 includes all of the variables in logit5.
1. Let’s load the NHL data with cumulative wins information (nhl with cum wins.csv). Set the seed to 364. Create a train data set using games before March 15, 2016; the test data set should include games on or after March 15, 2016 (a sketch of this split appears after the questions). Thus, we will try to predict a lot more games than we did in the past, 194 to be precise.
2. Estimate a logit model using cumwins_net as the only explanatory variable. Is the effect of cumwins_net on the odds of the home team winning statistically significant? What is the magnitude of the effect?
3. Calculate the probability of the home team winning for every game on or after March 15, i.e. all the games in the test data set. The outcomes of which two games are you most confident predicting?
4. Make predictions for the outcomes (homewin) of the games in the test data. Use 0.5 as the cutoff probability for the home team winning or losing. Are you happy with the performance of these predictions? What is your kappa?
5. Can you change the probability cutoff and improve on your predictions? Explain your answer.
6. If the positive class is that the home team loses, what is your sensitivity? What probability cutoff would you choose to make sensitivity really high? What will happen to your specificity?
7. Plot the ROC curve for your logit model with cumwins_net as the only predictor.
8. Estimate an alternative logit model with cumwins_home and cumwins_visit as the only predictors. Plot the ROC curves for this model and for the model from question 7. Which model do you prefer, if any?
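For question 1, a minimal sketch of the date-based split, assuming the file has a column named date that read_csv() parses as a Date (check the actual variable names in the file; train_nhl and test_nhl are just illustrative names):
# question 1 sketch: split by date rather than at random
# `date` is an assumed column name -- replace it with the actual one in the file
nhl <- read_csv("nhl with cum wins.csv")
set.seed(364)
train_nhl <- filter(nhl, date <  as.Date("2016-03-15"))   # games before March 15, 2016
test_nhl  <- filter(nhl, date >= as.Date("2016-03-15"))   # games on or after March 15, 2016
nrow(test_nhl)  # should be 194 if the split matches the question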