- predicting with logistic regression
- two types of errors, two types of correct classification (*sensitivity* and *specificity*)
- fundamental trade-off between sensitivity and specificity
- kappa and `confusionMatrix()`
- ROC curves

The background reading for this lab is chapter 10 in Lantz.

In lab 9 we estimated a logistic model using all of the observations. This is appropriate in a “confirmatory” analysis where we test a particular hypothesis. In this class, however, we are interested in making predictions: using past data to predict the future. To test how well our predictions might work on future data, we always split our data into *train* and *test* sets. We then estimate our model on the *train* data and evaluate our predictions on the *test* data. Let’s use an 80/20 *train/test* split.

```
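library(tidyverse)
library(descr)   # provides crosstab()
loan <- read_csv("lending_club_cleaned.csv")
loan$status <- as.factor(loan$status)
set.seed(364)    # make the random split reproducible
# draw 80% of the row indices for training; the remaining 20% form the test set
train_idx <- sample(nrow(loan), floor(nrow(loan) * 0.8))
train <- loan[train_idx, ]
test <- loan[-train_idx, ]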
```

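Let’s estimate the model with `fico`, `dti`, `loan_amnt` and `purpose` as predictors. With `type="response"`, the `predict()` function returns the probability that a loan is good. Because `status` is a factor with levels `bad` and `good`, `glm()` treats the second level, `good`, as the outcome being modeled.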

```
logit4 <- glm(status ~ fico + dti+ loan_amnt + purpose, data = train, family = "binomial")
test$pred <- predict(logit4, test, type="response")
head(test$pred)
```

`## [1] 0.9044742 0.8220173 0.8638981 0.8550825 0.8062179 0.8413071`

`summary(test$pred)`

```
##     Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   0.4347  0.8107  0.8563  0.8498  0.8979  0.9719
```

We can see that the estimated probabilities of a loan being good range from 0.43 to 0.97 in the test data set. What probability cutoff should we use for classifying a loan as “good”? For now, let’s classify any loan that has more than 0.8 probability of being good as “good”, the rest is classified as “bad”.

```
test$status_pred80 <- ifelse(test$pred > 0.80, "good", "bad")
crosstab(test$status,test$status_pred80, prop.t=TRUE, plot=FALSE)
```

```
##    Cell Contents
## |-------------------------|
## |                   Count |
## |           Total Percent |
## |-------------------------|
##
## ====================================
##             test$status_pred80
## test$status      bad    good   Total
## ------------------------------------
## bad              427     818    1245
##                 5.0%    9.6%
## ------------------------------------
## good            1265    5997    7262
##                14.9%   70.5%
## ------------------------------------
## Total           1692    6815    8507
## ====================================
```

Our accuracy is 75.5%: (427 + 5997) of the 8507 loans are classified correctly. We detect bad loans in 34% of cases, and we label good loans as good in 83% of cases.

In any classification problem, we make two types of *errors*: false positives (e.g. saying a loan is bad when in fact it is good) and false negatives (saying a loan is good when it is bad). It is arbitrary which class we designate as *positive*. In this case, we designated “bad” as *positive*. As in the medical field, we are testing whether a loan is bad. If our algorithm says the loan is bad, we say the test came out positive.

Similarly, there are two types of *correct* classifications: true positive (saying a loan is bad when it is bad) and true negative (saying a loan is good when it is good). The rate at which we detect true positive is called *sensitivity*. It is calculated as the share of positive outcomes (bad loans) that were correctly classified as bad. Let’s calculate it using our 0.8 probability as a cutoff.

\(sensitivity = \frac{TP}{TP+FN}=\frac{427}{1245}= 0.34\)

The rate at which we detect true negatives is called *specificity*. It is the share of negative outcomes (good loans) that were correctly classified as good.

\(specificity = \frac{TN}{TN+FP}=\frac{5997}{7262}=0.83\)

We correctly classified 34% of bad loans as bad, and we correctly classified 83% of good loans as good.

Suppose we decide to use a stricter criterion for classifying a loan as good, say 0.90 probability instead of 0.80.

```
test$status_pred90 <- ifelse(test$pred > 0.90, "good", "bad")
crosstab(test$status,test$status_pred90, prop.t=TRUE, plot=FALSE)
```

```
##    Cell Contents
## |-------------------------|
## |                   Count |
## |           Total Percent |
## |-------------------------|
##
## ====================================
##             test$status_pred90
## test$status      bad    good   Total
## ------------------------------------
## bad             1098     147    1245
##                12.9%    1.7%
## ------------------------------------
## good            5387    1875    7262
##                63.3%   22.0%
## ------------------------------------
## Total           6485    2022    8507
## ====================================
```

Do you think this is a better prediction? What happened to sensitivity and specificity?

\(sensitivity = \frac{TP}{TP+FN}=\frac{1098}{1245}= 0.88\)

\(specificity = \frac{TN}{TN+FP}=\frac{1875}{7262}=0.26\)

We classified 88% of bad loans correctly as bad, a huge increase in sensitivity, our ability to correctly detect bad loans. However, this came at the cost of reducing our specificity, which dropped from 83% to 26%. This is the fundamental trade-off between sensitivity and specificity. As we applied a more stringent criterion for a loan to be good, we inevitably classified some good loans as bad.
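The comparison between the two cutoffs can be reproduced from the crosstabs alone. The following is a minimal sketch in base R: the data frame `cutoff_counts` and its column names are made up for illustration, and the counts are copied by hand from the two tables above with “bad” as the positive class.

```r
# Cell counts copied from the two crosstabs above ("bad" is the positive class).
# `cutoff_counts` and its column names are illustrative, not part of the lab code.
cutoff_counts <- data.frame(
  cutoff = c(0.80, 0.90),
  TP = c(427, 1098),    # bad loans classified as bad
  FN = c(818, 147),     # bad loans classified as good
  FP = c(1265, 5387),   # good loans classified as bad
  TN = c(5997, 1875)    # good loans classified as good
)

# Sensitivity: share of bad loans caught. Specificity: share of good loans passed.
cutoff_counts$sensitivity <- with(cutoff_counts, TP / (TP + FN))
cutoff_counts$specificity <- with(cutoff_counts, TN / (TN + FP))
cutoff_counts$accuracy    <- with(cutoff_counts, (TP + TN) / (TP + FN + FP + TN))

print(round(cutoff_counts[, c("cutoff", "sensitivity", "specificity", "accuracy")], 3))
```

Note that accuracy falls from about 75.5% to about 35% at the stricter cutoff, because most loans are good and most good loans are now classified as bad.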

Kappa and `confusionMatrix()`

Where to strike the balance between sensitivity and specificity depends on the context of each problem. How many good loans are we willing to pass up in order to avoid financing one bad one? Is there a statistic that summarizes the quality of our predictions in one number? Yes, it is called *kappa*. *Kappa* measures how much better our accuracy is than the accuracy we would expect by chance (i.e. by randomly assigning classes while keeping the share of each predicted class and of each actual class unchanged). A kappa of 1 means perfect agreement between predictions and actual classes, 0 means we do no better than chance, and negative values mean we do worse than chance. We will calculate kappa using the function `confusionMatrix()` from the `caret` package. The function takes the vector of class predictions as the *first* argument and the vector of actual classes as the *second* argument.

```
library(caret)
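# recent caret versions require factors with the same levels; status_pred90 is character
confusionMatrix(factor(test$status_pred90, levels = levels(test$status)), test$status)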
```

```
## Confusion Matrix and Statistics
##
##           Reference
## Prediction  bad good
##       bad  1098 5387
##       good  147 1875
##
## Accuracy : 0.3495
## 95% CI : (0.3393, 0.3597)
## No Information Rate : 0.8536
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0511
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8819
## Specificity : 0.2582
## Pos Pred Value : 0.1693
## Neg Pred Value : 0.9273
## Prevalence : 0.1464
## Detection Rate : 0.1291
## Detection Prevalence : 0.7623
## Balanced Accuracy : 0.5701
##
## 'Positive' Class : bad
##
```

We see that `confusionMatrix()` calculates a number of performance metrics, including accuracy (35%), sensitivity, specificity, and kappa.
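The kappa of 0.0511 reported above can be checked by hand, assuming it is the usual unweighted Cohen’s kappa, \(\kappa = \frac{p_o - p_e}{1 - p_e}\), where \(p_o\) is the observed accuracy and \(p_e\) is the accuracy expected by chance given the row and column totals. The sketch below uses base R only; the names `cm`, `p_observed`, `p_expected`, and `kappa_hat` are illustrative.

```r
# Confusion matrix from the output above: rows = prediction, columns = reference.
# `cm`, `p_observed`, `p_expected`, and `kappa_hat` are illustrative names.
cm <- matrix(c(1098, 147, 5387, 1875), nrow = 2,
             dimnames = list(Prediction = c("bad", "good"),
                             Reference  = c("bad", "good")))
n <- sum(cm)

# Observed agreement: share of loans on the diagonal (this is the accuracy).
p_observed <- sum(diag(cm)) / n

# Chance agreement: for each class, P(predicted as class) * P(actually class).
p_expected <- sum(rowSums(cm) * colSums(cm)) / n^2

# Kappa: improvement over chance as a share of the maximum possible improvement.
kappa_hat <- (p_observed - p_expected) / (1 - p_expected)
print(round(c(accuracy = p_observed, expected = p_expected, kappa = kappa_hat), 4))
```

Here the chance accuracy is about 0.31, so an observed accuracy of about 0.35 leaves very little improvement over chance, which is why kappa is close to zero.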

IN-CLASS EXERCISE:

1. Is kappa higher if we use 0.80 as a cutoff?

2. Is kappa higher if we use a logistic model with just `fico` and `dti` and 0.80 as a cutoff?

One way to visualize the trade-off between specificity and sensitivity is to plot their combinations. We know that as sensitivity rises, specificity will decline. The locus of these combinations is called an ROC curve. Before we plot it using the `pROC` package, answer these questions:

What would happen to sensitivity and specificity if we classify all loans as bad?

What would happen to sensitivity and specificity if we classify all loans as good?

```
library(pROC)
# roc() returns an object with many diagnostics, including sensitivities and specificities
# (stored as roc4 so that the object does not mask the roc() function)
roc4 <- roc(test$status, test$pred)
# collect the sensitivities and specificities in a data frame for ggplot
roc_curves <- data.frame(sens = roc4$sensitivities,
                         spec = roc4$specificities,
                         model = "logit 4")
ggplot(roc_curves, aes(x = spec, y = sens, color = model)) + geom_line()
```
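To see where the points on an ROC curve come from, the sketch below builds one by hand in base R: it sweeps the cutoff and records sensitivity and specificity at each value. This is only meant to illustrate the idea, not how `pROC` works internally. The data are simulated (they are not the Lending Club loans), the helper `sens_spec()` is hypothetical, and “bad” is treated as the positive class, as in the text.

```r
set.seed(364)

# Simulated data (an assumption for illustration, not the loan data):
# good loans tend to receive higher predicted probabilities of being good.
actual <- factor(rep(c("bad", "good"), times = c(200, 800)))
p_good <- c(rbeta(200, 6, 3), rbeta(800, 8, 2))

# Hypothetical helper: classify as "good" above the cutoff and "bad" otherwise,
# then compute both rates with "bad" as the positive class.
sens_spec <- function(cutoff, prob, actual) {
  pred_bad <- prob <= cutoff
  c(cutoff = cutoff,
    sens = mean(pred_bad[actual == "bad"]),     # bad loans flagged as bad
    spec = mean(!pred_bad[actual == "good"]))   # good loans passed as good
}

# Sweep the cutoff from 0 to 1; each row is one point on the ROC curve.
roc_by_hand <- as.data.frame(t(sapply(seq(0, 1, by = 0.05), sens_spec,
                                      prob = p_good, actual = actual)))
head(roc_by_hand)
```

Plotting `sens` against `spec` from `roc_by_hand` with `geom_line()` gives the same kind of curve, only on a coarser grid of cutoffs.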

Let’s plot ROC curves for the `logit4` and `logit5` models in the same graph. We will be able to see each algorithm’s performance in terms of sensitivity and specificity.

```
# smaller model with only fico and dti as predictors
logit5 <- glm(status ~ fico + dti, data = train, family = "binomial")
test$pred2 <- predict(logit5, test, type = "response")
roc5 <- roc(test$status, test$pred2)
roc_curve2 <- data.frame(sens = roc5$sensitivities,
                         spec = roc5$specificities,
                         model = "logit 5")
# stack both curves so that ggplot can color them by model
roc_curves <- bind_rows(roc_curves, roc_curve2)
ggplot(roc_curves, aes(x = spec, y = sens, color = model)) + geom_line()
```

Clearly, the `logit4` model performs better, as it has higher sensitivity *and* specificity across the whole range. This is not surprising given that `logit4` has all of the variables included in `logit5`.

1. Let’s load the NHL data with cumulative wins information (`nhl with cum wins.csv`). Set seed to 364. Create the train data set using games before March 15, 2016, and the test data set using games on or after March 15. Thus, we will try to predict a lot more games than we did in the past, 194 to be precise.

2. Estimate a logit model using `cumwins_net` as the only explanatory variable. Is the effect of `cumwins_net` on the odds of the home team winning statistically significant? What is the magnitude of the effect?

3. Calculate the probability of the home team winning for every game on or after March 15, i.e. all the games in the test data set. Which two games’ outcomes are you most confident predicting?

4. Make predictions for the *outcomes* (`homewin`) of the games in the test data. Use 0.5 as the cutoff probability for the home team winning or losing. Are you happy with the performance of these predictions? What is your kappa?

5. Can you change the probability cutoff and improve on your predictions? Explain your answer.

6. If the `positive` class is that the home team loses, what is your sensitivity? What probability cutoff would you choose to make sensitivity really high? What will happen to your specificity?

7. Plot the ROC curve for your logit model with `cumwins_net` as the only predictor.

8. Estimate an alternative logit model with `cumwins_home` and `cumwins_visit` as the only predictors. Plot the ROC curves for this model and the model from question 7. Which model do you prefer, if any?