Lab 9: Logistic Regression

Learning Objectives:

estimate a logistic regression
odds vs. probability
interpret coefficients (as they affect odds and probability)
understand how glm() handles categorical variables

1. Introduction

In lab 8 we mentioned a number of shortcomings of the kNN algorithm: it is a bit of a black box (we don’t know the influence of each feature), it tells us nothing about the reliability of predictions, and it can’t handle qualitative variables. In this lab we introduce a method that can do all these things: logistic regression. We already cleaned the Lending Club data so we will use it again.

library(tidyverse)
library(stargazer)
library(descr)
loan <- read_csv("lending_club_cleaned.csv")

2. Estimating logistic regression

Logistic regression is model where the dependent variable is takes only two values: 0 and 1. It estimates how the odds of the dependent variable being one depend on the values of independent variables. Function glm() (part of base R) lets us estimate this type of model. GLM stands for generalized linear model. Its first argument is a formula in the Y ~ X1 + X2 format where Y is the variable we try to predict (the dependent variable), and X1, X2 are the predictor (independent) variables. The second argument specifies the data and in the third argument we specify we want a “binomial” model which tells gml() to fit a logistic function to the data. The dependent variable (status) needs to be “factor” with two levels (“bad” and “good”). gml() will treat the first level as 0 and the second level as 1. Thus, we are estimating the probability of a loan being “good”.

loan$status <- as.factor(loan$status)
logit1 <- glm(status ~ fico, data = loan, family = "binomial")
summary(logit1)

## 
## Call:
## glm(formula = status ~ fico, family = "binomial", data = loan)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5172   0.4078   0.5306   0.6294   0.9622  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -7.033068   0.296922  -23.69   <2e-16 ***
## fico         0.012358   0.000421   29.35   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 35928  on 42534  degrees of freedom
## Residual deviance: 34985  on 42533  degrees of freedom
## AIC: 34989
## 
## Number of Fisher Scoring iterations: 5

The output of the logistic regression is similar to the output of a standard linear regression. We see coefficients, their standard errors, p-values, etc. We see that fico score has a positive and statistically significant effect on the odds of a loan being good. Interpreting the magnitude of the effect is a bit complex, and we will tackle it below.

3. Odds vs. probability

Before interpreting the logit coefficients, let’s clarify the difference between odds and probability. The likelihood of an event can be expressed as either odds of the event happening or the probability of the event happening. Odds is the ratio of the number of times an event happens relative to the number times it does not happen. Probability is the ratio of the number of times an event happens relative to the sum of the number of times it happens and does not happen. Table below illustrates the difference.

Event	Probability	Odds
getting tails in a coin toss	1/2	1/1
rolling a six on a die	1/6	1/5
rolling a number greater than 2	4/6	4/2

Probabilities ranges from 0 to 1 whereas odds range from 0 to infinity. In either case, the higher the probability/odds, the greater the likelihood of the event happening.

4. Interpreting coefficients in logit regression I: odds ratio

The coefficient on fico tells us that a one point increase in fico score raises the log of odds ratio by 0.01. Log of odds is rather hard to interpret, therefore, we often take the exponential of the coefficients.

exp(coef(logit1))

##  (Intercept)         fico 
## 0.0008822209 1.0124345145

The exponential of the coefficient tells us the factor by which odds increase if the independent variable increases by one. So, if a fico score increases by 1 point, the odds ratio of loan being good increases by factor of 1.012 or 1.2%. Here is the math behind this: Logistic model estimates the following equation: \(log(\tfrac{p}{1-p})=\beta_0 + \beta_1 \cdot X\). Taking exponent on both sides we get \(\tfrac{p}{1-p}=e^{\beta_0 + \beta_1 \cdot X}\). If \(X\) increases by one, the odds become \(\tfrac{p}{1-p}=e^{\beta_0} \cdot e^{\beta_1 \cdot x} \cdot e^{\beta_1}\). This is equal to the initial odds multiplied by \(e^{\beta_1}\). Thus, when \(X\) goes up by 1 odds increase by factor of \(e^{\beta_1}\).

IN-CLASS EXERCISE 1: What is the effect of debt to income ratio (dti) on the odds of a loan being good?

5. Interpreting coefficients in logit regression II: probability

While the interpretation in terms of odds is relatively easy, sometimes interpretation in terms of probability is desired. However, calculating how a change in fico score affects the probability of a good loan is a bit complicated. The relationship is non-linear and therefore the effect depends on the value of fico score (and other variables, if they are included). Therefore, we need to specify the levels of fico we want to see the effect of. For example, suppose we wanted to know the effect of fico going from 700 to 750 on the probability of loan being good. We could plug these value into the estimated regression and calculate the probability associated with each value of fico.

We can ask the computer to do this for us. Function predict() can plug data into an estimated model and calculate predicted values. The first argument is an estimated model, the second argument contains data with columns that match the predictor columns in the model. In the case of a logit model, option type="response" which tells predict to return probabilities (rather than the log of odds).

test <- data.frame(fico=c(700,750))
test$pred <- predict(logit1,test, type="response")
test

##   fico      pred
## 1  700 0.8344391
## 2  750 0.9033761

The fico score going from 700 to 750 increases the probability of a good loan by seven percentage points.

IN-CLASS EXERCISE 2: What is the effect of fico going from 750-800?

6. Interpreting coefficients in a multiple regression

Let’s add an additional explanatory/independent variable, loan_amnt.

logit2 <- glm(status ~ fico + loan_amnt, data = loan, family = "binomial")
summary(logit2)

## 
## Call:
## glm(formula = status ~ fico + loan_amnt, family = "binomial", 
##     data = loan)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6261   0.4011   0.5256   0.6261   0.9423  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -7.367e+00  3.007e-01  -24.50   <2e-16 ***
## fico         1.319e-02  4.306e-04   30.62   <2e-16 ***
## loan_amnt   -2.229e-05  1.815e-06  -12.28   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 35928  on 42534  degrees of freedom
## Residual deviance: 34838  on 42532  degrees of freedom
## AIC: 34844
## 
## Number of Fisher Scoring iterations: 5

exp(coef(logit2))

##  (Intercept)         fico    loan_amnt 
## 0.0006316636 1.0132732517 0.9999777084

It is important to keep in mind that once you include more than one independent variable, the interpretation of the coefficient on an independent variables is the effect of that independent variable holding all other independent variables constant. For example, holding loan amount constant, the effect of raising fico score by 1 point raises the odds of a good loan by 1.3 percent. This is a bit higher than when we did not control for loan amount, it suggests that fico and loan amount are related. Perhaps people with higher fico scores get approved for larger loans, but larger loans are also more likely to default.

7. Working with categorical/factor variables

Let’s see how gml() handles categorical/factor variables. Let’s include purpose into the model.

logit3 <- glm(status ~ fico + loan_amnt + purpose, data = loan, family = "binomial")
summary(logit3)

## 
## Call:
## glm(formula = status ~ fico + loan_amnt + purpose, family = "binomial", 
##     data = loan)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6865   0.3882   0.5136   0.6193   1.2974  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             -7.508e+00  3.030e-01 -24.780  < 2e-16 ***
## fico                     1.356e-02  4.361e-04  31.093  < 2e-16 ***
## loan_amnt               -2.311e-05  1.898e-06 -12.177  < 2e-16 ***
## purposeeducational      -5.552e-01  1.233e-01  -4.503 6.69e-06 ***
## purposehome_improvement -1.404e-01  5.245e-02  -2.677 0.007418 ** 
## purposemajor_purchase    5.918e-02  5.651e-02   1.047 0.295023    
## purposemedical          -3.498e-01  1.005e-01  -3.481 0.000499 ***
## purposeother            -3.347e-01  4.276e-02  -7.827 4.99e-15 ***
## purposesmall_business   -9.380e-01  5.472e-02 -17.140  < 2e-16 ***
## purposevacation_wedding  1.160e-01  8.625e-02   1.345 0.178738    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 35928  on 42534  degrees of freedom
## Residual deviance: 34502  on 42525  degrees of freedom
## AIC: 34522
## 
## Number of Fisher Scoring iterations: 5

Note that purpose is a character variable with eight possible values. glm() is showing us only seven coefficients associated with purpose. Why? Do you recall the term dummy variable trap? To see the effect of \(k\) categories you only need \(k-1\) dummies. The interpretation of the coefficient on each of the included dummies is the effect of that category relative to the ‘left out’ category. In our case the category that is left out is debt_consolidation so the coefficients on eductional, home_improvement etc. are relative to debt_consolidation loans. Let’s take an exponent of the coefficients so we can interpret the magnitude of these coefficients.

round(exp(coef(logit3)),3)

##             (Intercept)                    fico               loan_amnt 
##                   0.001                   1.014                   1.000 
##      purposeeducational purposehome_improvement   purposemajor_purchase 
##                   0.574                   0.869                   1.061 
##          purposemedical            purposeother   purposesmall_business 
##                   0.705                   0.716                   0.391 
## purposevacation_wedding 
##                   1.123

The coefficients show that the odds of a loan being good are lower for educational loans relative to debt consolidation loans by a factor of 0.57, or about 43% lower. As another example, let’s take vacation and wedding loans. These have shockingly 12% higher odds of being good than debt consolidation loans.

In many cases we would like to have control over which category is left out of the regression. By default the category that is left out of the regression is the first level of the factor. In this case it was debt_consolidation since it was first in the alphabetical order of the eight factor levels. It is typical to leave out the category with the largest number of observations. The largest category is normally a good reference to which other categories may be compared. (For example, we normally compare earnings of black, Hispanic and Asian workers to white workers - white being the left-out category.) Here debt consolidation happens to be the biggest category so we did not feel the need to change the order of the factor levels. If we needed to, we could have done it using function factor(factorname, levels =c("","",...)) or if we wanted to automatically order factor levels by the number of cases in each factor level we could have done something like this:

loan <- loan %>% group_by(purpose) %>% mutate(nobs=n()) 
loan$purpose <-  reorder(loan$purpose, -loan$nobs)
levels(loan$purpose)

## [1] "debt_consolidation" "other"              "major_purchase"    
## [4] "home_improvement"   "small_business"     "vacation_wedding"  
## [7] "medical"            "educational"

The levels are no longer in alphabetical order. Instead, they are now ordered from the category with the largest number of observations (debt_consolidation) to the category with the smallest number of observations (educational).

8. Dealing with missing values

Suppose we wanted to include income as an explanatory variable. Recall that income is verified in only about half of the cases and therefore about half of the values for this variable are missing.

logit4 <- glm(status ~ fico + loan_amnt + income + purpose, data = loan, family = "binomial")

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

summary(logit4)

## 
## Call:
## glm(formula = status ~ fico + loan_amnt + income + purpose, family = "binomial", 
##     data = loan)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.3659   0.3840   0.5224   0.6320   1.2589  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             -7.482e+00  4.119e-01 -18.165  < 2e-16 ***
## fico                     1.303e-02  5.906e-04  22.061  < 2e-16 ***
## loan_amnt               -3.663e-05  2.580e-06 -14.200  < 2e-16 ***
## income                   7.203e-06  5.426e-07  13.275  < 2e-16 ***
## purposeother            -2.988e-01  6.034e-02  -4.952 7.34e-07 ***
## purposemajor_purchase    1.388e-02  7.689e-02   0.180   0.8568    
## purposehome_improvement -1.077e-01  7.108e-02  -1.515   0.1298    
## purposesmall_business   -9.158e-01  7.016e-02 -13.052  < 2e-16 ***
## purposevacation_wedding  1.114e-01  1.126e-01   0.989   0.3226    
## purposemedical          -2.678e-01  1.426e-01  -1.878   0.0604 .  
## purposeeducational      -5.076e-01  2.309e-01  -2.198   0.0279 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 20592  on 23776  degrees of freedom
## Residual deviance: 19677  on 23766  degrees of freedom
##   (18758 observations deleted due to missingness)
## AIC: 19699
## 
## Number of Fisher Scoring iterations: 5

Logit omits observations/rows with missing values for any of the independent variables. Thus, the model above is estimated on only half of the observations in our data set. Perhaps most importantly, this model may not be used for predicting the status of loans that have missing income.

9. Presenting regression models in a compact table

As we estimate several different regression models, it is common to present them in one compact table. Stargazer can take our logit objects and summarize the results very nicely:

stargazer(logit1,logit2, logit3, logit4, type="text")

## 
## =======================================================================
##                                       Dependent variable:              
##                         -----------------------------------------------
##                                             status                     
##                             (1)         (2)         (3)         (4)    
## -----------------------------------------------------------------------
## fico                     0.012***    0.013***    0.014***    0.013***  
##                          (0.0004)    (0.0004)    (0.0004)     (0.001)  
##                                                                        
## loan_amnt                           -0.00002*** -0.00002*** -0.00004***
##                                      (0.00000)   (0.00000)   (0.00000) 
##                                                                        
## income                                                      0.00001*** 
##                                                              (0.00000) 
##                                                                        
## purposeeducational                               -0.555***   -0.508**  
##                                                   (0.123)     (0.231)  
##                                                                        
## purposehome_improvement                          -0.140***    -0.108   
##                                                   (0.052)     (0.071)  
##                                                                        
## purposemajor_purchase                              0.059       0.014   
##                                                   (0.057)     (0.077)  
##                                                                        
## purposemedical                                   -0.350***    -0.268*  
##                                                   (0.100)     (0.143)  
##                                                                        
## purposeother                                     -0.335***   -0.299*** 
##                                                   (0.043)     (0.060)  
##                                                                        
## purposesmall_business                            -0.938***   -0.916*** 
##                                                   (0.055)     (0.070)  
##                                                                        
## purposevacation_wedding                            0.116       0.111   
##                                                   (0.086)     (0.113)  
##                                                                        
## Constant                 -7.033***   -7.367***   -7.508***   -7.482*** 
##                           (0.297)     (0.301)     (0.303)     (0.412)  
##                                                                        
## -----------------------------------------------------------------------
## Observations              42,535      42,535      42,535      23,777   
## Log Likelihood          -17,492.360 -17,419.180 -17,250.840 -9,838.745 
## Akaike Inf. Crit.       34,988.710  34,844.370  34,521.690  19,699.490 
## =======================================================================
## Note:                                       *p<0.1; **p<0.05; ***p<0.01

Exercises

These exercises are based on a famous kaggle competition in which competitors predict who lives and who dies on the Titanic. Download the data set from Nexus, and load it into R. It has the following variables:

Survived - Survival (0 = No; 1 = Yes)
Pclass - Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
Name - Name
Sex - Sex
Age - Age (in years)
SibSp - Number of Siblings/Spouses Aboard
Parch - Number of Parents/Children Aboard
Ticket - Ticket Number
Fare - Passenger Fare
Cabin - Cabin
Embarked - Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

What are the odds of surviving the shipwreck?
Using the logit model, estimate how much lower are the odds of survival for men relative to women?
Controlling for gender, does age have a statistically significant effect on the odds of survival? If so, what is the magnitude of that effect?
Controlling for gender, does passenger class have a statistically significant effect on the odds of survival? If so, what is the magnitude of that effect?
Controlling for gender, estimate the effect of being in the second class relative to first class, and the effect of being in the third relative to first. (Hint: Turn Pclass into a factor.)
Add fare to the regression you estimated above. Is fare a significant determinant of survival controlling for gender and passenger class? Do you think that if we regressed survival on just gender and fare, fare would be significant? Explain.
As we know from the movie, Jack traveled in the third class and paid 5 pounds (I know that Jack actually won the ticket in poker, but Swen, from whom Jack won the ticket, paid …). Rose traveled in the first class and paid 500 for her ticket (I know that her fiancee, Cal Hockley - Pittsburgh steel tycoon, actually bought the ticket, but …). What is the probability that Jack will survive? What is the probability that Rose will survive?