glm()
handles categorical variablesIn lab 8 we mentioned a number of shortcomings of the kNN algorithm: it is a bit of a black box (we don’t know the influence of each feature), it tells us nothing about the reliability of predictions, and it can’t handle qualitative variables. In this lab we introduce a method that can do all these things: logistic regression. We already cleaned the Lending Club data so we will use it again.
library(tidyverse)
library(stargazer)
library(descr)
loan <- read_csv("lending_club_cleaned.csv")
Logistic regression is model where the dependent variable is takes only two values: 0 and 1. It estimates how the odds of the dependent variable being one depend on the values of independent variables. Function glm()
(part of base R) lets us estimate this type of model. GLM stands for generalized linear model. Its first argument is a formula in the Y ~ X1 + X2
format where Y
is the variable we try to predict (the dependent variable), and X1
, X2
are the predictor (independent) variables. The second argument specifies the data and in the third argument we specify we want a “binomial” model which tells gml()
to fit a logistic function to the data. The dependent variable (status
) needs to be “factor” with two levels (“bad” and “good”). gml()
will treat the first level as 0 and the second level as 1. Thus, we are estimating the probability of a loan being “good”.
loan$status <- as.factor(loan$status)
logit1 <- glm(status ~ fico, data = loan, family = "binomial")
summary(logit1)
##
## Call:
## glm(formula = status ~ fico, family = "binomial", data = loan)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5172 0.4078 0.5306 0.6294 0.9622
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.033068 0.296922 -23.69 <2e-16 ***
## fico 0.012358 0.000421 29.35 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 35928 on 42534 degrees of freedom
## Residual deviance: 34985 on 42533 degrees of freedom
## AIC: 34989
##
## Number of Fisher Scoring iterations: 5
The output of the logistic regression is similar to the output of a standard linear regression. We see coefficients, their standard errors, p-values, etc. We see that fico
score has a positive and statistically significant effect on the odds of a loan being good. Interpreting the magnitude of the effect is a bit complex, and we will tackle it below.
Before interpreting the logit coefficients, let’s clarify the difference between odds and probability. The likelihood of an event can be expressed as either odds of the event happening or the probability of the event happening. Odds is the ratio of the number of times an event happens relative to the number times it does not happen. Probability is the ratio of the number of times an event happens relative to the sum of the number of times it happens and does not happen. Table below illustrates the difference.
Event | Probability | Odds |
---|---|---|
getting tails in a coin toss | 1/2 | 1/1 |
rolling a six on a die | 1/6 | 1/5 |
rolling a number greater than 2 | 4/6 | 4/2 |
Probabilities ranges from 0 to 1 whereas odds range from 0 to infinity. In either case, the higher the probability/odds, the greater the likelihood of the event happening.
The coefficient on fico
tells us that a one point increase in fico
score raises the log of odds ratio by 0.01. Log of odds is rather hard to interpret, therefore, we often take the exponential of the coefficients.
exp(coef(logit1))
## (Intercept) fico
## 0.0008822209 1.0124345145
The exponential of the coefficient tells us the factor by which odds increase if the independent variable increases by one. So, if a fico
score increases by 1 point, the odds ratio of loan being good increases by factor of 1.012 or 1.2%. Here is the math behind this: Logistic model estimates the following equation: \(log(\tfrac{p}{1-p})=\beta_0 + \beta_1 \cdot X\). Taking exponent on both sides we get \(\tfrac{p}{1-p}=e^{\beta_0 + \beta_1 \cdot X}\). If \(X\) increases by one, the odds become \(\tfrac{p}{1-p}=e^{\beta_0} \cdot e^{\beta_1 \cdot x} \cdot e^{\beta_1}\). This is equal to the initial odds multiplied by \(e^{\beta_1}\). Thus, when \(X\) goes up by 1 odds increase by factor of \(e^{\beta_1}\).
IN-CLASS EXERCISE 1: What is the effect of debt to income ratio (dti) on the odds of a loan being good?
While the interpretation in terms of odds is relatively easy, sometimes interpretation in terms of probability is desired. However, calculating how a change in fico score affects the probability of a good loan is a bit complicated. The relationship is non-linear and therefore the effect depends on the value of fico score (and other variables, if they are included). Therefore, we need to specify the levels of fico
we want to see the effect of. For example, suppose we wanted to know the effect of fico
going from 700 to 750 on the probability of loan being good. We could plug these value into the estimated regression and calculate the probability associated with each value of fico
.
We can ask the computer to do this for us. Function predict()
can plug data into an estimated model and calculate predicted values. The first argument is an estimated model, the second argument contains data with columns that match the predictor columns in the model. In the case of a logit model, option type="response"
which tells predict
to return probabilities (rather than the log of odds).
test <- data.frame(fico=c(700,750))
test$pred <- predict(logit1,test, type="response")
test
## fico pred
## 1 700 0.8344391
## 2 750 0.9033761
The fico score going from 700 to 750 increases the probability of a good loan by seven percentage points.
IN-CLASS EXERCISE 2: What is the effect of fico going from 750-800?
Let’s add an additional explanatory/independent variable, loan_amnt
.
logit2 <- glm(status ~ fico + loan_amnt, data = loan, family = "binomial")
summary(logit2)
##
## Call:
## glm(formula = status ~ fico + loan_amnt, family = "binomial",
## data = loan)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6261 0.4011 0.5256 0.6261 0.9423
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.367e+00 3.007e-01 -24.50 <2e-16 ***
## fico 1.319e-02 4.306e-04 30.62 <2e-16 ***
## loan_amnt -2.229e-05 1.815e-06 -12.28 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 35928 on 42534 degrees of freedom
## Residual deviance: 34838 on 42532 degrees of freedom
## AIC: 34844
##
## Number of Fisher Scoring iterations: 5
exp(coef(logit2))
## (Intercept) fico loan_amnt
## 0.0006316636 1.0132732517 0.9999777084
It is important to keep in mind that once you include more than one independent variable, the interpretation of the coefficient on an independent variables is the effect of that independent variable holding all other independent variables constant. For example, holding loan amount constant, the effect of raising fico score by 1 point raises the odds of a good loan by 1.3 percent. This is a bit higher than when we did not control for loan amount, it suggests that fico and loan amount are related. Perhaps people with higher fico scores get approved for larger loans, but larger loans are also more likely to default.
Let’s see how gml()
handles categorical/factor variables. Let’s include purpose
into the model.
logit3 <- glm(status ~ fico + loan_amnt + purpose, data = loan, family = "binomial")
summary(logit3)
##
## Call:
## glm(formula = status ~ fico + loan_amnt + purpose, family = "binomial",
## data = loan)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6865 0.3882 0.5136 0.6193 1.2974
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.508e+00 3.030e-01 -24.780 < 2e-16 ***
## fico 1.356e-02 4.361e-04 31.093 < 2e-16 ***
## loan_amnt -2.311e-05 1.898e-06 -12.177 < 2e-16 ***
## purposeeducational -5.552e-01 1.233e-01 -4.503 6.69e-06 ***
## purposehome_improvement -1.404e-01 5.245e-02 -2.677 0.007418 **
## purposemajor_purchase 5.918e-02 5.651e-02 1.047 0.295023
## purposemedical -3.498e-01 1.005e-01 -3.481 0.000499 ***
## purposeother -3.347e-01 4.276e-02 -7.827 4.99e-15 ***
## purposesmall_business -9.380e-01 5.472e-02 -17.140 < 2e-16 ***
## purposevacation_wedding 1.160e-01 8.625e-02 1.345 0.178738
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 35928 on 42534 degrees of freedom
## Residual deviance: 34502 on 42525 degrees of freedom
## AIC: 34522
##
## Number of Fisher Scoring iterations: 5
Note that purpose
is a character variable with eight possible values. glm()
is showing us only seven coefficients associated with purpose. Why? Do you recall the term dummy variable trap? To see the effect of \(k\) categories you only need \(k-1\) dummies. The interpretation of the coefficient on each of the included dummies is the effect of that category relative to the ‘left out’ category. In our case the category that is left out is debt_consolidation
so the coefficients on eductional
, home_improvement
etc. are relative to debt_consolidation
loans. Let’s take an exponent of the coefficients so we can interpret the magnitude of these coefficients.
round(exp(coef(logit3)),3)
## (Intercept) fico loan_amnt
## 0.001 1.014 1.000
## purposeeducational purposehome_improvement purposemajor_purchase
## 0.574 0.869 1.061
## purposemedical purposeother purposesmall_business
## 0.705 0.716 0.391
## purposevacation_wedding
## 1.123
The coefficients show that the odds of a loan being good are lower for educational loans relative to debt consolidation loans by a factor of 0.57, or about 43% lower. As another example, let’s take vacation and wedding loans. These have shockingly 12% higher odds of being good than debt consolidation loans.
In many cases we would like to have control over which category is left out of the regression. By default the category that is left out of the regression is the first level of the factor. In this case it was debt_consolidation since it was first in the alphabetical order of the eight factor levels. It is typical to leave out the category with the largest number of observations. The largest category is normally a good reference to which other categories may be compared. (For example, we normally compare earnings of black, Hispanic and Asian workers to white workers - white being the left-out category.) Here debt consolidation happens to be the biggest category so we did not feel the need to change the order of the factor levels. If we needed to, we could have done it using function factor(factorname, levels =c("","",...))
or if we wanted to automatically order factor levels by the number of cases in each factor level we could have done something like this:
loan <- loan %>% group_by(purpose) %>% mutate(nobs=n())
loan$purpose <- reorder(loan$purpose, -loan$nobs)
levels(loan$purpose)
## [1] "debt_consolidation" "other" "major_purchase"
## [4] "home_improvement" "small_business" "vacation_wedding"
## [7] "medical" "educational"
The levels are no longer in alphabetical order. Instead, they are now ordered from the category with the largest number of observations (debt_consolidation) to the category with the smallest number of observations (educational).
Suppose we wanted to include income as an explanatory variable. Recall that income is verified in only about half of the cases and therefore about half of the values for this variable are missing.
logit4 <- glm(status ~ fico + loan_amnt + income + purpose, data = loan, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(logit4)
##
## Call:
## glm(formula = status ~ fico + loan_amnt + income + purpose, family = "binomial",
## data = loan)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.3659 0.3840 0.5224 0.6320 1.2589
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.482e+00 4.119e-01 -18.165 < 2e-16 ***
## fico 1.303e-02 5.906e-04 22.061 < 2e-16 ***
## loan_amnt -3.663e-05 2.580e-06 -14.200 < 2e-16 ***
## income 7.203e-06 5.426e-07 13.275 < 2e-16 ***
## purposeother -2.988e-01 6.034e-02 -4.952 7.34e-07 ***
## purposemajor_purchase 1.388e-02 7.689e-02 0.180 0.8568
## purposehome_improvement -1.077e-01 7.108e-02 -1.515 0.1298
## purposesmall_business -9.158e-01 7.016e-02 -13.052 < 2e-16 ***
## purposevacation_wedding 1.114e-01 1.126e-01 0.989 0.3226
## purposemedical -2.678e-01 1.426e-01 -1.878 0.0604 .
## purposeeducational -5.076e-01 2.309e-01 -2.198 0.0279 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 20592 on 23776 degrees of freedom
## Residual deviance: 19677 on 23766 degrees of freedom
## (18758 observations deleted due to missingness)
## AIC: 19699
##
## Number of Fisher Scoring iterations: 5
Logit omits observations/rows with missing values for any of the independent variables. Thus, the model above is estimated on only half of the observations in our data set. Perhaps most importantly, this model may not be used for predicting the status of loans that have missing income.
As we estimate several different regression models, it is common to present them in one compact table. Stargazer
can take our logit objects and summarize the results very nicely:
stargazer(logit1,logit2, logit3, logit4, type="text")
##
## =======================================================================
## Dependent variable:
## -----------------------------------------------
## status
## (1) (2) (3) (4)
## -----------------------------------------------------------------------
## fico 0.012*** 0.013*** 0.014*** 0.013***
## (0.0004) (0.0004) (0.0004) (0.001)
##
## loan_amnt -0.00002*** -0.00002*** -0.00004***
## (0.00000) (0.00000) (0.00000)
##
## income 0.00001***
## (0.00000)
##
## purposeeducational -0.555*** -0.508**
## (0.123) (0.231)
##
## purposehome_improvement -0.140*** -0.108
## (0.052) (0.071)
##
## purposemajor_purchase 0.059 0.014
## (0.057) (0.077)
##
## purposemedical -0.350*** -0.268*
## (0.100) (0.143)
##
## purposeother -0.335*** -0.299***
## (0.043) (0.060)
##
## purposesmall_business -0.938*** -0.916***
## (0.055) (0.070)
##
## purposevacation_wedding 0.116 0.111
## (0.086) (0.113)
##
## Constant -7.033*** -7.367*** -7.508*** -7.482***
## (0.297) (0.301) (0.303) (0.412)
##
## -----------------------------------------------------------------------
## Observations 42,535 42,535 42,535 23,777
## Log Likelihood -17,492.360 -17,419.180 -17,250.840 -9,838.745
## Akaike Inf. Crit. 34,988.710 34,844.370 34,521.690 19,699.490
## =======================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Survived
- Survival (0 = No; 1 = Yes)Pclass
- Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)Name
- NameSex
- SexAge
- Age (in years)SibSp
- Number of Siblings/Spouses AboardParch
- Number of Parents/Children AboardTicket
- Ticket NumberFare
- Passenger FareCabin
- CabinEmbarked
- Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)What are the odds of surviving the shipwreck?
Using the logit model, estimate how much lower are the odds of survival for men relative to women?
Controlling for gender, does age have a statistically significant effect on the odds of survival? If so, what is the magnitude of that effect?
Controlling for gender, does passenger class have a statistically significant effect on the odds of survival? If so, what is the magnitude of that effect?
Controlling for gender, estimate the effect of being in the second class relative to first class, and the effect of being in the third relative to first. (Hint: Turn Pclass
into a factor.)
Add fare
to the regression you estimated above. Is fare a significant determinant of survival controlling for gender and passenger class? Do you think that if we regressed survival on just gender and fare, fare would be significant? Explain.
As we know from the movie, Jack traveled in the third class and paid 5 pounds (I know that Jack actually won the ticket in poker, but Swen, from whom Jack won the ticket, paid …). Rose traveled in the first class and paid 500 for her ticket (I know that her fiancee, Cal Hockley - Pittsburgh steel tycoon, actually bought the ticket, but …). What is the probability that Jack will survive? What is the probability that Rose will survive?