This article has been published on Towards Data Science - this version includes the relevant R code.

If you are reading this, then you have probably tried to predict who will survive the Titanic shipwreck. This Kaggle competition is a canonical example of machine learning, and a rite of passage for any aspiring data scientist. What if, instead of predicting who will survive, you only had to predict how many will survive? Or what if you had to predict the average age of survivors, or the sum of the fare that the survivors paid?

There are many applications where classification predictions need to be aggregated. For example, a customer churn model may generate probabilities that a customer will churn, but the business may be interested in how many customers are predicted to churn, or how much revenue will be lost. Similarly, a model may give a probability that a flight will be delayed, but we may want to know how many flights will be delayed, or how many passengers are affected. Hong (2013) lists a number of other examples from actuarial assessment to warranty claims.

Most binary classification algorithms estimate probabilities that examples belong to the positive class. If we treat these probabilities as known values (rather than estimates), then the number of positive cases is a random variable with a Poisson Binomial probability distribution. (If the probabilities were all the same, the distribution would be Binomial.) Similarly, the sum of two-value random variables where one value is zero and the other value some other number (e.g. age, revenue) is distributed as a Generalized Poisson Binomial. Under these assumptions we can report mean values as well as prediction intervals. In summary, if we had the true classification probabilities, then we could construct the probability distributions of any aggregate outcome (number of survivors, age, revenue, etc.).
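
As a minimal illustration with made-up probabilities, simulating the sum of independent Bernoulli draws recovers the Poisson Binomial mean:

#toy example: the sum of Bernoulli draws with unequal probabilities
#is Poisson Binomial; its mean is the sum of the probabilities
p <- c(0.9, 0.5, 0.2, 0.7) #made-up probabilities
draws <- replicate(10000, sum(rbinom(length(p), 1, p)))
mean(draws) #should be close to sum(p) = 2.3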

Of course, the classification probabilities we obtain from machine learning models are just estimates. Therefore, treating the probabilities as known values may not be appropriate. (Essentially, we would be ignoring the sampling error in estimating these probabilities.) However, if we are interested only in the aggregate characteristics of survivors, perhaps we should focus on estimating parameters that describe the probability distributions of these aggregate characteristics. In other words, we should recognize that we have a numerical prediction problem rather than a classification problem.

I compare two approaches to getting aggregate characteristics of Titanic survivors. The first is to classify and then aggregate. I estimate three popular classification models and then aggregate the resulting probabilities. The second approach is a regression model to estimate how aggregate characteristics of a group of passengers affect the share that survives. I evaluate each approach using many random splits of test and train data. The conclusion is that many classification models do poorly when the classification probabilities are aggregated.

1 Classify and aggregate approach

Let's use the Titanic data to estimate three different classifiers. The logistic model will use only age and passenger class as predictors; Random Forest and XGBoost will also use sex. I train the models on the 891 passengers in Kaggle's training data and evaluate the predictions on the 418 passengers in the test data. (I obtained the labels for the test set to be able to evaluate my models.)

library(tidyverse)
library(titanic)
library(randomForest)
library(xgboost)
library(pROC)
library(poibin)
library(GPB)
library(modelr)

#titanic3 comes from: http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.csv and has all 1309 passengers labeled
#I match them to the Kaggle test set using name and ticket number; quotes (") must be stripped because they appear in odd places in the test set
all <- read_csv("titanic3.csv") %>%
  mutate(name=str_trim(gsub('"','',name))) %>%
  rename(Name=name, Ticket=ticket, Survived=survived)

train <- titanic_train
test <- titanic_test  %>%
  mutate(Name=str_trim(gsub('"','',Name))) #get rid of quotes so that names can match to the labeled data
test <- inner_join(test, select(all, Survived, Name, Ticket), by=c("Name", "Ticket")) 

train <- train %>%
  mutate(Age =  ifelse(is.na(Age), mean(Age,na.rm = TRUE),Age), #fill in missing age
         sex_01 = ifelse(Sex=="male",1,0), #need numerical for XGBoost
         Fare = ifelse(is.na(Fare), mean(Fare, na.rm = TRUE),Fare)) #fill in missing fare
test <- test %>%
  mutate(Age =  ifelse(is.na(Age), mean(Age,na.rm = TRUE),Age),
         sex_01 = ifelse(Sex=="male",1,0),
         Fare = ifelse(is.na(Fare), mean(Fare, na.rm = TRUE),Fare)) 
set.seed(100)
log_model <- glm(as.factor(Survived) ~ Pclass + Age, train, family = "binomial")
rf_model <- randomForest(as.factor(Survived) ~ Pclass + Age + Sex, train)
xgb_model <- xgboost(data = as.matrix(select(train, Pclass, sex_01, Age)),
                  label = as.matrix(select(train, Survived)),
                  nrounds = 5, objective = "binary:logistic", verbose = 0)

log_probs <- predict(log_model, test, type = "response")
rf_probs <- predict(rf_model, test, type="prob")[,2]
xgb_probs <- predict(xgb_model, as.matrix(select(test,Pclass,sex_01, Age)))
auc <- data.frame(log_model=roc(test$Survived,log_probs)[['auc']],
                  rf_model=roc(test$Survived,rf_probs)[['auc']],
                  xgb_model=roc(test$Survived,xgb_probs)[['auc']])
auc
##   log_model  rf_model xgb_model
## 1 0.6661271 0.8001947 0.8084713

The logistic model with only age and passenger class as predictors has an AUC of 0.66, but the Random Forest and XGBoost models that also use sex reach a very respectable AUC of around 0.8. However, classifying individual passengers is not our problem; our problem is to predict how many passengers will survive. We can estimate this by adding up the probabilities that each passenger survives. Interestingly, of the three classifiers, the logistic model came closest to the actual number of survivors despite having the lowest AUC. It is also worth noting that a naive estimate of the number of survivors, based on the share of survivors in the training data, did best of all.

#predictions from classification probabilities
predictions <- data.frame(log_model=round(sum(log_probs)),
                          rf_model=round(sum(rf_probs)),
                          xgb_model=round(sum(xgb_probs)), 
                          naive=round(mean(train$Survived)*nrow(test)),
                          actual=sum(test$Survived))
predictions
##   log_model rf_model xgb_model naive actual
## 1       162      131       168   160    158

Given the probabilities of survival for each passenger in the test set, the number of passengers that will survive is a random variable with a Poisson Binomial distribution. The mean of this random variable is the sum of the individual probabilities. The percentiles of this distribution can be obtained using the poibin package (developed by Hong (2013)); a similar Python package is under development. The percentiles can also be obtained by brute force, simulating 1,000 different sets of outcomes for the 418 passengers in the test set. The percentiles can be interpreted as prediction intervals telling us that the actual number of survivors will fall within the interval with, say, 95% probability. The interval based on the Random Forest probabilities widely missed the actual number of survivors. It is worth noting that the width of the interval does not reflect the accuracy of the individual predictions. Instead, it depends on how far the individual probabilities are from 0.5: the variance of the Poisson Binomial is the sum of p(1-p) across passengers, so probabilities close to 0.9 or 0.1 rather than 0.5 imply much less uncertainty about how many passengers will actually survive.

#prediction interval for the number of survivors
naive_mean <- mean(train$Survived)
naive_se <- sqrt(naive_mean*(1-naive_mean)/nrow(train)) #standard error of a proportion
intervals <- data.frame(log_model=qpoibin(c(0.025,0.975),log_probs),
                        rf_model=qpoibin(c(0.025,0.975),rf_probs),
                        xgb_model=qpoibin(c(0.025,0.975),xgb_probs),
                        naive = round(c((naive_mean-1.96*naive_se)*nrow(test),(naive_mean+1.96*naive_se)*nrow(test))))
rownames(intervals) <- c("lower", "upper")
intervals
##       log_model rf_model xgb_model naive
## lower       144      120       152   147
## upper       180      142       185   174
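
As mentioned above, the percentiles can also be approximated by brute force; a minimal sketch using the XGBoost probabilities (the simulation details are my own, not the original code):

#brute-force alternative to qpoibin: simulate 1,000 survivor counts
#and take the empirical 2.5th and 97.5th percentiles
set.seed(100)
sims <- replicate(1000, sum(rbinom(length(xgb_probs), 1, xgb_probs)))
quantile(sims, c(0.025, 0.975))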

While the number of survivors is a sum of zero/one random variables (Bernoulli trials), we may also be interested in predicting other aggregate characteristics of the survivors, e.g. the total fare paid by the survivors. This measure is a sum of two-value random variables where one value is zero (the passenger did not survive) and the other is the fare that the passenger paid. Zhang, Hong and Balakrishnan (2018) call the probability distribution of this sum the Generalized Poisson Binomial. As with the Poisson Binomial, Hong co-wrote a package, GPB, that makes computing the distribution straightforward. Once again, simulating the distribution is an alternative to using the package to compute the percentiles.

#prediction interval for the total fare paid by survivors
naive_mean <- mean(train$Survived*train$Fare)
naive_se <- sd(train$Survived*train$Fare)/sqrt(nrow(train))
aval <- rep(0,length(xgb_probs))
bval <- round(test$Fare)
intervals <- data.frame(log_model=qgpb(c(0.025, 0.975), log_probs, aval=aval, bval = bval),
                        rf_model=qgpb(c(0.025, 0.975), rf_probs, aval=aval, bval = bval),
                        xgb_model=qgpb(c(0.025, 0.975), xgb_probs, aval=aval, bval = bval),
                        naive=round(c((naive_mean-1.96*naive_se)*nrow(test),(naive_mean+1.96*naive_se)*nrow(test))))
rownames(intervals) <- c("lower", "upper")
intervals
##       log_model rf_model xgb_model naive
## lower      6448     7139      7167  6462
## upper      8964     8449      9287  9068
#actual
sum(test$Survived*test$Fare)
## [1] 8129.362
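
The simulation alternative mentioned above works here too; a sketch that weights each simulated survivor by their fare (again my own illustration, not the author's code):

#simulation alternative to qgpb: draw survival indicators and
#sum the fares of the simulated survivors
set.seed(100)
fare_sims <- replicate(1000, sum(rbinom(length(xgb_probs), 1, xgb_probs)*test$Fare))
quantile(fare_sims, c(0.025, 0.975))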

2 Aggregate regression approach

If we only care about the aggregate characteristics of survivors, then we really have a numerical prediction problem. The simplest estimate of the share of survivors in the test set is the share of survivors in the training set — it is the naive estimate from the previous section. This estimate is probably unbiased and efficient if the characteristics of passengers in the test and train sets are identical. If not, then we would want an estimate of the share of survivors conditional on the characteristics of the passengers.

The issue is that we don't have the data to estimate how the aggregate characteristics of a group of passengers affect the share that survives. After all, the Titanic hit the iceberg only once. In other applications, such as customer churn, we may get new data every month.

In the Titanic case, I resort to simulating many different training data sets by re-sampling the original training data. I calculate the average characteristics of each simulated data set and estimate how these characteristics affect the share that survives. I then take the average characteristics of passengers in the test set and predict how many will survive. There are many ways one could summarize the aggregate characteristics; I use the share of passengers in first class, the share of passengers under the age of 10, and the share of female passengers.

#resample the training data 500 times; for each sample, record the share
#of survivors and the average passenger characteristics
profiles <- data.frame(x=1:500)
for (i in 1:500){
  train_sample <- as.data.frame(resample_bootstrap(train))
  profiles$share_survived[i] <-mean(train_sample$Survived)
  profiles$share_women[i] <- mean(ifelse(train_sample$Sex=="female",1,0))
  profiles$share_1st_class[i] <- mean(ifelse(train_sample$Pclass==1,1,0))
  profiles$share_age_under10[i] <- mean(ifelse(train_sample$Age<10,1,0))
}
#estimate the relationship between share of survived and average passenger characteristics
reg_model <- lm(share_survived ~  share_women + share_1st_class + share_age_under10, profiles)
summary(reg_model)
## 
## Call:
## lm(formula = share_survived ~ share_women + share_1st_class + 
##     share_age_under10, data = profiles)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.039070 -0.008951 -0.000616  0.009091  0.034919 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        0.12906    0.01655   7.797 3.77e-14 ***
## share_women        0.52398    0.03834  13.666  < 2e-16 ***
## share_1st_class    0.21975    0.04275   5.140 3.95e-07 ***
## share_age_under10  0.25890    0.06790   3.813 0.000155 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01335 on 496 degrees of freedom
## Multiple R-squared:  0.3403, Adjusted R-squared:  0.3363 
## F-statistic: 85.29 on 3 and 496 DF,  p-value: < 2.2e-16

Not surprisingly, samples with more women, children, and first-class passengers have a higher share of survivors. Let's see how well this relationship predicts the number of survivors in the test set.

#create average passenger characteristics in the test set
test_summary <- test %>% 
  summarise(share_survived=mean(Survived), mean_age=mean(Age), 
            share_women=mean(ifelse(Sex=="female",1,0)),
            share_1st_class = mean(ifelse(Pclass==1,1,0)),
            share_age_under10 = mean(ifelse(Age<10,1,0)),
            n=n())
#predicted based on regression
round(predict(reg_model,test_summary,interval = 'prediction')*nrow(test))
##   fit lwr upr
## 1 162 151 173

The regression approach did quite well, predicting 162 survivors against the actual 158, with a prediction interval of 151 to 173.

3 How do the two approaches compare?

So far, we have evaluated the two approaches using only one test set. To compare them more systematically, I re-sample from the union of the original train and test data sets to create five hundred new train and test splits. I then apply the two approaches to each split and calculate the mean squared error of each approach across these five hundred samples.
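
The evaluation code is not shown in this version of the post; the following is a condensed sketch of what such a loop could look like, covering just the logistic and naive approaches (the split proportions and variable names are my assumptions, not the original code):

#sketch of the evaluation loop: the split proportions mirror the
#original 891/418 split; squared errors are collected for each split
full <- bind_rows(select(train, Survived, Pclass, Age),
                  select(test, Survived, Pclass, Age))
errors <- map_dfr(1:500, function(i) {
  parts <- resample_partition(full, c(train=0.68, test=0.32))
  tr <- as.data.frame(parts$train)
  te <- as.data.frame(parts$test)
  log_m <- glm(as.factor(Survived) ~ Pclass + Age, tr, family="binomial")
  actual <- sum(te$Survived)
  data.frame(log_err=(sum(predict(log_m, te, type="response"))-actual)^2,
             naive_err=(mean(tr$Survived)*nrow(te)-actual)^2)
})
summarise(errors, mse_log=mean(log_err), mse_naive=mean(naive_err))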

Among the classification models, the logistic model did best (had the lowest MSE). XGBoost is a relatively close second. Random Forest is way off. The accuracy of aggregate predictions depends crucially on the accuracy of the estimated probabilities. Logistic regression directly estimates the probability of survival, and XGBoost optimizes a logistic loss function, so both provide decent estimates of the probabilities. In contrast, Random Forest estimates probabilities as the share of trees that classified the example as a success. As pointed out by Olson and Wyner (2018), the share of trees that classified the example as a success has nothing to do with the probability that the example will be a success. (For the same reason, calibration plots for Random Forest tend to be poor.) Although Random Forest can deliver high AUC, its estimated probabilities are inappropriate for aggregation.
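
To see the calibration problem directly, one can bin the predicted probabilities into deciles and compare each bin's mean prediction with the observed survival rate; a minimal sketch reusing the objects from the classify-and-aggregate section:

#quick calibration check for the Random Forest probabilities:
#within each prediction decile, compare the mean predicted
#probability with the observed survival rate
calib <- data.frame(prob=rf_probs, actual=test$Survived) %>%
  mutate(bin=ntile(prob, 10)) %>%
  group_by(bin) %>%
  summarise(mean_pred=mean(prob), mean_actual=mean(actual))
calib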

The aggregate regression model had the lowest MSE of all the approaches, beating even the logistic classification model. The naive predictions are handicapped in this evaluation because the share of survivors in the test data is not independent of the share in the train data: if we sample many survivors into the train set, we will naturally have fewer survivors left for the test set. Even with this handicap, the naive predictions handily beat XGBoost and Random Forest.

4 Conclusion

If we only need aggregate characteristics, estimating and aggregating individual classification probabilities seems like more trouble than it is worth. In many cases the share of survivors in the train data is a pretty good estimate of the share in the test data; the churn rate this month is probably a pretty good estimate of the churn rate next month. More complicated models are worth building if we want to understand what drives survival or churn. They are also worth building when the training data has very different characteristics than the test data, and when these characteristics affect survival or churn. Still, even in these cases, methods optimized for individual classification can be inferior to methods optimized for numerical prediction when a numerical prediction is what we need.