Lab 4: Stacking data frames, comparing means, plotting, ifelse(), tabulating frequencies

Learning objectives:

  • appending (as opposed to merging) data frames using bind_rows() function
  • comparing means using t.test()
  • plotting two variables
  • creating a new variable using ifelse() and case_when() functions
  • tabulating frequencies using table() and crosstab()

0. Introduction

Let’s begin by reading in our NHL regular 2016 season data and the 2016 playoffs. Check out the variables and the number of observations.

library(tidyverse)
library(lubridate)
library(stargazer)
nhl <- read_csv("https://dvorakt.github.io/business_analytics/lab1/NHLseason2016.csv")
playoffs <- read_csv("https://dvorakt.github.io/business_analytics/lab1/NHLplayoffs2016.csv")

1. Appending (as opposed to merging/joining) data frames using bind_rows() function

In Lab 3 we learned how to merge data, i.e. putting two data frames side by side. If we wanted to bring the regular season games and the playoffs together into one data frame, we would want to stack one data frame on top of the other, i.e. append, or bind rows. However, before we do that, let’s create a new variable that will mark whether the game was part of the regular season or playoffs. (Let’s call it regvsplay.)

nhl <- nhl %>% mutate(regvsplay = "regular")
playoffs <- playoffs %>% mutate(regvsplay = "playoffs")

The function we use to stack two data frames on top of each other is bind_rows() and it takes the two data frames as arguments. The names of the variables in each data frame should match so that corresponding columns are stacked on top of each other.

nhl <- bind_rows(nhl, playoffs)

How many observations do we now have in nhl?

2. Testing for differences in means

Suppose we are interested in the difference between average attendance at regular season games vs attendance at playoff games. We can use feed stargazer filtered NHL data, but if we want to also see if the difference is statistically significant, we can use t.test(). The argument t.test() takes is a formula. The syntax for entering a formula is y ~ x where y is a numeric variable and x is character variable that identifies the two groups.

nhl <- as.data.frame(nhl)
stargazer(filter(nhl, regvsplay=="regular"), type="text")
## 
## =====================================================================
## Statistic     N      Mean    St. Dev.   Min  Pctl(25) Pctl(75)  Max  
## ---------------------------------------------------------------------
## goals_home  1,230   2.815      1.597     0      2        4       8   
## goals_visit 1,230   2.609      1.551     0      1        4       9   
## attendance  1,230 17,571.820 3,142.400 9,021  16,440   19,070  67,246
## ---------------------------------------------------------------------
stargazer(filter(nhl, regvsplay=="playoffs"), type="text")
## 
## ===================================================================
## Statistic   N     Mean    St. Dev.   Min   Pctl(25) Pctl(75)  Max  
## -------------------------------------------------------------------
## goals_home  91   2.670      1.484     0       2        4       6   
## goals_visit 91   2.582      1.585     0       1       3.5      6   
## attendance  91 18,512.510 1,243.239 15,795  17,562   19,092  22,260
## -------------------------------------------------------------------
t.test(nhl$attendance ~ nhl$regvsplay)
## 
##  Welch Two Sample t-test
## 
## data:  nhl$attendance by nhl$regvsplay
## t = 5.9479, df = 192.04, p-value = 1.263e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   628.7419 1252.6333
## sample estimates:
## mean in group playoffs  mean in group regular 
##               18512.51               17571.82

Do you remember how to interpret a t-stat, or the p-value? Is the difference in attendance statistically significant?

3. Plotting two lines on one graph (adding layers vs using a variable to map color)

In Lab 3 we plotted the GE and AAPL in one graph. We did that by including two geom_line() layers, and asking each layer to use a different variable for what should be mapped to y axis (..+geom_line(aes(y=SP500)) + geom_line(aes(y=NASDQ))). We used two layers and two geom_ functions because we had two variables. Suppose we have just one variable, X, that we want to plot, but there is another variable, Y, that identifies different groups of observations. If we want to plot X by different groups identified by Y, we can map color to Y. For example, if we want to plot attendance by whether or not the game is a regular season game or a playoff game, we can do the following.

ggplot(nhl) + geom_density(aes(x=attendance, color=regvsplay), adjust=3)

Notice that we included regvsplay in the aes() function telling ggplot to map attendance to x axis and regvsplay to color.

In contrast, suppose we wanted to plot the density of goals by home team vs the density of goals by visiting team. The way our data is organized we have two variables, so we are going to use two geom_ functions.

ggplot(nhl) + geom_density(aes(x=goals_home), color="blue", adjust=3) +
              geom_density(aes(x=goals_visit), color="red", adjust=3) +
              xlab("goals")

IN-CLASS EXERCISE:

  1. Use function wday(nhl$Date, label=TRUE) from package lubridate to create day of the week variable in your NHL data.

  2. Plot densities of attendance by day of the week. Which days have the highest attendance? Which days have the most reliable attendance?

4. Creating a new variable using the ifelse() and case_when() functions

Suppose I wanted to count wins by the home team. Right now, all we have is goals by each team. Let’s create a variable that equals 1 if the home team wins and 0 if the home team loses. (We would have to rethink this if there were ties in NHL.) We can use the ifelse() function. Its first argument is a logical condition, the second argument is the value the function returns when the logical condition is true, the third argument is the value the function returns when the logical condition is false.

nhl  <- nhl %>% mutate(homewin = ifelse(goals_home-goals_visit>0,1,0))

What percentage of home games are won?

summary(nhl$homewin)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  1.0000  0.5276  1.0000  1.0000

case_when() is similar to ifelse() but is particularly useful when we have more than one logical condition on which to create a new variable. case_when squentially evaluates listed logical conditions and assigns value after ~.

nhl <- nhl %>% 
  mutate(outcome = case_when(
  goals_home-goals_visit > 3 ~ "big home win",
  goals_home-goals_visit > 0 ~ "home win", 
  goals_home-goals_visit < -3 ~ "big home loss",
  goals_home-goals_visit < 0 ~ "home loss"))

Notice that the order in which the conditions are listed matters. Our first condition assigns values “big home win” when goals_home-goal_visit > 3. Our second condition is goals_home-goal_visit > 0 encompases goals_home-goal_visit > 3 but since those values have already been assigned, only home wins by one to three goals are “home wins”.

5. Tabulating frequencies of different values a variable takes on (using table() and crosstab())

Often times we want to tabulate the values that a variable takes on. Function table() does that.

table(nhl$homewin)
## 
##   0   1 
## 624 697
table(nhl$outcome)
## 
## big home loss  big home win     home loss      home win 
##            63            87           561           610

Function crosstab() from package descr can do nice cross-tabulations (i.e. cross tabulating frequencies across two variables)

library(descr)
## Warning: package 'descr' was built under R version 3.6.3
crosstab(nhl$homewin, nhl$regvsplay, plot=FALSE)
##    Cell Contents 
## |-------------------------|
## |                   Count | 
## |-------------------------|
## 
## =========================================
##                nhl$regvsplay
## nhl$homewin    playoffs   regular   Total
## -----------------------------------------
## 0                    45       579     624
## -----------------------------------------
## 1                    46       651     697
## -----------------------------------------
## Total                91      1230    1321
## =========================================

IN-CLASS EXERCISE:

  1. How would you create a variable that would identify each team as either Canadian or U.S.?

  2. Is attendance higher at Canadian home games than at U.S. home games?



Exercises

  1. Load in the merged NYC weather and DJI data from the previous lab. You can run your previous lab and add function write_csv() at the end to save your data. At the beginning of this lab, you can use read_csv() to load the data in. Make sure that you are in your business analytics project so that R know which folder to read the data from.

  2. Create a new variable that takes on value yes if there was some precipitation and no otherwise. How many days were there with precipiation compared to days without?

  3. What is the mean return on days when there was precipitation compared to when there was no precipitation? Is the difference between mean DJI return with or without precipitation statistically significant? (Hint: You have to do a difference in means test here. If you remember the formula, you can do it manually. If you don’t remember the formula or don’t feel like punching the numbers into the calculator, you can use t.test().)

  4. Plot the densities of daily returns when there is no precipitation overlayed with density of returns when there is precipitation. (Hint: Returns are volatile and sometimes can take extreme values. If you want your plot to only display density withing a certain range you can do so buy adding this to your plot + coord_cartesian(xlim = c(-2, 2))).

  5. What do the two densities show - what is your interpretation of this analysis?