ifelse()
, tabulating frequenciesbind_rows()
functiont.test()
ifelse()
and case_when()
functionstable()
and crosstab()
Let’s begin by reading in our NHL regular 2016 season data and the 2016 playoffs. Check out the variables and the number of observations.
library(tidyverse)
library(lubridate)
library(stargazer)
nhl <- read_csv("https://dvorakt.github.io/business_analytics/lab1/NHLseason2016.csv")
playoffs <- read_csv("https://dvorakt.github.io/business_analytics/lab1/NHLplayoffs2016.csv")
bind_rows()
functionIn Lab 3 we learned how to merge data, i.e. putting two data frames side by side. If we wanted to bring the regular season games and the playoffs together into one data frame, we would want to stack one data frame on top of the other, i.e. append, or bind rows. However, before we do that, let’s create a new variable that will mark whether the game was part of the regular season or playoffs. (Let’s call it regvsplay
.)
nhl <- nhl %>% mutate(regvsplay = "regular")
playoffs <- playoffs %>% mutate(regvsplay = "playoffs")
The function we use to stack two data frames on top of each other is bind_rows()
and it takes the two data frames as arguments. The names of the variables in each data frame should match so that corresponding columns are stacked on top of each other.
nhl <- bind_rows(nhl, playoffs)
How many observations do we now have in nhl
?
Suppose we are interested in the difference between average attendance at regular season games vs attendance at playoff games. We can use feed stargazer filtered NHL data, but if we want to also see if the difference is statistically significant, we can use t.test()
. The argument t.test()
takes is a formula. The syntax for entering a formula is y ~ x
where y
is a numeric variable and x
is character variable that identifies the two groups.
nhl <- as.data.frame(nhl)
stargazer(filter(nhl, regvsplay=="regular"), type="text")
##
## =====================================================================
## Statistic N Mean St. Dev. Min Pctl(25) Pctl(75) Max
## ---------------------------------------------------------------------
## goals_home 1,230 2.815 1.597 0 2 4 8
## goals_visit 1,230 2.609 1.551 0 1 4 9
## attendance 1,230 17,571.820 3,142.400 9,021 16,440 19,070 67,246
## ---------------------------------------------------------------------
stargazer(filter(nhl, regvsplay=="playoffs"), type="text")
##
## ===================================================================
## Statistic N Mean St. Dev. Min Pctl(25) Pctl(75) Max
## -------------------------------------------------------------------
## goals_home 91 2.670 1.484 0 2 4 6
## goals_visit 91 2.582 1.585 0 1 3.5 6
## attendance 91 18,512.510 1,243.239 15,795 17,562 19,092 22,260
## -------------------------------------------------------------------
t.test(nhl$attendance ~ nhl$regvsplay)
##
## Welch Two Sample t-test
##
## data: nhl$attendance by nhl$regvsplay
## t = 5.9479, df = 192.04, p-value = 1.263e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 628.7419 1252.6333
## sample estimates:
## mean in group playoffs mean in group regular
## 18512.51 17571.82
Do you remember how to interpret a t-stat, or the p-value? Is the difference in attendance statistically significant?
In Lab 3 we plotted the GE and AAPL in one graph. We did that by including two geom_line()
layers, and asking each layer to use a different variable for what should be mapped to y axis (..+geom_line(aes(y=SP500)) + geom_line(aes(y=NASDQ))
). We used two layers and two geom_
functions because we had two variables. Suppose we have just one variable, X, that we want to plot, but there is another variable, Y, that identifies different groups of observations. If we want to plot X by different groups identified by Y, we can map color to Y. For example, if we want to plot attendance by whether or not the game is a regular season game or a playoff game, we can do the following.
ggplot(nhl) + geom_density(aes(x=attendance, color=regvsplay), adjust=3)
Notice that we included regvsplay
in the aes()
function telling ggplot to map attendance to x axis and regvsplay
to color.
In contrast, suppose we wanted to plot the density of goals by home team vs the density of goals by visiting team. The way our data is organized we have two variables, so we are going to use two geom_
functions.
ggplot(nhl) + geom_density(aes(x=goals_home), color="blue", adjust=3) +
geom_density(aes(x=goals_visit), color="red", adjust=3) +
xlab("goals")
Use function wday(nhl$Date, label=TRUE)
from package lubridate to create day of the week variable in your NHL data.
Plot densities of attendance by day of the week. Which days have the highest attendance? Which days have the most reliable attendance?
ifelse()
and case_when()
functionsSuppose I wanted to count wins by the home team. Right now, all we have is goals by each team. Let’s create a variable that equals 1 if the home team wins and 0 if the home team loses. (We would have to rethink this if there were ties in NHL.) We can use the ifelse()
function. Its first argument is a logical condition, the second argument is the value the function returns when the logical condition is true, the third argument is the value the function returns when the logical condition is false.
nhl <- nhl %>% mutate(homewin = ifelse(goals_home-goals_visit>0,1,0))
What percentage of home games are won?
summary(nhl$homewin)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 1.0000 0.5276 1.0000 1.0000
case_when()
is similar to ifelse()
but is particularly useful when we have more than one logical condition on which to create a new variable. case_when
squentially evaluates listed logical conditions and assigns value after ~
.
nhl <- nhl %>%
mutate(outcome = case_when(
goals_home-goals_visit > 3 ~ "big home win",
goals_home-goals_visit > 0 ~ "home win",
goals_home-goals_visit < -3 ~ "big home loss",
goals_home-goals_visit < 0 ~ "home loss"))
Notice that the order in which the conditions are listed matters. Our first condition assigns values “big home win” when goals_home-goal_visit > 3
. Our second condition is goals_home-goal_visit > 0
encompases goals_home-goal_visit > 3
but since those values have already been assigned, only home wins by one to three goals are “home wins”.
table()
and crosstab()
)Often times we want to tabulate the values that a variable takes on. Function table()
does that.
table(nhl$homewin)
##
## 0 1
## 624 697
table(nhl$outcome)
##
## big home loss big home win home loss home win
## 63 87 561 610
Function crosstab()
from package descr
can do nice cross-tabulations (i.e. cross tabulating frequencies across two variables)
library(descr)
## Warning: package 'descr' was built under R version 3.6.3
crosstab(nhl$homewin, nhl$regvsplay, plot=FALSE)
## Cell Contents
## |-------------------------|
## | Count |
## |-------------------------|
##
## =========================================
## nhl$regvsplay
## nhl$homewin playoffs regular Total
## -----------------------------------------
## 0 45 579 624
## -----------------------------------------
## 1 46 651 697
## -----------------------------------------
## Total 91 1230 1321
## =========================================
How would you create a variable that would identify each team as either Canadian or U.S.?
Is attendance higher at Canadian home games than at U.S. home games?
Load in the merged NYC weather and DJI data from the previous lab. You can run your previous lab and add function write_csv()
at the end to save your data. At the beginning of this lab, you can use read_csv()
to load the data in. Make sure that you are in your business analytics project
so that R know which folder to read the data from.
Create a new variable that takes on value yes
if there was some precipitation and no
otherwise. How many days were there with precipiation compared to days without?
What is the mean return on days when there was precipitation compared to when there was no precipitation? Is the difference between mean DJI return with or without precipitation statistically significant? (Hint: You have to do a difference in means test here. If you remember the formula, you can do it manually. If you don’t remember the formula or don’t feel like punching the numbers into the calculator, you can use t.test()
.)
Plot the densities of daily returns when there is no precipitation overlayed with density of returns when there is precipitation. (Hint: Returns are volatile and sometimes can take extreme values. If you want your plot to only display density withing a certain range you can do so buy adding this to your plot + coord_cartesian(xlim = c(-2, 2))
).
What do the two densities show - what is your interpretation of this analysis?