`ifelse()`

, tabulating frequencies- appending (as opposed to merging) data frames using
`bind_rows()`

function - comparing means using
`t.test()`

- plotting two variables
- creating a new variable using
`ifelse()`

and`case_when()`

functions - tabulating frequencies using
`table()`

and`crosstab()`

Let’s begin by reading in our NHL regular 2016 season data and the 2016 playoffs. Check out the variables and the number of observations.

```
library(tidyverse)
library(lubridate)
library(stargazer)
nhl <- read_csv("https://dvorakt.github.io/business_analytics/lab1/NHLseason2016.csv")
playoffs <- read_csv("https://dvorakt.github.io/business_analytics/lab1/NHLplayoffs2016.csv")
```

`bind_rows()`

functionIn Lab 3 we learned how to merge data, i.e. putting two data frames side by side. If we wanted to bring the regular season games and the playoffs together into one data frame, we would want to stack one data frame on top of the other, i.e. append, or bind rows. However, before we do that, let’s create a new variable that will mark whether the game was part of the regular season or playoffs. (Let’s call it `regvsplay`

.)

```
nhl <- nhl %>% mutate(regvsplay = "regular")
playoffs <- playoffs %>% mutate(regvsplay = "playoffs")
```

The function we use to stack two data frames on top of each other is `bind_rows()`

and it takes the two data frames as arguments. The names of the variables in each data frame should match so that corresponding columns are stacked on top of each other.

`nhl <- bind_rows(nhl, playoffs)`

How many observations do we now have in `nhl`

?

Suppose we are interested in the difference between average attendance at regular season games vs attendance at playoff games. We can use feed stargazer filtered NHL data, but if we want to also see if the difference is statistically significant, we can use `t.test()`

. The argument `t.test()`

takes is a formula. The syntax for entering a formula is `y ~ x`

where `y`

is a numeric variable and `x`

is character variable that identifies the two groups.

```
nhl <- as.data.frame(nhl)
stargazer(filter(nhl, regvsplay=="regular"), type="text")
```

```
##
## =====================================================================
## Statistic N Mean St. Dev. Min Pctl(25) Pctl(75) Max
## ---------------------------------------------------------------------
## goals_home 1,230 2.815 1.597 0 2 4 8
## goals_visit 1,230 2.609 1.551 0 1 4 9
## attendance 1,230 17,571.820 3,142.400 9,021 16,440 19,070 67,246
## ---------------------------------------------------------------------
```

`stargazer(filter(nhl, regvsplay=="playoffs"), type="text")`

```
##
## ===================================================================
## Statistic N Mean St. Dev. Min Pctl(25) Pctl(75) Max
## -------------------------------------------------------------------
## goals_home 91 2.670 1.484 0 2 4 6
## goals_visit 91 2.582 1.585 0 1 3.5 6
## attendance 91 18,512.510 1,243.239 15,795 17,562 19,092 22,260
## -------------------------------------------------------------------
```

`t.test(nhl$attendance ~ nhl$regvsplay)`

```
##
## Welch Two Sample t-test
##
## data: nhl$attendance by nhl$regvsplay
## t = 5.9479, df = 192.04, p-value = 1.263e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 628.7419 1252.6333
## sample estimates:
## mean in group playoffs mean in group regular
## 18512.51 17571.82
```

Do you remember how to interpret a t-stat, or the p-value? Is the difference in attendance statistically significant?

In Lab 3 we plotted the GE and AAPL in one graph. We did that by including two `geom_line()`

layers, and asking each layer to use a different variable for what should be mapped to y axis (..+`geom_line(aes(y=SP500)) + geom_line(aes(y=NASDQ))`

). We used two layers and *two* `geom_`

functions because we had *two* variables. Suppose we have just one variable, X, that we want to plot, but there is another variable, Y, that identifies different groups of observations. If we want to plot X by different groups identified by Y, we can map color to Y. For example, if we want to plot attendance by whether or not the game is a regular season game or a playoff game, we can do the following.

`ggplot(nhl) + geom_density(aes(x=attendance, color=regvsplay), adjust=3)`

Notice that we included `regvsplay`

in the `aes()`

function telling ggplot to map attendance to *x* axis and `regvsplay`

to *color*.

In contrast, suppose we wanted to plot the density of goals by home team vs the density of goals by visiting team. The way our data is organized we have *two* variables, so we are going to use two `geom_`

functions.

```
ggplot(nhl) + geom_density(aes(x=goals_home), color="blue", adjust=3) +
geom_density(aes(x=goals_visit), color="red", adjust=3) +
xlab("goals")
```

Use function

`wday(nhl$Date, label=TRUE)`

from package lubridate to create day of the week variable in your NHL data.Plot densities of attendance by day of the week. Which days have the highest attendance? Which days have the most reliable attendance?

`ifelse()`

and `case_when()`

functionsSuppose I wanted to count wins by the home team. Right now, all we have is goals by each team. Let’s create a variable that equals 1 if the home team wins and 0 if the home team loses. (We would have to rethink this if there were ties in NHL.) We can use the `ifelse()`

function. Its first argument is a logical condition, the second argument is the value the function returns when the logical condition is true, the third argument is the value the function returns when the logical condition is false.

`nhl <- nhl %>% mutate(homewin = ifelse(goals_home-goals_visit>0,1,0))`

What percentage of home games are won?

`summary(nhl$homewin)`

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 1.0000 0.5276 1.0000 1.0000
```

`case_when()`

is similar to `ifelse()`

but is particularly useful when we have more than one logical condition on which to create a new variable. `case_when`

squentially evaluates listed logical conditions and assigns value after `~`

.

```
nhl <- nhl %>%
mutate(outcome = case_when(
goals_home-goals_visit > 3 ~ "big home win",
goals_home-goals_visit > 0 ~ "home win",
goals_home-goals_visit < -3 ~ "big home loss",
goals_home-goals_visit < 0 ~ "home loss"))
```

Notice that the *order* in which the conditions are listed *matters*. Our first condition assigns values “big home win” when `goals_home-goal_visit > 3`

. Our second condition is `goals_home-goal_visit > 0`

encompases `goals_home-goal_visit > 3`

but since those values have already been assigned, only home wins by one to three goals are “home wins”.

`table()`

and `crosstab()`

)Often times we want to tabulate the values that a variable takes on. Function `table()`

does that.

`table(nhl$homewin)`

```
##
## 0 1
## 624 697
```

`table(nhl$outcome)`

```
##
## big home loss big home win home loss home win
## 63 87 561 610
```

Function `crosstab()`

from package `descr`

can do nice cross-tabulations (i.e. cross tabulating frequencies across two variables)

`library(descr)`

`## Warning: package 'descr' was built under R version 3.6.3`

`crosstab(nhl$homewin, nhl$regvsplay, plot=FALSE)`

```
## Cell Contents
## |-------------------------|
## | Count |
## |-------------------------|
##
## =========================================
## nhl$regvsplay
## nhl$homewin playoffs regular Total
## -----------------------------------------
## 0 45 579 624
## -----------------------------------------
## 1 46 651 697
## -----------------------------------------
## Total 91 1230 1321
## =========================================
```

How would you create a variable that would identify each team as either Canadian or U.S.?

Is attendance higher at Canadian home games than at U.S. home games?

Load in the merged NYC weather and DJI data from the previous lab. You can run your previous lab and add function

`write_csv()`

at the end to save your data. At the beginning of this lab, you can use`read_csv()`

to load the data in. Make sure that you are in your business analytics`project`

so that R know which folder to read the data from.Create a new variable that takes on value

`yes`

if there was some precipitation and`no`

otherwise. How many days were there with precipiation compared to days without?What is the mean return on days when there was precipitation compared to when there was no precipitation? Is the difference between mean DJI return with or without precipitation statistically significant? (Hint: You have to do a difference in means test here. If you remember the formula, you can do it manually. If you don’t remember the formula or don’t feel like punching the numbers into the calculator, you can use

`t.test()`

.)Plot the densities of daily returns when there is no precipitation overlayed with density of returns when there is precipitation. (Hint: Returns are volatile and sometimes can take extreme values. If you want your plot to only display density withing a certain range you can do so buy adding this to your plot

`+ coord_cartesian(xlim = c(-2, 2))`

).What do the two densities show - what is your interpretation of this analysis?