- selecting variables using `select()`
- selecting observations using `filter()`
- using logical operators `==`, `!=`, `&`, `|`
- sorting data using `arrange()`
- creating new variables using `mutate()`
- summarizing and plotting data

Let’s bring up the AAPL closing values data and rename the adjusted closing price to `appl`, as we did in the previous lab.

```
library(tidyverse)
mydata <- read_csv("https://query1.finance.yahoo.com/v7/finance/download/AAPL?period1=1&period2=2000000000&interval=1d&events=history",
                   col_types = cols(Date = col_date(),
                                    Open = col_double(),
                                    High = col_double(),
                                    Low = col_double(),
                                    Close = col_double(),
                                    `Adj Close` = col_double(),
                                    Volume = col_double()))
mydata <- rename(mydata, appl = `Adj Close`)
```

The `mydata` data frame contains a number of variables we do not intend to use. Let’s use the function `select()` to keep only the variables we plan to use. The function takes the name of a data frame as its first argument, and the names of the variables to keep as the following arguments.

`mydata <- select(mydata, Date, appl)`

Often we like our data frames to be sorted a certain way, e.g. by date in ascending chronological order, …

```
mydata <- arrange(mydata, Date)
head(mydata, n=3)
```

```
## # A tibble: 3 x 2
## Date appl
## <date> <dbl>
## 1 1980-12-12 0.407
## 2 1980-12-15 0.386
## 3 1980-12-16 0.357
```

…or by date in descending chronological order. (I want to keep `mydata` in ascending order, so I create a new data frame with descending order.)

```
mydata_descending <- arrange(mydata, desc(Date))
head(mydata_descending, n=3)
```

```
## # A tibble: 3 x 2
## Date appl
## <date> <dbl>
## 1 2020-04-03 243.
## 2 2020-04-02 245.
## 3 2020-04-01 241.
```

Let’s look at data starting on January 1, 1990. The function `filter()` keeps only the observations that satisfy the condition inside the function. Note that to compare our `Date` variable to a date, we need to create a date using the `as.Date()` function.

```
mydata <- filter(mydata, Date >= as.Date("1990-01-01"))
head(mydata, n=3)
```

```
## # A tibble: 3 x 2
## Date appl
## <date> <dbl>
## 1 1990-01-02 1.08
## 2 1990-01-03 1.09
## 3 1990-01-04 1.09
```

The `>=` logical operator is pretty straightforward. The syntax for the other operators is as follows: `==` is EQUAL; `!=` is NOT EQUAL; `&` is AND; `|` is OR. For example, the code below keeps the rows for March 1, 2010, as well as the rows where the Apple price is between 100 and 101.

`not_useful_data_frame <- filter(mydata, Date==as.Date("2010-03-01") | (appl > 100 & appl<101))`

Let’s calculate the daily returns on the Apple stock, i.e. the percentage change from the previous day’s adjusted closing price to today’s adjusted closing price. We first create a new variable that equals the previous day’s adjusted closing price. We use the function `lag()`, which takes the previous observation, inside the `mutate()` function.

`mydata <- mutate(mydata, appl_lag=lag(appl))`

Then we create another new variable called `appl_return`:

```
mydata <- mutate(mydata, appl_return = (appl-appl_lag)/appl_lag*100)
print(mydata, n=3)
```

```
## # A tibble: 7,624 x 4
## Date appl appl_lag appl_return
## <date> <dbl> <dbl> <dbl>
## 1 1990-01-02 1.08 NA NA
## 2 1990-01-03 1.09 1.08 0.671
## 3 1990-01-04 1.09 1.09 0.333
## # ... with 7,621 more rows
```

`summary(mydata$appl_return)`

```
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -51.86916 -1.23321 0.02235 0.11073 1.41239 33.22783 1
```

We see the minimum return was around -51%. Let’s find out when this happened by sorting the data by the return and looking at a few observations:

```
mydata <- arrange(mydata, appl_return)
head(mydata, n=3)
```

```
## # A tibble: 3 x 4
## Date appl appl_lag appl_return
## <date> <dbl> <dbl> <dbl>
## 1 2000-09-29 1.60 3.32 -51.9
## 2 1993-07-16 0.825 1.07 -23.1
## 3 2008-09-29 13.1 15.9 -17.9
```

Finally, let’s plot the histogram of the daily return.

`ggplot(mydata, aes(x=appl_return)) + geom_histogram()`

```
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1 rows containing non-finite values (stat_bin).
```
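A histogram shows the distribution of returns, but not how they evolve. A time plot is a natural complement; a minimal sketch using the same `mydata` frame (`ggplot2` is loaded as part of the tidyverse):

```
# Time plot of daily returns: Date on the x-axis, daily return on the y-axis.
# geom_line() connects the points in order of the x variable.
ggplot(mydata, aes(x = Date, y = appl_return)) +
  geom_line()
```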

- Load in the NHL data from https://dvorakt.github.io/business_analytics/lab1/NHLseason2016.csv
- Create a new variable score which is a difference between goals scored by the home team minus goals scored by the visiting team.
- Which two matches were the most lopsided during the 2016 season, i.e. in which two matches did the home team win and lose by the largest margins?

Let’s turn our daily AAPL data into monthly data by keeping only the last day of each month, using `lubridate`, `month()`, `lead()` and `filter()`. This will allow us to calculate monthly returns from the close of the previous month to the close of this month. We will use the function `month()` to identify the month component of each date. This function is part of the `lubridate` package.

```
library(lubridate)
mydata <- mutate(mydata, month=month(Date))
mydata <- arrange(mydata, Date)
print(mydata,n=25) #show the first 25 observations
```

```
## # A tibble: 7,624 x 5
## Date appl appl_lag appl_return month
## <date> <dbl> <dbl> <dbl> <dbl>
## 1 1990-01-02 1.08 NA NA 1
## 2 1990-01-03 1.09 1.08 0.671 1
## 3 1990-01-04 1.09 1.09 0.333 1
## 4 1990-01-05 1.09 1.09 0.332 1
## 5 1990-01-08 1.10 1.09 0.662 1
## 6 1990-01-09 1.09 1.10 -0.987 1
## 7 1990-01-10 1.04 1.09 -4.32 1
## 8 1990-01-11 0.999 1.04 -4.17 1
## 9 1990-01-12 0.999 0.999 0 1
## 10 1990-01-15 0.992 0.999 -0.725 1
## 11 1990-01-16 1.01 0.992 1.82 1
## 12 1990-01-17 0.963 1.01 -4.66 1
## 13 1990-01-18 0.938 0.963 -2.63 1
## 14 1990-01-19 0.992 0.938 5.79 1
## 15 1990-01-22 0.963 0.992 -2.92 1
## 16 1990-01-23 0.977 0.963 1.50 1
## 17 1990-01-24 0.985 0.977 0.741 1
## 18 1990-01-25 0.988 0.985 0.368 1
## 19 1990-01-26 0.949 0.988 -4.03 1
## 20 1990-01-29 0.963 0.949 1.53 1
## 21 1990-01-30 0.985 0.963 2.26 1
## 22 1990-01-31 0.985 0.985 0 1
## 23 1990-02-01 0.974 0.985 -1.10 2
## 24 1990-02-02 0.992 0.974 1.86 2
## 25 1990-02-05 1.01 0.992 2.19 2
## # ... with 7,599 more rows
```

We only want to keep the observations for which the next month is different. We can access the next observation using the function `lead()`. We then keep only the observations for which the next month is different from this month. Also, let’s put the monthly data into a different data frame.

```
monthly <- filter(mydata, month != lead(month))
monthly <- select(monthly, Date, month, appl)
head(monthly, n=3)
```

```
## # A tibble: 3 x 3
## Date month appl
## <date> <dbl> <dbl>
## 1 1990-01-31 1 0.985
## 2 1990-02-28 2 0.988
## 3 1990-03-30 3 1.17
```

Excellent, `monthly` now contains only the last trading day of each month.
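The text above promised monthly returns from the close of the previous month to the close of this month. With the `monthly` frame in hand, they can be computed the same way as the daily ones, with `lag()` now reaching back one month rather than one day. A sketch (the variable name `appl_monthly_return` is my own choice):

```
# Percentage change from last month's close to this month's close
monthly <- mutate(monthly, appl_monthly_return = (appl - lag(appl)) / lag(appl) * 100)
summary(monthly$appl_monthly_return)
```

As with the daily returns, the first observation has no previous month, so its return is `NA`.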

- Create a subset of the NHL data that contains games played by Boston Bruins.
- SUPER CHALLENGE: How many games did Boston Bruins win?

Write an R Markdown file that does the analysis and answers the questions below. Knit the R Markdown into an HTML or a pdf file.

- Load in closing values of GE stock. Rename Adj Close to `ge`.

- Calculate a new variable that has daily returns on GE. (Make sure your data is properly sorted before calculating returns.)

- Calculate the mean daily return between 1980 and 1990. (Hint: you will need two conditions in your `filter()` function; Date has to be greater than or equal to 1/1/1980 AND less than 1/1/1990. The AND in R is specified with the `&` sign.)

- What were the best and worst days for GE since 1990? (Hint: the opposite of the function `head()` is `tail()`.)

- Plot the histogram of daily returns on GE since 1990, and the time plot of daily returns since 1990. Are the two graphs consistent with each other?

- Based on the graphs you created, would you say that volatility of GE price is constant over time?

- Create monthly data of (end of the month) closing GE prices.

- Calculate monthly price returns on GE. What is the average monthly return? Which months were the best and worst?

- Download data “covid by county with population.csv” from Nexus. The data includes the daily number of coronavirus cases and deaths for each county in the U.S. The data comes from the NYT’s GitHub page. I also included information on each county’s population. The population data comes from the American Community Survey. How many observations do we have? What is the unit of observation?

- Create a dataframe that includes only the most recent number of cases for each county. How many observations do you have now?

- Create a new variable that has the number of cases per 1000 people. Show the top 6 counties with the highest number of cases per 1000 people.