Lab 2: Manipulating Data I

Learning objectives:

  • selecting variables using select()
  • selecting observations using filter()
  • using logical operators ==, !=, &, and |
  • sorting data using arrange()
  • creating new variables using mutate()
  • summarizing and plotting data

1. Selecting variables

Let’s bring up the AAPL closing values data and rename the adjusted closing price to appl as we did in the previous lab.

library(tidyverse)
mydata <- read_csv("https://query1.finance.yahoo.com/v7/finance/download/AAPL?period1=1&period2=2000000000&interval=1d&events=history",
                 col_types=cols(Date=col_date(),
                                Open=col_double(),
                                High=col_double(),
                                Low=col_double(),
                                Close=col_double(), 
                                `Adj Close`=col_double(),
                                Volume=col_double()))
mydata <- rename(mydata, appl=`Adj Close`)

The mydata data frame contains a number of variables we do not intend to use. Let’s use the select() function to keep only the variables we plan to use. The function takes the name of a data frame as the first argument and the names of the variables to keep as the following arguments.

mydata <- select(mydata, Date, appl)

2. Sorting data

Often we like our data frames to be sorted a certain way, e.g. by date in ascending chronological order, …

mydata <- arrange(mydata, Date)
head(mydata, n=3)
## # A tibble: 3 x 2
##   Date        appl
##   <date>     <dbl>
## 1 1980-12-12 0.407
## 2 1980-12-15 0.386
## 3 1980-12-16 0.357

…or by date in descending chronological order. (I want to keep mydata in ascending order, so I create a new data frame with descending order.)

mydata_descending <- arrange(mydata, desc(Date))
head(mydata_descending, n=3)
## # A tibble: 3 x 2
##   Date        appl
##   <date>     <dbl>
## 1 2020-04-03  243.
## 2 2020-04-02  245.
## 3 2020-04-01  241.

3. Selecting observations

Let’s look at data starting on January 1, 1990. The filter() function keeps only observations that satisfy the condition inside the function. Note that to compare our Date variable to a date, we need to create that date using the as.Date() function.

mydata <- filter(mydata, Date >= as.Date("1990-01-01"))
head(mydata, n=3)
## # A tibble: 3 x 2
##   Date        appl
##   <date>     <dbl>
## 1 1990-01-02  1.08
## 2 1990-01-03  1.09
## 3 1990-01-04  1.09

4. Using logical operators

The >= logical operator is pretty straightforward. The syntax for the other operators is as follows: == is EQUAL; != is NOT EQUAL; & is AND; | is OR. For example, the code below keeps rows for March 1, 2010, or rows where the Apple price is between 100 and 101.

not_useful_data_frame <- filter(mydata, Date==as.Date("2010-03-01") | (appl > 100 & appl<101))
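As a quick illustration of != (which the example above does not use), the sketch below drops a single date; the date chosen here is arbitrary:

```r
# Keep every row except March 1, 2010 (an arbitrary example date)
all_but_one_day <- filter(mydata, Date != as.Date("2010-03-01"))
```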

5. Creating new variables

Let’s calculate the daily returns on the Apple stock, i.e. the percentage change from the previous day’s adjusted closing price to today’s adjusted closing price. We first create a new variable that equals the previous day’s adjusted closing price using the lag() function, which takes the previous observation. We can do this inside the mutate() function.

mydata <- mutate(mydata, appl_lag=lag(appl))

And then create another new variable called appl_return:

mydata <- mutate(mydata, appl_return = (appl-appl_lag)/appl_lag*100)
print(mydata, n=3)
## # A tibble: 7,624 x 4
##   Date        appl appl_lag appl_return
##   <date>     <dbl>    <dbl>       <dbl>
## 1 1990-01-02  1.08    NA         NA    
## 2 1990-01-03  1.09     1.08       0.671
## 3 1990-01-04  1.09     1.09       0.333
## # ... with 7,621 more rows

6. Summarizing and plotting data

summary(mydata$appl_return)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
## -51.86916  -1.23321   0.02235   0.11073   1.41239  33.22783         1

We see the minimum return was around -51%. Let’s find out when this happened by sorting the data by the return and looking at the first few observations:

mydata <- arrange(mydata, appl_return)
head(mydata, n=3)
## # A tibble: 3 x 4
##   Date         appl appl_lag appl_return
##   <date>      <dbl>    <dbl>       <dbl>
## 1 2000-09-29  1.60      3.32       -51.9
## 2 1993-07-16  0.825     1.07       -23.1
## 3 2008-09-29 13.1      15.9        -17.9

Finally, let’s plot the histogram of the daily return.

ggplot(mydata, aes(x=appl_return)) + geom_histogram() 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1 rows containing non-finite values (stat_bin).
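To address the message from stat_bin(), we can set the bin width explicitly. A binwidth of 1 (one percentage point per bin) is just one reasonable choice, not the only one:

```r
# Histogram with an explicit bin width of one percentage point
ggplot(mydata, aes(x=appl_return)) + geom_histogram(binwidth=1)
```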

IN-CLASS EXERCISE

  1. Load in the NHL data from https://dvorakt.github.io/business_analytics/lab1/NHLseason2016.csv
  2. Create a new variable score that is the difference between the goals scored by the home team and the goals scored by the visiting team.
  3. Which two matches were the most lopsided during the 2016 season? That is, in which match did the home team win by the largest margin, and in which did it lose by the largest margin?

7. Turning daily data into monthly (application of lubridate, month(), lead() and filter())

Let’s turn our daily AAPL data into monthly data by keeping only the last trading day of each month. This will allow us to calculate monthly returns from the close of the previous month to the close of this month. We will use the month() function to identify the month component of each date. This function is part of the lubridate package.

library(lubridate)
mydata <- mutate(mydata, month=month(Date))
mydata <- arrange(mydata, Date)
print(mydata,n=25) #show the first 25 observations
## # A tibble: 7,624 x 5
##    Date        appl appl_lag appl_return month
##    <date>     <dbl>    <dbl>       <dbl> <dbl>
##  1 1990-01-02 1.08    NA          NA         1
##  2 1990-01-03 1.09     1.08        0.671     1
##  3 1990-01-04 1.09     1.09        0.333     1
##  4 1990-01-05 1.09     1.09        0.332     1
##  5 1990-01-08 1.10     1.09        0.662     1
##  6 1990-01-09 1.09     1.10       -0.987     1
##  7 1990-01-10 1.04     1.09       -4.32      1
##  8 1990-01-11 0.999    1.04       -4.17      1
##  9 1990-01-12 0.999    0.999       0         1
## 10 1990-01-15 0.992    0.999      -0.725     1
## 11 1990-01-16 1.01     0.992       1.82      1
## 12 1990-01-17 0.963    1.01       -4.66      1
## 13 1990-01-18 0.938    0.963      -2.63      1
## 14 1990-01-19 0.992    0.938       5.79      1
## 15 1990-01-22 0.963    0.992      -2.92      1
## 16 1990-01-23 0.977    0.963       1.50      1
## 17 1990-01-24 0.985    0.977       0.741     1
## 18 1990-01-25 0.988    0.985       0.368     1
## 19 1990-01-26 0.949    0.988      -4.03      1
## 20 1990-01-29 0.963    0.949       1.53      1
## 21 1990-01-30 0.985    0.963       2.26      1
## 22 1990-01-31 0.985    0.985       0         1
## 23 1990-02-01 0.974    0.985      -1.10      2
## 24 1990-02-02 0.992    0.974       1.86      2
## 25 1990-02-05 1.01     0.992       2.19      2
## # ... with 7,599 more rows

We want to keep only the observations for which the next observation’s month is different. We can access the next observation using the lead() function, and then keep only observations whose month differs from the next observation’s month. Note that lead() returns NA for the last observation, and filter() drops rows where the condition is NA, so the very last day in the data is dropped as well. Also, let’s put the monthly data into a different data frame.

monthly <- filter(mydata, month != lead(month))
monthly <- select(monthly, Date, month, appl)
head(monthly, n=3)
## # A tibble: 3 x 3
##   Date       month  appl
##   <date>     <dbl> <dbl>
## 1 1990-01-31     1 0.985
## 2 1990-02-28     2 0.988
## 3 1990-03-30     3 1.17

Excellent, monthly now contains only the last trading day of each month.
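With the month-end observations in hand, the monthly return calculation mirrors the daily one. A sketch (the variable names appl_lag and appl_monthly_return are my own choices, not from the lab):

```r
# Lag the month-end price, then compute the month-over-month percentage change
monthly <- mutate(monthly, appl_lag = lag(appl))
monthly <- mutate(monthly, appl_monthly_return = (appl - appl_lag)/appl_lag*100)
```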

IN-CLASS EXERCISE

  1. Create a subset of the NHL data that contains games played by the Boston Bruins.
  2. SUPER CHALLENGE: How many games did the Boston Bruins win?


HOMEWORK EXERCISE:

Write an R Markdown file that does the analysis and answers the questions below. Knit the R Markdown into an HTML or a pdf file.

  1. Load in closing values of GE stock. Rename Adj Close to ge.
  2. Calculate a new variable that has daily returns on GE. (Make sure your data is properly sorted before calculating returns.)
  3. Calculate the mean daily return between 1980 and 1990. (Hint: you will need two conditions in your filter() function: Date has to be greater than or equal to 1/1/1980 AND less than 1/1/1990; AND in R is specified with the & sign.)
  4. What were the best and worst days for GE since 1990? (Hint: the opposite of the head() function is tail().)
  5. Plot the histogram of daily returns on GE since 1990, and the time plot of daily returns since 1990. Are the two graphs consistent with each other?
  6. Based on the graphs you created, would you say that volatility of GE price is constant over time?
  7. Create monthly data of (end of the month) closing GE prices.
  8. Calculate monthly price returns on GE. What is the average monthly return? Which months were the best and worst?
  9. Download the data “covid by county with population.csv” from Nexus. The data includes the daily number of coronavirus cases and deaths for each county in the U.S., and comes from the NYT’s GitHub page. I also included information on each county’s population, which comes from the American Community Survey. How many observations do we have? What is the unit of observation?
  10. Create a dataframe that includes only the most recent number of cases for each county. How many observations do you have now?
  11. Create a new variable that has the number of cases per 1000 people. Show the top 6 counties with the highest number of cases per 1000 people.