select()
filter()
==
,!=
,&
,|
arrange()
mutate()
Let’s bring up the APPL closing values data and rename the adjusted closing price to appl
as we did in the previous lab.
library(tidyverse)
mydata <- read_csv("https://query1.finance.yahoo.com/v7/finance/download/AAPL?period1=1&period2=2000000000&interval=1d&events=history",
col_types=cols(Date=col_date(),
Open=col_double(),
High=col_double(),
Low=col_double(),
Close=col_double(),
`Adj Close`=col_double(),
Volume=col_double()))
mydata <- rename(mydata, appl=`Adj Close`)
The mydata
data frame contains a number of variables we do not intend to use. Let’s use a function select()
to select the variables we plan to use. The function takes the name of a data frame as the first argument, and the names of the variables to keep as the following arguments.
mydata <- select(mydata, Date, appl)
Often we like our data frames to be sorted a certain way, e.g. by date in ascending chronological order, …
mydata <- arrange(mydata, Date)
head(mydata, n=3)
## # A tibble: 3 x 2
## Date appl
## <date> <dbl>
## 1 1980-12-12 0.407
## 2 1980-12-15 0.386
## 3 1980-12-16 0.357
…or by date in descending chronological order. (I want to keep mydata
in ascending order, so I create a new data frame with descending order.)
mydata_descending <- arrange(mydata, desc(Date))
head(mydata_descending, n=3)
## # A tibble: 3 x 2
## Date appl
## <date> <dbl>
## 1 2020-04-03 243.
## 2 2020-04-02 245.
## 3 2020-04-01 241.
Let’s look at data starting in January 1, 1990. Function filter()
keeps only observations that satisfy the condition inside the function. Note that in comparing our Date
to a date, we needed to create a date using the as.Date()
function.
mydata <- filter(mydata, Date >= as.Date("1990-01-01"))
head(mydata, n=3)
## # A tibble: 3 x 2
## Date appl
## <date> <dbl>
## 1 1990-01-02 1.08
## 2 1990-01-03 1.09
## 3 1990-01-04 1.09
The >=
logical operator is pretty straightforward. The syntax for other operators is as follows: ==
is EQUAL; !=
is NOT EQUAL; &
is AND; |
is OR. For example, the code below will keep rows for March 1, 2010 and when Apple price is between 100 and 101.
not_useful_data_frame <- filter(mydata, Date==as.Date("2010-03-01") | (appl > 100 & appl<101))
Let’s calculate the daily returns on the Apple stock, i.e. the percentage change from the previous day’s adjusted closing price to today’s adjusted closing price. We first create a new variable that equals previous day’s adjusted closing price. We use the function lag()
which takes the previous observation. We can do this using the mutate()
function.
mydata <- mutate(mydata, appl_lag=lag(appl))
And then create another new variable called appl_return
mydata <- mutate(mydata, appl_return = (appl-appl_lag)/appl_lag*100)
print(mydata, n=3)
## # A tibble: 7,624 x 4
## Date appl appl_lag appl_return
## <date> <dbl> <dbl> <dbl>
## 1 1990-01-02 1.08 NA NA
## 2 1990-01-03 1.09 1.08 0.671
## 3 1990-01-04 1.09 1.09 0.333
## # ... with 7,621 more rows
summary(mydata$appl_return)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -51.86916 -1.23321 0.02235 0.11073 1.41239 33.22783 1
We see the minimum return was around -51%. Let’s find out when this happened by sorting the data by the return and looking at a few observations:
mydata <- arrange(mydata, appl_return)
head(mydata, n=3)
## # A tibble: 3 x 4
## Date appl appl_lag appl_return
## <date> <dbl> <dbl> <dbl>
## 1 2000-09-29 1.60 3.32 -51.9
## 2 1993-07-16 0.825 1.07 -23.1
## 3 2008-09-29 13.1 15.9 -17.9
Finally, let’s plot the histogram of the daily return.
ggplot(mydata, aes(x=appl_return)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1 rows containing non-finite values (stat_bin).
lubridate
, month()
, lead()
and filter()
)Let’s turn our daily APPL data into monthly data by keeping only the last days of each month. This will allow us to calculate monthly returns from close of the previous month to close of this month. We will use function month()
to identify the month component of each date. This function is part of the lubridate
package.
library(lubridate)
mydata <- mutate(mydata, month=month(Date))
mydata <- arrange(mydata, Date)
print(mydata,n=25) #show the first 25 observations
## # A tibble: 7,624 x 5
## Date appl appl_lag appl_return month
## <date> <dbl> <dbl> <dbl> <dbl>
## 1 1990-01-02 1.08 NA NA 1
## 2 1990-01-03 1.09 1.08 0.671 1
## 3 1990-01-04 1.09 1.09 0.333 1
## 4 1990-01-05 1.09 1.09 0.332 1
## 5 1990-01-08 1.10 1.09 0.662 1
## 6 1990-01-09 1.09 1.10 -0.987 1
## 7 1990-01-10 1.04 1.09 -4.32 1
## 8 1990-01-11 0.999 1.04 -4.17 1
## 9 1990-01-12 0.999 0.999 0 1
## 10 1990-01-15 0.992 0.999 -0.725 1
## 11 1990-01-16 1.01 0.992 1.82 1
## 12 1990-01-17 0.963 1.01 -4.66 1
## 13 1990-01-18 0.938 0.963 -2.63 1
## 14 1990-01-19 0.992 0.938 5.79 1
## 15 1990-01-22 0.963 0.992 -2.92 1
## 16 1990-01-23 0.977 0.963 1.50 1
## 17 1990-01-24 0.985 0.977 0.741 1
## 18 1990-01-25 0.988 0.985 0.368 1
## 19 1990-01-26 0.949 0.988 -4.03 1
## 20 1990-01-29 0.963 0.949 1.53 1
## 21 1990-01-30 0.985 0.963 2.26 1
## 22 1990-01-31 0.985 0.985 0 1
## 23 1990-02-01 0.974 0.985 -1.10 2
## 24 1990-02-02 0.992 0.974 1.86 2
## 25 1990-02-05 1.01 0.992 2.19 2
## # ... with 7,599 more rows
We only want to keep the observations for which the next month is different. We can access the next observation by using the function lead()
. Then we will keep only observations for which the next month is different from this month. Also, let’s put the monthly data into a different data frame.
monthly <- filter(mydata, month != lead(month))
monthly <- select(monthly, Date, month, appl)
head(monthly, n=3)
## # A tibble: 3 x 3
## Date month appl
## <date> <dbl> <dbl>
## 1 1990-01-31 1 0.985
## 2 1990-02-28 2 0.988
## 3 1990-03-30 3 1.17
Excellent, monthly
now contains only last days of each month.
Write an R Markdown file that does the analysis and answers the questions below. Knit the R Markdown into an HTML or a pdf file.
ge
.filter()
function, Date has to be greater than or equal to 1/1/1980 AND less than 1/1/1990, the AND in R is specified with the &
sign)head()
is tail()
)