read_csv
paste()
functiongeom_smooth()
This lab is based on an interesting kaggle competition to predict sales at a German drugstore chain. Here is the description:
Rossmann operates over 3,000 drug stores in 7 European countries. Currently, Rossmann store managers are tasked with predicting their daily sales for up to six weeks in advance. Store sales are influenced by many factors, including promotions, competition, school and state holidays, seasonality, and locality. With thousands of individual managers predicting sales based on their unique circumstances, the accuracy of results can be quite varied.
Rossmann is challenging you to predict 6 weeks of daily sales for 1,115 stores located across Germany. Reliable sales forecasts enable store managers to create effective staff schedules that increase productivity and motivation. By helping Rossmann create a robust prediction model, you will help store managers stay focused on what’s most important to them: their customers and their teams!
Here are the variables available to us:
Id | an Id that represents a (Store, Date) duple within the test set |
Store | a unique Id for each store |
Sales | the turnover for any given day (this is what you are predicting) |
Customers | the number of customers on a given day |
Open | an indicator for whether the store was open: 0 = closed, 1 = open |
StateHoliday | indicates a state holiday. a = public holiday, b = Easter holiday, c = Christmas, 0 = None |
SchoolHoliday i | ndicates if the (Store, Date) was affected by the closure of public schools |
StoreType | differentiates between 4 different store models: a, b, c, d |
Assortment | describes an assortment level: a = basic, b = extra, c = extended |
CompetitionDistance | distance in meters to the nearest competitor store |
CompetitionOpenSince | gives the approximate year and month of the time the nearest competitor was opened |
Promo | indicates whether a store is running a promo on that day |
There are three data sets: rossmann_train.csv, rossmann_store.csv, rossmann_test.csv
read_csv
Notice that StateHoliday
takes values 0
, a
, b
, and c
. This means that it should be read as a character, since 0
does not have any numerical meaning. However, if read_csv()
sees lots of 0
s and only few a
s, b
s etc, it may decide that StateHoliday
is a numerical variable and the a
s etc. are errors. We can force read_csv
to read StateHoliday
as a character using the col_types
option.
library(tidyverse)
library(lubridate)
library(stargazer)
sales <- read_csv("rossmann_train.csv", col_types = cols(StateHoliday = col_character()))
glimpse(sales)
## Observations: 1,017,209
## Variables: 9
## $ Store <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1...
## $ DayOfWeek <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, ...
## $ Date <date> 2015-07-31, 2015-07-31, 2015-07-31, 2015-07-31,...
## $ Sales <dbl> 5263, 6064, 8314, 13995, 4822, 5651, 15344, 8492...
## $ Customers <dbl> 555, 625, 821, 1498, 559, 589, 1414, 833, 687, 6...
## $ Open <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Promo <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ StateHoliday <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0"...
## $ SchoolHoliday <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, ...
DayOfWeek
is also an integer though we are not sure what the numbers represent. (Week starts on Monday in Europe.) So we will use a function wday()
from the lubridate
package to get the day of the week from a date.
sales$day <- wday(sales$Date, label = TRUE, abbr = TRUE) #creates 'ordered factor'
sales$day <- factor(sales$day, ordered=FALSE) # makes it a factor
head(select(sales, day, DayOfWeek), n=3)
## # A tibble: 3 x 2
## day DayOfWeek
## <fct> <dbl>
## 1 Fri 5
## 2 Fri 5
## 3 Fri 5
sales <- select(sales, -DayOfWeek) #not needed anymore
Let’s see how the StateHoliday
correlates with opening of stores.
table(sales$StateHoliday, sales$Open)
##
## 0 1
## 0 142677 843482
## a 19566 694
## b 6545 145
## c 4029 71
The table above shows us how observations fall into different categories based on whether a store is open (column headings) and whether there is a state holiday and if so what type. It looks like most of the observations fall into no holiday (0) and open (1) category. Holiday a (public holiday) is associate with a lot more stores open than either Easter (b) or Christmas (c). Still, shockingly, we have 140 thousand observations when stores are closed.
IN-CLASS EXERCISE 1: Can you figure out what days of the week are Rossemann stores open?
The data has a variable indicating whether or not a store is open (Open
). Let’s check if there are any sales when the stores are not open (Open=0
).
summary(select(filter(sales,Open==0),Sales))
## Sales
## Min. :0
## 1st Qu.:0
## Median :0
## Mean :0
## 3rd Qu.:0
## Max. :0
Going forward, let’s focus on days when stores are open.
sales <- filter(sales, Open==1)
sales <- select(sales, -Open) #not needed anymore
Let’s calculate average sales for each store so that we have an idea on how the stores differ in size. Let’s then plot average sales in a bar graph and a histogram. Note that the bars in the bar graph are ordered by minus average sales to so that the stores are ordered from biggest to smallest.
avsales <- sales %>% group_by(Store) %>% summarise(av_sales = mean(Sales))
ggplot(avsales, aes(x=reorder(Store,-av_sales),y=av_sales)) + geom_bar(stat="identity")
ggplot(avsales, aes(x=av_sales)) + geom_histogram(binwidth = 100)
summary(avsales)
## Store av_sales
## Min. : 1.0 Min. : 2704
## 1st Qu.: 279.5 1st Qu.: 5322
## Median : 558.0 Median : 6590
## Mean : 558.0 Mean : 6934
## 3rd Qu.: 836.5 3rd Qu.: 7964
## Max. :1115.0 Max. :21757
IN-CLASS EXERCISE 2: Plot sales of store number 101 over time.
We would like to see if the pattern of fluctuations over time is typical. One option we have is to average across all stores for each date. However, such average would be dominated by the fluctuations at the largest stores. The other option is that for each store we calculate percent deviation form that store’s average sales. (We’ll call it dsales
). For each store and date this variable will tell us by what percentage are sales higher or lower relative to that store’s average sales. We can then average these deviations for each day.
sales <- sales %>%
group_by(Store) %>%
mutate(dsales = (Sales-mean(Sales))/mean(Sales)*100)
summary(sales$dsales)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -100.000 -19.248 -3.382 0.000 14.941 477.364
We see that half of the store sales are between 19 percent below the store average, and 14 percent above the store average. Let’s now plot these fluctuations by date.
bydate <- sales %>% group_by(Date) %>% summarise(av_dsales=mean(dsales))
ggplot(bydate, aes(x=Date, y=av_dsales)) + geom_line()
IN-CLASS EXERCISE 3: Let’s plot average dsales
by day of the week.
IN-CLASS EXERCISE 4: How about by Promo or StateHoliday?
The data set store
has information on store characteristics including store’s competition.
stores <- read_csv("rossmann_store.csv")
## Parsed with column specification:
## cols(
## Store = col_double(),
## StoreType = col_character(),
## Assortment = col_character(),
## CompetitionDistance = col_double(),
## CompetitionOpenSinceMonth = col_double(),
## CompetitionOpenSinceYear = col_double(),
## Promo2 = col_double(),
## Promo2SinceWeek = col_double(),
## Promo2SinceYear = col_double(),
## PromoInterval = col_character()
## )
summary(select(stores, CompetitionDistance, CompetitionOpenSinceMonth, CompetitionOpenSinceYear))
## CompetitionDistance CompetitionOpenSinceMonth CompetitionOpenSinceYear
## Min. : 20.0 Min. : 1.000 Min. :1900
## 1st Qu.: 717.5 1st Qu.: 4.000 1st Qu.:2006
## Median : 2325.0 Median : 8.000 Median :2010
## Mean : 5404.9 Mean : 7.225 Mean :2009
## 3rd Qu.: 6882.5 3rd Qu.:10.000 3rd Qu.:2013
## Max. :75860.0 Max. :12.000 Max. :2015
## NA's :3 NA's :354 NA's :354
Competition distance varies from 20 meters to 75 kilometers. There are also three stores with NA
s for competition distance. It is not clear what this means. Let’s ignore these for now, and plot average store sales against the log of distance to competition.
stores <- inner_join(avsales, stores, by="Store")
ggplot(stores, aes(x=CompetitionDistance, y=av_sales)) + geom_point() +
scale_x_continuous(trans="log")
## Warning: Removed 3 rows containing missing values (geom_point).
Looking again at the summary for the competition variables, it looks like 25% of stores got their competition in 2013 or later. Let’s create a date to indicate when competition arrived. Since we don’t know the exact day only month, let’s assume it happened on the 15th of the month.
Function paste()
pastes together the characters in the variables we list as arguments, with sep=
character in between. We use this to create a date. If date since competition is missing, let’s assume it has always been there by setting the date to January 1st, 2000. Notice that I used tidyverse’sif_else()
instead of base R ifelse()
because ifelse()
somewhat oddly does not preserve the date format.
stores <- stores %>%
mutate(compdate = as.Date(paste(CompetitionOpenSinceYear,
CompetitionOpenSinceMonth,15, sep="-")),
compdate = if_else(is.na(compdate), as.Date("2000-01-01"), compdate))
Let’s merge the compdate
into our sales data set. This way we will be able to see if the entrance of competition has an effect on sales.
sales <- inner_join(sales,
select(stores, Store, compdate, StoreType, Assortment),
by="Store") %>%
mutate(competition = ifelse(Date < compdate, "no", "yes"))
bycompetition <- sales %>% group_by(competition) %>% summarize(av_dsales=mean(dsales))
ggplot(bycompetition, aes(x=competition,y=av_dsales)) + geom_bar(stat="identity")
IN-CLASS EXERCISE 5: Let’s limit to only stores that got competition during the period spanning our data so that we can compare sales before and after competition.
Let’s plot the time series of sales for store number 30. To take out the short term fluctuations by plotting geom_smooth().
ggplot(filter(sales, Store==30), aes(x=Date, y=Sales, color=competition)) + geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
EXTENDED IN-CLASS EXERCISE: Could we create a data set with sales 100 days before and 100 after competition arrived?
Load in the rossman_train.csv
data on Rossmann store sales. We have data for about two and half years. How many different dates in the data are State Holidays?
According to the data, which dates are considered Christmas holidays?
On what days do Germans celebrate Easter? Are these the same dates every year?
Using the lead()
function, create a variable that equals “yes” when the store is open and it is the day before a holiday, and equals “no” otherwise.
Using only days when stores are open, are sales higher the day before the holiday than other days?
Are sales higher the day after the holiday than other days?