Lab 14: Regression Analysis I

Learning objectives:

specify column type in read_csv
explore data
work with dates
use paste() function
use geom_smooth()

1. Introduction

This lab is based on an interesting kaggle competition to predict sales at a German drugstore chain. Here is the description:

Rossmann operates over 3,000 drug stores in 7 European countries. Currently, Rossmann store managers are tasked with predicting their daily sales for up to six weeks in advance. Store sales are influenced by many factors, including promotions, competition, school and state holidays, seasonality, and locality. With thousands of individual managers predicting sales based on their unique circumstances, the accuracy of results can be quite varied.

Rossmann is challenging you to predict 6 weeks of daily sales for 1,115 stores located across Germany. Reliable sales forecasts enable store managers to create effective staff schedules that increase productivity and motivation. By helping Rossmann create a robust prediction model, you will help store managers stay focused on what’s most important to them: their customers and their teams!

Here are the variables available to us:

Id	an Id that represents a (Store, Date) duple within the test set
Store	a unique Id for each store
Sales	the turnover for any given day (this is what you are predicting)
Customers	the number of customers on a given day
Open	an indicator for whether the store was open: 0 = closed, 1 = open
StateHoliday	indicates a state holiday. a = public holiday, b = Easter holiday, c = Christmas, 0 = None
SchoolHoliday i	ndicates if the (Store, Date) was affected by the closure of public schools
StoreType	differentiates between 4 different store models: a, b, c, d
Assortment	describes an assortment level: a = basic, b = extra, c = extended
CompetitionDistance	distance in meters to the nearest competitor store
CompetitionOpenSince	gives the approximate year and month of the time the nearest competitor was opened
Promo	indicates whether a store is running a promo on that day

There are three data sets: rossmann_train.csv, rossmann_store.csv, rossmann_test.csv

2. Specify column type in `read_csv`

Notice that StateHoliday takes values 0, a, b, and c. This means that it should be read as a character, since 0 does not have any numerical meaning. However, if read_csv() sees lots of 0s and only few as, bs etc, it may decide that StateHoliday is a numerical variable and the as etc. are errors. We can force read_csv to read StateHoliday as a character using the col_types option.

library(tidyverse)
library(lubridate)
library(stargazer)
sales <- read_csv("rossmann_train.csv", col_types = cols(StateHoliday = col_character()))
glimpse(sales)

## Observations: 1,017,209
## Variables: 9
## $ Store         <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1...
## $ DayOfWeek     <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, ...
## $ Date          <date> 2015-07-31, 2015-07-31, 2015-07-31, 2015-07-31,...
## $ Sales         <dbl> 5263, 6064, 8314, 13995, 4822, 5651, 15344, 8492...
## $ Customers     <dbl> 555, 625, 821, 1498, 559, 589, 1414, 833, 687, 6...
## $ Open          <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Promo         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ StateHoliday  <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0"...
## $ SchoolHoliday <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, ...

DayOfWeek is also an integer though we are not sure what the numbers represent. (Week starts on Monday in Europe.) So we will use a function wday() from the lubridate package to get the day of the week from a date.

sales$day <- wday(sales$Date, label = TRUE, abbr = TRUE) #creates 'ordered factor'
sales$day <- factor(sales$day, ordered=FALSE) # makes it a factor
head(select(sales, day, DayOfWeek), n=3)

## # A tibble: 3 x 2
##   day   DayOfWeek
##   <fct>     <dbl>
## 1 Fri           5
## 2 Fri           5
## 3 Fri           5

sales <- select(sales, -DayOfWeek) #not needed anymore

3. When are Rossemann stores open?

Let’s see how the StateHoliday correlates with opening of stores.

table(sales$StateHoliday, sales$Open)

##    
##          0      1
##   0 142677 843482
##   a  19566    694
##   b   6545    145
##   c   4029     71

The table above shows us how observations fall into different categories based on whether a store is open (column headings) and whether there is a state holiday and if so what type. It looks like most of the observations fall into no holiday (0) and open (1) category. Holiday a (public holiday) is associate with a lot more stores open than either Easter (b) or Christmas (c). Still, shockingly, we have 140 thousand observations when stores are closed.

IN-CLASS EXERCISE 1: Can you figure out what days of the week are Rossemann stores open?

The data has a variable indicating whether or not a store is open (Open). Let’s check if there are any sales when the stores are not open (Open=0).

summary(select(filter(sales,Open==0),Sales))

##      Sales  
##  Min.   :0  
##  1st Qu.:0  
##  Median :0  
##  Mean   :0  
##  3rd Qu.:0  
##  Max.   :0

Going forward, let’s focus on days when stores are open.

sales <- filter(sales, Open==1)
sales <- select(sales, -Open) #not needed anymore

4. How do sales vary across stores?

Let’s calculate average sales for each store so that we have an idea on how the stores differ in size. Let’s then plot average sales in a bar graph and a histogram. Note that the bars in the bar graph are ordered by minus average sales to so that the stores are ordered from biggest to smallest.

avsales <- sales %>% group_by(Store) %>% summarise(av_sales = mean(Sales))
ggplot(avsales, aes(x=reorder(Store,-av_sales),y=av_sales)) + geom_bar(stat="identity")

ggplot(avsales, aes(x=av_sales)) +  geom_histogram(binwidth = 100)

summary(avsales)

##      Store           av_sales    
##  Min.   :   1.0   Min.   : 2704  
##  1st Qu.: 279.5   1st Qu.: 5322  
##  Median : 558.0   Median : 6590  
##  Mean   : 558.0   Mean   : 6934  
##  3rd Qu.: 836.5   3rd Qu.: 7964  
##  Max.   :1115.0   Max.   :21757

5. How do sales vary over time?

IN-CLASS EXERCISE 2: Plot sales of store number 101 over time.

We would like to see if the pattern of fluctuations over time is typical. One option we have is to average across all stores for each date. However, such average would be dominated by the fluctuations at the largest stores. The other option is that for each store we calculate percent deviation form that store’s average sales. (We’ll call it dsales). For each store and date this variable will tell us by what percentage are sales higher or lower relative to that store’s average sales. We can then average these deviations for each day.

sales <- sales %>% 
  group_by(Store) %>% 
  mutate(dsales = (Sales-mean(Sales))/mean(Sales)*100)
summary(sales$dsales)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -100.000  -19.248   -3.382    0.000   14.941  477.364

We see that half of the store sales are between 19 percent below the store average, and 14 percent above the store average. Let’s now plot these fluctuations by date.

bydate <- sales %>% group_by(Date) %>% summarise(av_dsales=mean(dsales))
ggplot(bydate, aes(x=Date, y=av_dsales)) + geom_line()

IN-CLASS EXERCISE 3: Let’s plot average dsales by day of the week.

IN-CLASS EXERCISE 4: How about by Promo or StateHoliday?

6. Does competition affect sales?

The data set store has information on store characteristics including store’s competition.

stores <- read_csv("rossmann_store.csv")

## Parsed with column specification:
## cols(
##   Store = col_double(),
##   StoreType = col_character(),
##   Assortment = col_character(),
##   CompetitionDistance = col_double(),
##   CompetitionOpenSinceMonth = col_double(),
##   CompetitionOpenSinceYear = col_double(),
##   Promo2 = col_double(),
##   Promo2SinceWeek = col_double(),
##   Promo2SinceYear = col_double(),
##   PromoInterval = col_character()
## )

summary(select(stores, CompetitionDistance, CompetitionOpenSinceMonth, CompetitionOpenSinceYear))

##  CompetitionDistance CompetitionOpenSinceMonth CompetitionOpenSinceYear
##  Min.   :   20.0     Min.   : 1.000            Min.   :1900            
##  1st Qu.:  717.5     1st Qu.: 4.000            1st Qu.:2006            
##  Median : 2325.0     Median : 8.000            Median :2010            
##  Mean   : 5404.9     Mean   : 7.225            Mean   :2009            
##  3rd Qu.: 6882.5     3rd Qu.:10.000            3rd Qu.:2013            
##  Max.   :75860.0     Max.   :12.000            Max.   :2015            
##  NA's   :3           NA's   :354               NA's   :354

Competition distance varies from 20 meters to 75 kilometers. There are also three stores with NAs for competition distance. It is not clear what this means. Let’s ignore these for now, and plot average store sales against the log of distance to competition.

stores <- inner_join(avsales, stores, by="Store")
ggplot(stores, aes(x=CompetitionDistance, y=av_sales)) +  geom_point() +
    scale_x_continuous(trans="log")

## Warning: Removed 3 rows containing missing values (geom_point).

Looking again at the summary for the competition variables, it looks like 25% of stores got their competition in 2013 or later. Let’s create a date to indicate when competition arrived. Since we don’t know the exact day only month, let’s assume it happened on the 15th of the month.

Function paste() pastes together the characters in the variables we list as arguments, with sep= character in between. We use this to create a date. If date since competition is missing, let’s assume it has always been there by setting the date to January 1st, 2000. Notice that I used tidyverse’sif_else() instead of base R ifelse() because ifelse() somewhat oddly does not preserve the date format.

stores <- stores %>%
  mutate(compdate = as.Date(paste(CompetitionOpenSinceYear,
                                 CompetitionOpenSinceMonth,15, sep="-")),
         compdate = if_else(is.na(compdate), as.Date("2000-01-01"), compdate))

Let’s merge the compdate into our sales data set. This way we will be able to see if the entrance of competition has an effect on sales.

sales <- inner_join(sales,
                    select(stores, Store, compdate, StoreType, Assortment),
                    by="Store")  %>%
  mutate(competition = ifelse(Date < compdate, "no", "yes"))

bycompetition <- sales %>% group_by(competition) %>% summarize(av_dsales=mean(dsales))
ggplot(bycompetition, aes(x=competition,y=av_dsales)) + geom_bar(stat="identity")

IN-CLASS EXERCISE 5: Let’s limit to only stores that got competition during the period spanning our data so that we can compare sales before and after competition.

Let’s plot the time series of sales for store number 30. To take out the short term fluctuations by plotting geom_smooth().

ggplot(filter(sales, Store==30), aes(x=Date, y=Sales, color=competition)) + geom_smooth()

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

EXTENDED IN-CLASS EXERCISE: Could we create a data set with sales 100 days before and 100 after competition arrived?

Exercises:

Load in the rossman_train.csv data on Rossmann store sales. We have data for about two and half years. How many different dates in the data are State Holidays?
According to the data, which dates are considered Christmas holidays?
On what days do Germans celebrate Easter? Are these the same dates every year?
Using the lead() function, create a variable that equals “yes” when the store is open and it is the day before a holiday, and equals “no” otherwise.
Using only days when stores are open, are sales higher the day before the holiday than other days?
Are sales higher the day after the holiday than other days?