Lab 5: Collapsing data using group_by() and summarize()

Learning Objectives

  • relational database
  • exploring data
  • collapsing data using group_by() and summarize()*
  • working with character variables: nchar() and paste0()
  • more ways to summarize(): putting mean(), =n() etc. inside summarize()
  • grouping by more than one variable (group_by(var1, var2))

Background reading

Read chapter 5 on Data Transformations from R for Data Science.

1. Relational database

In this lab we will work with data from a marketing analytics firm dunnhumby. Data contains household level transactions over a period of two years from four categories: Pasta, Pasta Sauce, Syrup, and Pancake Mix. The data is given to us in four different files: transactions, products, stores, promotions (they call it causal). The relationships among these four files are illustrated below. The full description of the data is here.



Data that is stored in multiple tables with relations among the tables is called a relational database.

2. Exploring the data

Let's load in the products and transactions data into R.

library(tidyverse)
library(stargazer)
#trans <- read_csv("https://www.dropbox.com/s/30gqnhk3tqsawfa/dh_transactions.csv?raw=1")
trans <- read_csv("dh_transactions.csv")
glimpse(trans)
## Rows: 5,197,681
## Columns: 11
## $ upc                 <chr> "7680850106", "3620000470", "1800028064", "9999...
## $ dollar_sales        <dbl> 0.80, 3.59, 2.25, 0.85, 2.19, 2.19, 3.45, 1.29,...
## $ units               <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ time_of_transaction <chr> "1100", "1100", "1137", "1148", "1323", "1323",...
## $ geography           <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,...
## $ week                <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ household           <dbl> 125434, 125434, 108320, 162016, 89437, 89437, 1...
## $ store               <dbl> 244, 244, 244, 244, 244, 244, 244, 244, 244, 24...
## $ basket              <dbl> 1, 1, 2, 3, 4, 4, 5, 5, 6, 7, 8, 8, 8, 9, 9, 10...
## $ day                 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ coupon              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...

Transactions is a fairly large data set with over 5 million observations. We see that all variables are stored as numerical, with the exception of upc code and time of transaction. household or store ids are also numbers even though they don't have numerical interpretation.

Let's look at summary statistics of some of the key numerical variables.

trans <- as.data.frame(trans)
stargazer(select(trans,dollar_sales, units), type="text", median=TRUE)
## 
## ==============================================================================
## Statistic        N     Mean  St. Dev.   Min   Pctl(25) Median Pctl(75)   Max  
## ------------------------------------------------------------------------------
## dollar_sales 5,197,681 1.757  1.129   -11.760  0.990   1.500   2.190   153.140
## units        5,197,681 1.197  0.569      1       1       1       1       156  
## ------------------------------------------------------------------------------

We see that some of the transactions have negative values for dollar_sales. It is possible that these are merchandise returns. We will filter them out (although we should probably filter out the original purchase as well, but for now, we just drop the returns).

trans <- filter(trans, dollar_sales>0)
head(trans, n=3)
##          upc dollar_sales units time_of_transaction geography week household
## 1 7680850106         0.80     1                1100         2    1    125434
## 2 3620000470         3.59     1                1100         2    1    125434
## 3 1800028064         2.25     1                1137         2    1    108320
##   store basket day coupon
## 1   244      1   1      0
## 2   244      1   1      0
## 3   244      2   1      0

Let's now look at the products data.

#prods <- read_csv("https://www.dropbox.com/s/ja9am94rmwtte0n/dh_product_lookup.csv?raw=1")
prods <- read_csv("dh_product_lookup.csv")
## Parsed with column specification:
## cols(
##   upc = col_double(),
##   product_description = col_character(),
##   commodity = col_character(),
##   brand = col_character(),
##   product_size = col_character()
## )
glimpse(prods)
## Rows: 927
## Columns: 5
## $ upc                 <dbl> 111112360, 566300023, 566300028, 566300029, 566...
## $ product_description <chr> "VINCENT S ORIG MARINARA S", "PINE MOUNTAIN SYR...
## $ commodity           <chr> "pasta sauce", "syrups", "syrups", "syrups", "s...
## $ brand               <chr> "Vincent's", "Pine Mountain", "Miller", "Miller...
## $ product_size        <chr> "25 OZ", "40 OZ", "19 OZ", "12 OZ", "19 OZ", "2...
head(prods, n=3)
## Warning: `...` is not empty.
## 
## We detected these problematic arguments:
## * `needs_dots`
## 
## These dots only exist to allow future extensions and should be empty.
## Did you misspecify an argument?
## # A tibble: 3 x 5
##         upc product_description       commodity   brand         product_size
##       <dbl> <chr>                     <chr>       <chr>         <chr>       
## 1 111112360 VINCENT S ORIG MARINARA S pasta sauce Vincent's     25 OZ       
## 2 566300023 PINE MOUNTAIN SYRUP       syrups      Pine Mountain 40 OZ       
## 3 566300028 MILLER CANE SYRUP         syrups      Miller        19 OZ

There are 927 products in the products file. Looking at the number of factor levels, we see 872 product descriptions, 4 product categories (commodity) as we expected (pasta, pasta sauce, pancake mix, syrup), 131 brands etc.

3. Collapsing data using group_by() and summarize()

Suppose we wanted to calculate total sales for each product (as identified by the upc code). Right now each upc code appears in the transaction data many many times - note that we have only 927 products and over 5 million transactions. We want to aggregate to the product level. We can accomplish this using dplyr's group_by() and summarize() functions.

prodsales <- trans %>% 
  group_by(upc) %>% 
  summarize(tot_sales=sum(dollar_sales)) 
## `summarise()` ungrouping output (override with `.groups` argument)
glimpse(prodsales)
## Rows: 926
## Columns: 2
## $ upc       <chr> "0111112360", "0566300023", "0566300028", "0566300029", "...
## $ tot_sales <dbl> 184.68, 231.93, 3446.55, 2.98, 33.25, 36974.87, 12674.23,...
head(prodsales, n=3)
## # A tibble: 3 x 2
##   upc        tot_sales
##   <chr>          <dbl>
## 1 0111112360      185.
## 2 0566300023      232.
## 3 0566300028     3447.

The function summarize() effectively collapsed our transactions data to 926 observations. This is the number of unique upc codes in the transactions data. In the process of collapsing we asked R to calculate (sum()) the dollar sales variable by each upc code. For example we see that upc code 111112360 had total sales of 185 dollars.

4. Working with character variables: nchar() and paste0()

To make this data a bit more useful, let's merge in the information about these upc codes from the products table. Both tables have upc code allowing us to link them. However, upc is a character in trans, but a number in prods. This is because some of the upc codes in trans have zeros at the beginning, so read_csv read them as a characters (numbers don't start with zeros). These leading zeros were chopped off in the prods data, so read_csv read them as numbers. Since a UPC code should be a 10-digit code, rather than a number, let's fix prods by adding a leading zero to codes that are only 9-digit.

We will use function paste0() which joins two characters together. For example, the result of paste0("A", "B") is "AB". However, we only want add a zero done whereupc is missing the a zero, i.e. when upc is only 9 characters long. Therefore, we use nchar() function to count characters in a character variable and ifelse to add zeros only when the number of characters in upc is 9.

prods <- prods %>%
  mutate(upc= ifelse(nchar(as.character(upc))==9, paste0("0",upc), upc))
head(prods, n=3)
## # A tibble: 3 x 5
##   upc        product_description       commodity   brand         product_size
##   <chr>      <chr>                     <chr>       <chr>         <chr>       
## 1 0111112360 VINCENT S ORIG MARINARA S pasta sauce Vincent's     25 OZ       
## 2 0566300023 PINE MOUNTAIN SYRUP       syrups      Pine Mountain 40 OZ       
## 3 0566300028 MILLER CANE SYRUP         syrups      Miller        19 OZ

The collapsed transaction data has 926 observations whereas the products data set has 927. This is probably because we filtered out one of of the products when we dropped the negative sales. We will use inner_join() and thus not include this product in the merged data set. Let's also sort by sales so we can see the top selling products.

prodsales <- inner_join(prodsales,prods,by="upc")
prodsales <- arrange(prodsales, desc(tot_sales))
head(prodsales, n=3)
## # A tibble: 3 x 6
##   upc       tot_sales product_description     commodity   brand     product_size
##   <chr>         <dbl> <chr>                   <chr>       <chr>     <chr>       
## 1 36200002~   146484. RAGU TRADITIONAL PLAIN  pasta sauce Ragu      26 OZ       
## 2 51000025~   115527. PREGO REG SPAGHETTI SA~ pasta sauce Prego     26 OZ       
## 3 30000059~    98923. AUNT JEMIMA ORIGINL SY~ syrups      Aunt Jem~ 24 OZ

With about 146 thousand dollars in sales, Ragu Traditional Plain is the top seller in our data set.

5. More ways to summarize()

While collapsing data by group we can do more than just sum() another variable. For example, we can take an average (=mean()), count the number of transaction (=n()), get maximum or minimum (=max(), =min()), etc. Suppose we wanted to know the average unit price of each product and the number of transactions in each product.

prodsales <- trans %>% 
  group_by(upc) %>% 
  summarize(tot_sales=sum(dollar_sales), av_price=mean(dollar_sales/units), no_trans=n())
## `summarise()` ungrouping output (override with `.groups` argument)
prodsales <- inner_join(prodsales,prods,by="upc") %>%
  arrange(desc(av_price))
head(prodsales, n=3)
## # A tibble: 3 x 8
##   upc   tot_sales av_price no_trans product_descrip~ commodity brand
##   <chr>     <dbl>    <dbl>    <int> <chr>            <chr>     <chr>
## 1 6172~      12      12           1 KARO RED SYRUP   syrups    Karo 
## 2 2100~      68.9    11.5         6 KRAFT 5 PK MAC ~ pasta     Kraft
## 3 6971~      61.9     8.85        7 PATTY PIZZAILO ~ pasta sa~ Pats~
## # ... with 1 more variable: product_size <chr>

At the average price of 12 dollars, Karo Red Syrup is the most expensive product in our data set. Interestingly, it was purchased only once.

Let's plot the number of transactions against the average unit price.

ggplot(prodsales) + geom_point(aes(x=no_trans, y=av_price, color=commodity)) + 
  scale_y_continuous(trans = "log", breaks=c(1,2,4,10)) +
  scale_x_continuous(trans = "log", breaks=c(1,2,5,10,20,50,100,500,2000,10000,40000))

6. Grouping by more than one variable

By now, you could easily calculate sales of each brand (brands <- prodsales %>% group_by(brand) %>% summarize(tot_sale=sum(tot_sales). However, suppose you would like to calculate sales for each brand in each product category. This is easily done by simultanously grouping by brand and commodity.

brands <- prodsales %>% group_by(commodity, brand) %>% 
  summarize(sales=sum(tot_sales)) %>% 
  arrange(commodity, desc(sales)) 
## `summarise()` regrouping output by 'commodity' (override with `.groups` argument)
head(brands, n=3)
## # A tibble: 3 x 3
## # Groups:   commodity [1]
##   commodity     brand           sales
##   <chr>         <chr>           <dbl>
## 1 pancake mixes Aunt Jemima   206709.
## 2 pancake mixes Hungry Jack   111438.
## 3 pancake mixes Private Label  73620.

The data frame brands has all distinct combinations of brand and commodity. There are 151 of these combinations. I sort the data by commodity and -sales so that for each commodity the top selling brands appear at the top. We see that Aunt Jemima is the top selling brand of pancake mix. A couple rows down we see that Private Label is the top selling brand of pasta, etc.

IN-CLASS EXERCISE

  1. Load the NHL data and create win variable that equals one if the home team won and 0 otherwise.

  2. Calculate the number of wins for each team.



Exercises:

  1. Load in dunnhumby's transaction data and filter out negative sales.

  2. Create a data set that contains total spending by each household. What is average of spending across all households? What is the id of the highest spending household? How much did that household spend?

  3. In how many stores does a typical household shop? (Hint: this is a short question, but it requires a great deal of thought. There is about 500K households, some only shop in one store, some shop in more than one store. I want to know the average and the median of number of stores households shop at.)

  4. Load in store_lookup data from Nexus. It has information on the zip code each store is located in. Zipcode can have leading zeros (e.g. New Jersey and New Hampshire have zicodes that begin with zeros). Therefore, the proper way to store zipcodes is as a character rather than a number. Make sure that our zipcode variable is stored as a character. Also make sure that all your zipcodes are five-digit. If they are not, add leading zeros.

  5. Create a data set that contains total dollar value of sales and total volume of transactions (i.e. number of transactions) for each store. Merge this with the store_lookup data. Show the first 3 rows of this data.

  6. Are there zip codes that have more than one store? Aggregate your store data to the zip code level. Which zip code has the highest volume of sales? How much is that volume?

  7. Where are these stores? Use the US zip codes.csv from Nexus file to get the longitude and latitude of each zip code. Join it with your zip code level store data and plot the location of these zip codes using the leaflet() function from the leaflet package. Below is the syntax for plotting points in yourdata on a map. Good luck!

library(leaflet)
leaflet() %>% 
  addTiles() %>%
  addCircleMarkers(~LNG,~LAT, data=yourdata)