# Lab 13: Reshaping data

### Learning objectives

• using the pivot_wider() function to transform data from long to wide
• using the pivot_longer() function to transform data from wide to long
• applications of reshaping, including Census data from the tidycensus package

### 1. From long to wide: toy example

In this lab we will learn how to reshape data, that is, rearrange rows into columns and vice versa. Suppose we have the following data on sales. Each sale is recorded in a separate row. For each sale we know the date, the product sold, and the dollar value of the sale.

library(tidyverse)
long_df <- data.frame(date=c(1,1,2,3), product=c("a","b","a","b"), dollars=c(10,2,5,7))
long_df
##   date product dollars
## 1    1       a      10
## 2    1       b       2
## 3    2       a       5
## 4    3       b       7

Suppose that we would like the data to be arranged so that each row represents a date and the columns describe what was sold on that date. In other words, we want to pivot rows into columns. We can accomplish this using the pivot_wider() function from the tidyr package (a part of the tidyverse). The function takes three key arguments. First, it takes the original data. Second, names_from = specifies the column in the original data that we want to spread; the values of this variable will appear as column headings in the new data frame. The third argument, values_from =, specifies the column in the original data whose values will fill the cells of the new data frame.

wide_df <- long_df %>% pivot_wider(names_from = product, values_from = dollars)
wide_df
## # A tibble: 3 x 3
##    date     a     b
##   <dbl> <dbl> <dbl>
## 1     1    10     2
## 2     2     5    NA
## 3     3    NA     7

We transformed our long data frame into a shorter but wider one. It has the same information, only arranged differently.
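To confirm that no information was lost, we can pivot the wide toy data straight back to long form. A quick sketch (rebuilding the wide data under a new name so nothing is overwritten): the values_drop_na = TRUE argument drops the NA cells that pivot_wider() introduced, recovering the original four sales records.

```r
library(tidyverse)

# the wide toy data from above, rebuilt under a new name
toy_wide <- data.frame(date = c(1, 2, 3), a = c(10, 5, NA), b = c(2, NA, 7))

# pivot back to long; values_drop_na = TRUE drops the rows that would
# come from the NA cells
toy_wide %>%
  pivot_longer(c(a, b), names_to = "product", values_to = "dollars",
               values_drop_na = TRUE)
```

The result has the same four date-product-dollars rows as the original long_df, just sorted by date.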

### 2. From wide to long: toy example

Sometimes we need to go from wide data to long data, for example if GDP data is organized as follows.

wide_df <- data.frame(year=c(2014,2015), q1=c(10,16), q2=c(12,17), q3=c(13,18), q4=c(15,19))
wide_df
##   year q1 q2 q3 q4
## 1 2014 10 12 13 15
## 2 2015 16 17 18 19

If we wanted to plot the time series, we would first need to rearrange the data so that consecutive observations on GDP are in separate rows. We can accomplish this using the pivot_longer() function. It takes several arguments. First, the data that is to be rearranged. Second, the list of columns that need to be reshaped. The third argument, names_to =, specifies the name of a newly created variable that will contain the names of the reshaped columns. Finally, the values_to = argument specifies the name of the new variable that will contain the values.

long_df <- wide_df %>% pivot_longer(c("q1","q2","q3","q4"), names_to = "quarter", values_to = "gdp") #list the columns explicitly
long_df <- wide_df %>% pivot_longer(q1:q4, names_to = "quarter", values_to = "gdp") #equivalent: a range of columns
long_df <- wide_df %>% pivot_longer(-year, names_to = "quarter", values_to = "gdp") #equivalent: every column except year
long_df
## # A tibble: 8 x 3
##    year quarter   gdp
##   <dbl> <chr>   <dbl>
## 1  2014 q1         10
## 2  2014 q2         12
## 3  2014 q3         13
## 4  2014 q4         15
## 5  2015 q1         16
## 6  2015 q2         17
## 7  2015 q3         18
## 8  2015 q4         19

IN-CLASS EXERCISE 1: Turn the GDP data back from long to wide.

IN-CLASS EXERCISE 2: The data frame below has information on stock prices for three companies. Put the stock prices for all companies in one column along with a day column and a variable that indicates which company the stock price belongs to.

stocks <- data.frame(day=c(1,2), aapl=c(100,110),fb=c(40,30),goog=c(500,600))
stocks
##   day aapl fb goog
## 1   1  100 40  500
## 2   2  110 30  600

### 3. From long to wide: real example

Suppose we wanted to find out if pancake mix tends to be purchased with syrups, and pasta with pasta sauce. We would like to know which products tend to be in the shopping basket together. Right now our data is organized by transaction. It may be easier for us to work with the data if it is organized by basket instead. Let's load in the transaction data and merge in product category and brand information.

trans <- read_csv("../dh_transactions.csv")
prods <- read_csv("../dh_product_lookup.csv") #product lookup table; adjust the file name to match your copy
trans <- filter(trans, dollar_sales>0)
prods$upc <- as.character(prods$upc) #need to fix upc: restore the leading zero dropped from 9-digit codes
prods$upc <- ifelse(nchar(prods$upc)==9, paste0("0",prods$upc), prods$upc)
trans <- inner_join(trans, select(prods, upc, commodity, brand), by="upc")

As we know, the transaction data contains about 900 different products. Since we are interested in which product categories appear together rather than which specific products, let's simplify the range of products by aggregating up to the product category level. For example, if a customer bought two pasta sauces in one basket, we will add up the cost of the two sauces and treat it as one purchase.

temp <- trans %>% group_by(household, basket, commodity) %>% summarize(spent = sum(dollar_sales))
## summarise() regrouping output by 'household', 'basket' (override with .groups argument)
head(temp, n=4)
## # A tibble: 4 x 4
## # Groups:   household, basket [3]
##   household  basket commodity     spent
##       <dbl>   <dbl> <chr>         <dbl>
## 1         1  458810 pancake mixes  1.79
## 2         1  458810 syrups         1.5
## 3         1 1424208 pasta sauce    3.22
## 4         1 1512312 pancake mixes  1.99

We created a data frame which we called temp (for temporary). We also note that aggregating to the product category level did not reduce the number of observations dramatically. This is because buying two kinds of pasta sauce, or two kinds of syrup, in one basket is relatively rare.
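That claim is easy to check. A quick sketch, reusing the trans data loaded above: count the rows in each household-basket-commodity combination, then tabulate those counts. Counts of 2 or more (the same category bought twice in one basket) should be rare.

```r
# how often does the same commodity appear more than once in one basket?
trans %>%
  count(household, basket, commodity) %>%  # rows per basket-commodity combination
  count(n)                                 # distribution of those counts
```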

Now we are ready to reshape our data set. We start with the temp data set. The variable we want to 'spread' is commodity, i.e. we want pancake mixes, syrups, pasta sauce and pasta to become new variables. And we want the total spent on these categories to be the values of these variables.

basket <- temp %>% pivot_wider(names_from=commodity, values_from=spent, values_fill=0)
head(basket, n=4)
## # A tibble: 4 x 6
## # Groups:   household, basket [4]
##   household  basket pancake mixes syrups pasta sauce pasta
##       <dbl>   <dbl>           <dbl>  <dbl>         <dbl> <dbl>
## 1         1  458810            1.79   1.5           0        0
## 2         1 1424208            0      0             3.22     0
## 3         1 1512312            1.99   0             0        0
## 4         1 1518809            0      1.19          0        0

We see that there are over three million baskets. The first basket in our data had $1.79 worth of pancake mixes, $1.50 worth of syrups, and no pasta or pasta sauce. Notice that if a basket did not contain a product category, the value is equal to zero. We set that using the values_fill= option in the pivot_wider() function above. Otherwise, those cells would contain NA.
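The same option works on the toy sales data from section 1. A small sketch (rebuilding that data frame under a new name): with values_fill = 0, the NAs we saw earlier become zeros.

```r
library(tidyverse)

# the toy sales data from section 1, rebuilt under a new name
toy <- data.frame(date = c(1, 1, 2, 3), product = c("a", "b", "a", "b"),
                  dollars = c(10, 2, 5, 7))

# values_fill = 0 puts a zero where there was no sale, instead of NA
toy %>% pivot_wider(names_from = product, values_from = dollars, values_fill = 0)
```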

### 4. Working with wide data

Now that we have the data at the basket level, we can see whether different product categories appear together. Since at this point we are not interested in the value of sales, but rather whether or not a product category is in the basket, let's create indicator variables for each product category.

basket <- basket %>%
mutate(pancakey = ifelse(`pancake mixes` > 0,"yes","no"), #column names with spaces need backticks
pastay = ifelse(pasta > 0,"yes","no"),
saucey = ifelse(`pasta sauce` > 0,"yes","no"),
syrupy = ifelse(syrups > 0,"yes","no"))
head(basket, n=4)
## # A tibble: 4 x 10
## # Groups:   household, basket [4]
##   household basket pancake mixes syrups pasta sauce pasta pancakey pastay
##       <dbl>  <dbl>           <dbl>  <dbl>         <dbl> <dbl> <chr>    <chr>
## 1         1 4.59e5            1.79   1.5           0        0 yes      no
## 2         1 1.42e6            0      0             3.22     0 no       no
## 3         1 1.51e6            1.99   0             0        0 yes      no
## 4         1 1.52e6            0      1.19          0        0 no       no
## # ... with 2 more variables: saucey <chr>, syrupy <chr>

Let's use the crosstab() function from the descr package to see the association of different product categories.

library(descr)
crosstab(basket$saucey, basket$pastay, prop.t = TRUE, prop.c = TRUE, prop.r=TRUE, plot = FALSE)
##    Cell Contents
## |-------------------------|
## |                   Count |
## |             Row Percent |
## |          Column Percent |
## |           Total Percent |
## |-------------------------|
##
## ============================================
##                  basket$pastay
## basket$saucey         no       yes     Total
## --------------------------------------------
## no               631203   1139818   1771021
##                    35.6%     64.4%     53.4%
##                    46.5%     58.2%
##                    19.0%     34.4%
## --------------------------------------------
## yes              725120    818343   1543463
##                    47.0%     53.0%     46.6%
##                    53.5%     41.8%
##                    21.9%     24.7%
## --------------------------------------------
## Total           1356323   1958161   3314484
##                    40.9%     59.1%
## ============================================

We see that pasta was in 1,958,161 baskets, and of those baskets 818,343 also had sauce in them. So, given that a basket has pasta, the chances that it also has sauce are about 42%.
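That conditional probability can also be computed directly from the basket data. A sketch, continuing from the indicator variables created above: keep only baskets containing pasta, then take the share that also contain sauce.

```r
# P(sauce in basket | pasta in basket)
basket %>%
  ungroup() %>%                 # drop the household/basket grouping first
  filter(pastay == "yes") %>%   # keep baskets that contain pasta
  summarize(p_sauce = mean(saucey == "yes"))
```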

IN-CLASS EXERCISE 3: Is Buddy the Elf out there? Given that a basket has pasta, what are the chances that it also has syrup?

EXTENDED EXERCISE: Is there a way to visualize associations of different products? Different brands?

### 5. From long to wide: another example

Sometimes reshaping data is necessary to create a good visualization. For example, in our transaction data we know that some customers used coupons. Suppose we want to compare average prices when a coupon was used to average prices when a coupon was not used. Let's summarize the transaction data by upc code and by whether or not a coupon was used, calculating the average price for each combination.

prods <- trans %>% group_by(upc, coupon) %>% summarize(av_price=mean(dollar_sales/units))
## summarise() regrouping output by 'upc' (override with .groups argument)

It is possible that not all products had a coupon. If a product had a coupon, it would be in our data set twice; if it did not, it would be there only once. So we can use mutate() to calculate how many times each upc code appears in the data and filter out those that appear only once.

prods <- prods %>%
group_by(upc) %>%
mutate(count=n()) %>%
filter(count==2) %>%
select(-count)
head(prods, n=4)
## # A tibble: 4 x 3
## # Groups:   upc [2]
##   upc        coupon av_price
##   <chr>       <dbl>    <dbl>
## 1 1160000004      0    2.79
## 2 1160000004      1    2.05
## 3 1330052501      0    0.690
## 4 1330052501      1    0.328

Right now we have a data set where each product appears twice, once with the average price when a coupon was used (coupon=1) and once with the average price when a coupon was not used (coupon=0). Suppose that we want to plot average prices with a coupon against average prices with no coupon. To do this we would need to reshape the data so that each product appears in the data only once, but has two new columns - one with the no-coupon price and one with the coupon price.

prods <- prods %>% pivot_wider(names_from = coupon, values_from = av_price)
prods <- rename(prods, nocoupon_price = `0`, coupon_price = `1`) #the new columns are named "0" and "1" after the coupon values, so backticks are needed
head(prods, n=4)
## # A tibble: 4 x 3
## # Groups:   upc [4]
##   upc        nocoupon_price coupon_price
##   <chr>               <dbl>        <dbl>
## 1 1160000004          2.79         2.05
## 2 1330052501          0.690        0.328
## 3 1480071124          2.27         1.55
## 4 1510000001          0.686        0.235

Now we are ready to plot the data.

ggplot(prods) + geom_point(aes(x=nocoupon_price, y=coupon_price)) +
geom_abline(intercept=0, slope=1, color="red") #add a 45-degree line where the two prices would be equal

As expected, in the vast majority of cases coupon prices are lower than no-coupon prices. Interestingly, coupons appear to lower the price by a bit less than a dollar. This is true for products priced in the one-to-two-dollar range as well as for more expensive ones.
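We can put a rough number on that observation. A sketch, continuing from the reshaped prods data: compute each product's coupon discount and average it across products.

```r
# average dollar discount from using a coupon, across products
prods %>%
  ungroup() %>%
  mutate(discount = nocoupon_price - coupon_price) %>%  # per-product discount
  summarize(avg_discount = mean(discount, na.rm = TRUE))
```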

### 6. Advanced reshaping

Suppose you had the following data on population and GDP for two countries in two years. Reshape this data so that each country's population and each country's GDP are in separate columns, and each year is in a separate row. Give the new variables informative names.

data <- data.frame(year=c(1,1,2,2), country=c("a","b","a","b"), pop=c(1,2,3,4), gdp=c(5,6,7,8))
data
##   year country pop gdp
## 1    1       a   1   5
## 2    1       b   2   6
## 3    2       a   3   7
## 4    2       b   4   8

Since we have two sets of values that we would like to spread, pop and gdp, we can specify that the values come from two columns.

data_wide <- data %>% pivot_wider(names_from=country, values_from =c("pop", "gdp"))
data_wide
## # A tibble: 2 x 5
##    year pop_a pop_b gdp_a gdp_b
##   <dbl> <dbl> <dbl> <dbl> <dbl>
## 1     1     1     2     5     6
## 2     2     3     4     7     8

If we want the names to be informative, we can use the names_glue= option.

data_wide <- data %>% pivot_wider(names_from = country, values_from = c("pop", "gdp"),
names_glue="{.value}_of_country_{country}")
data_wide
## # A tibble: 2 x 5
##    year pop_of_country_a pop_of_country_b gdp_of_country_a gdp_of_country_b
##   <dbl>            <dbl>            <dbl>            <dbl>            <dbl>
## 1     1                1                2                5                6
## 2     2                3                4                7                8
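This wide format can also be reversed in one step. pivot_longer() recognizes a special ".value" marker: when names_to contains ".value", the matched piece of each column name becomes an output column (pop, gdp) rather than a cell value. A sketch using names_pattern to split names like pop_of_country_a:

```r
# reverse the glue-named wide data: one row per year-country,
# with pop and gdp as separate columns again
data_wide %>%
  pivot_longer(-year,
               names_to = c(".value", "country"),
               names_pattern = "(pop|gdp)_of_country_(.)")
```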

### 7. Another application of reshaping

The tidycensus package makes accessing Census data very easy. You can specify which variables you need, for which years, and for which geographies (zip codes, congressional districts, counties, states, etc.). You need to request your own API key from the Census Bureau website.

library(tidycensus)
census_api_key("your Census API key")

#you need to know which Census program you want to use
#the American Community Survey (ACS) has the most comprehensive demographic, social and economic statistics
#the load_variables() function lets us browse available variables; let's use the ACS 5-year estimates
vars <- load_variables(2018, "acs5/profile")
#another way to browse available variables is to look through the data profile variables
#https://api.census.gov/data/2017/acs/acs5/profile/groups.html

#function get_acs() actually gets the data
acs <- get_acs(geography = "county",
variables = c("DP05_0001", "DP05_0009", "DP05_0037","DP03_0062"),
year = 2018)
head(acs)

As we can see, get_acs() returns each variable value on a separate row. In most analyses we probably want each variable in a separate column, with one row per county.

acs_reshaped <- acs %>%
select(NAME, variable, estimate) %>%
pivot_wider(values_from = estimate, names_from = variable) %>%
rename(population = DP05_0001,
pop20_24 = DP05_0009,
pop_white = DP05_0037,
med_income = DP03_0062)

head(acs_reshaped)
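With one row per county, ordinary dplyr verbs apply directly. For example (a sketch; it assumes the get_acs() call above succeeded with your API key), we can list the counties with the highest median income:

```r
# counties sorted by median income, highest first
acs_reshaped %>%
  arrange(desc(med_income)) %>%
  head()
```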

### Exercises:

1. Suppose you have the following data on population for two countries for two years. Reshape this data so that each country's population is in a separate column, and each year is on just one row.
data <- data.frame(year=c(1,1,2,2), country=c("a","b","a","b"), pop=c(1,2,3,4))
data
##   year country pop
## 1    1       a   1
## 2    1       b   2
## 3    2       a   3
## 4    2       b   4
2. Convert your result from above back to a long data frame, i.e. back to its original shape.

3. Run the code below. Would you say that the resulting data frame is wide or long?

#put tickers of a bunch of stocks in a vector
stocks <- c("FB","AMZN","AAPL", "GE") #Facebook, Amazon, Apple, General Electric
data <- data.frame() #create an empty data frame
#make a loop that grabs the tickers from the 'stocks' vector and gets the data from yahoo finance
for (i in 1:length(stocks)) {
temp <- read_csv(paste("https://www.quandl.com/api/v3/datasets/WIKI/",stocks[i],".csv", sep=""))
temp$stock <- stocks[i] #create a key that is the stock ticker
temp <- temp %>% rename(close=`Adj. Close`) %>% #rename Adj. Close (backticks needed because of the space)
select(Date, stock, close) #keep only Date, stock and close
#append temp to data, this will gradually stack data for each stock on top of each other
data <- bind_rows(data,temp)
}

#data <- filter(data, Date>as.Date("2003-01-01")) #trim the time frame
4. Use the above data frame to create a time plot of stock prices of all four companies in one graph. Use a log scale on the y-axis. (Hint: You can map the color aesthetic to stock.)
5. Calculate the correlation coefficients among the four stock prices. (Hint 1: Use the function cor(), which takes as an argument a data frame of numerical variables. Hint 2: Use the option use="pairwise.complete.obs" to tell the function to ignore missing values in each pair of variables.) Which pair of stock prices has the lowest correlation coefficient? Which has the highest? Does that make sense to you?
6. Use the corrplot package to visualize the correlations. See its documentation. (Hint: You need to put the output of cor() into a matrix object, like this: correlations <- cor(...).)