pivot_wider() function to transform a data frame from long to wide
pivot_longer() function to transform a data frame from wide to long
tidycensus package to access Census data
In this lab we will learn how to reshape data. This means rearranging rows into columns and vice versa. Suppose we have the following data on sales. Each sale is recorded in a separate row. For each sale we know the date, the product sold, and the dollar value of the sale.
library(tidyverse)
long_df <- data.frame(date=c(1,1,2,3), product=c("a","b","a","b"), dollars=c(10,2,5,7))
long_df
## date product dollars
## 1 1 a 10
## 2 1 b 2
## 3 2 a 5
## 4 3 b 7
Suppose that we would like the data to be arranged so that each row represents a date and the columns describe what was sold on that date. In other words, we want to pivot rows into columns. We can accomplish this using the pivot_wider() function in the tidyr package (a part of the tidyverse). The function takes three key arguments. First, it takes the original data. Second, names_from = specifies the column in the original data that we want to spread; the values of this variable will appear as column headings in the new data frame. The third argument, values_from =, specifies the column in the original data whose values will fill the cells of the new data frame.
wide_df <- long_df %>% pivot_wider(names_from = product, values_from = dollars)
wide_df
## # A tibble: 3 x 3
## date a b
## <dbl> <dbl> <dbl>
## 1 1 10 2
## 2 2 5 NA
## 3 3 NA 7
We transformed our long data frame into a shorter but wider one. It has the same information, only arranged differently. Notice the NA values: product b was not sold on date 2, and product a was not sold on date 3.
Sometimes we need to go from wide data to long data, for example if GDP data is organized as follows.
wide_df <- data.frame(year=c(2014,2015), q1=c(10,16), q2=c(12,17), q3=c(13,18), q4=c(15,19))
wide_df
## year q1 q2 q3 q4
## 1 2014 10 12 13 15
## 2 2015 16 17 18 19
If we wanted to plot the time series, we would first need to rearrange the data so that consecutive observations on GDP are in separate rows. We can accomplish this using the pivot_longer() function. It takes several arguments. First, the data that is to be rearranged. Second, the list of columns that need to be reshaped. The third argument, names_to =, specifies the name of a newly created variable that holds the names of the reshaped columns. Finally, the values_to = argument specifies the name of the new variable that will contain the values. The three calls below are equivalent ways of selecting the columns to reshape:
long_df <- wide_df %>% pivot_longer(c("q1","q2","q3","q4"), names_to = "quarter" , values_to = "gdp") #name the columns explicitly
long_df <- wide_df %>% pivot_longer(q1:q4, names_to = "quarter" , values_to = "gdp") #specify a range of columns
long_df <- wide_df %>% pivot_longer(-year, names_to = "quarter" , values_to = "gdp") #every column except year
long_df
## # A tibble: 8 x 3
## year quarter gdp
## <dbl> <chr> <dbl>
## 1 2014 q1 10
## 2 2014 q2 12
## 3 2014 q3 13
## 4 2014 q4 15
## 5 2015 q1 16
## 6 2015 q2 17
## 7 2015 q3 18
## 8 2015 q4 19
IN-CLASS EXERCISE 1: Turn the GDP data back from long to wide.
IN-CLASS EXERCISE 2: The data frame below has information on stock prices for three companies. Put the stock prices for all companies in one column along with a day column and a variable that indicates which company the stock price belongs to.
stocks <- data.frame(day=c(1,2), aapl=c(100,110),fb=c(40,30),goog=c(500,600))
stocks
## day aapl fb goog
## 1 1 100 40 500
## 2 2 110 30 600
Suppose we wanted to find out if pancake mix tends to be purchased with syrups, and pasta with pasta sauce. We would like to know which products tend to be in the shopping basket together. Right now our data is organized by transaction. It may be easier for us to work with the data if it is organized by basket instead. Let's load in the transaction data and merge in product category and brand information.
trans <- read_csv("../dh_transactions.csv")
trans <- filter(trans, dollar_sales>0)
prods <- read_csv("../dh_product_lookup.csv")
prods$upc <- as.character(prods$upc) #upc codes were read as numbers, which drops leading zeros
prods$upc <- ifelse(nchar(prods$upc)==9, paste0("0",prods$upc), prods$upc) #pad 9-digit codes back to 10 digits
trans <- inner_join(trans, select(prods,upc, commodity, brand), by="upc")
As we know, the transaction data contains about 900 different products. Since we are interested in which product categories appear together, rather than which specific products, let's simplify the range of products by aggregating up to the product category. For example, if a customer bought two pasta sauces in one basket, we will add up the cost of the two sauces and treat it as one purchase.
temp <- trans %>% group_by(household, basket, commodity) %>% summarize(spent = sum(dollar_sales))
## `summarise()` regrouping output by 'household', 'basket' (override with `.groups` argument)
head(temp, n=4)
## # A tibble: 4 x 4
## # Groups: household, basket [3]
## household basket commodity spent
## <dbl> <dbl> <chr> <dbl>
## 1 1 458810 pancake mixes 1.79
## 2 1 458810 syrups 1.5
## 3 1 1424208 pasta sauce 3.22
## 4 1 1512312 pancake mixes 1.99
We created a data frame which we called temp (for temporary). We also note that aggregating to the product category level did not reduce the number of observations dramatically. This is because buying two kinds of pasta sauce, or two kinds of syrup, in one basket is relatively rare.
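A quick way to check this (a minimal sketch, assuming the trans and temp data frames created above) is to compare row counts before and after aggregating:
nrow(trans) #number of item-level transaction rows
nrow(temp) #number of basket-by-category rows after aggregating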
Now we are ready to reshape our data set. We start with the temp data set. The variable we want to 'spread' is commodity; i.e., we want pasta, pasta sauce, pancake mixes and syrups to become new variables. And we want the total spent on these categories to be the values in these variables.
basket <- temp %>% pivot_wider(names_from=commodity, values_from = spent, values_fill=0)
head(basket, n=4)
## # A tibble: 4 x 6
## # Groups: household, basket [4]
## household basket `pancake mixes` syrups `pasta sauce` pasta
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 458810 1.79 1.5 0 0
## 2 1 1424208 0 0 3.22 0
## 3 1 1512312 1.99 0 0 0
## 4 1 1518809 0 1.19 0 0
We see that there are over three million baskets. The first basket in our data had $1.79 worth of pancake mixes, $1.50 worth of syrups, and no pasta or pasta sauce. Notice that if a basket did not contain a product category, the value is equal to zero. We set that using the values_fill= option in the pivot_wider() function above. Otherwise, the cells would contain NA, as in our first pivot_wider() example.
Now that we have the data at the basket level, we can see whether different product categories appear together. Since at this point we are not interested in the value of sales, but rather whether or not a product category is in the basket, let's create indicator variables for each product category.
basket <- basket %>%
mutate(pancakey = ifelse(`pancake mixes` > 0,"yes","no"),
pastay = ifelse(pasta > 0,"yes","no"),
saucey = ifelse(`pasta sauce` > 0,"yes","no"),
syrupy = ifelse(syrups > 0,"yes","no"))
head(basket, n=4)
## # A tibble: 4 x 10
## # Groups: household, basket [4]
## household basket `pancake mixes` syrups `pasta sauce` pasta pancakey pastay
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 1 4.59e5 1.79 1.5 0 0 yes no
## 2 1 1.42e6 0 0 3.22 0 no no
## 3 1 1.51e6 1.99 0 0 0 yes no
## 4 1 1.52e6 0 1.19 0 0 no no
## # ... with 2 more variables: saucey <chr>, syrupy <chr>
Let's use the crosstab() function from the descr package to see the association between different product categories.
library(descr)
crosstab(basket$saucey, basket$pastay, prop.t = TRUE, prop.c = TRUE, prop.r=TRUE, plot = FALSE)
## Cell Contents
## |-------------------------|
## | Count |
## | Row Percent |
## | Column Percent |
## | Total Percent |
## |-------------------------|
##
## ============================================
## basket$pastay
## basket$saucey no yes Total
## --------------------------------------------
## no 631203 1139818 1771021
## 35.6% 64.4% 53.4%
## 46.5% 58.2%
## 19.0% 34.4%
## --------------------------------------------
## yes 725120 818343 1543463
## 47.0% 53.0% 46.6%
## 53.5% 41.8%
## 21.9% 24.7%
## --------------------------------------------
## Total 1356323 1958161 3314484
## 40.9% 59.1%
## ============================================
We see that pasta was in 1,958,161 baskets, and of those baskets 818,343 also had sauce in them. So, given that a basket has pasta, the chances that it also has sauce are about 42%.
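We can also compute this conditional share directly from the basket data (a minimal sketch, assuming the basket data frame created above; share_with_sauce is a hypothetical name):
#among baskets that contain pasta, the share that also contain pasta sauce
basket %>%
  ungroup() %>%
  filter(pastay == "yes") %>%
  summarize(share_with_sauce = mean(saucey == "yes"))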
IN-CLASS EXERCISE 3: Is Buddy the Elf out there? Given that a basket has pasta, what are the chances that it also has syrup?
EXTENDED EXERCISE: Is there a way to visualize associations of different products? Different brands?
Sometimes reshaping data is necessary to create a good visualization. For example, in our transaction data we know that some customers used coupons. Suppose we want to compare average prices when a coupon was used to average prices when a coupon was not used. Let's summarize the transaction data by upc code and by whether or not a coupon was used, calculating the average price for each combination.
prods <- trans %>% group_by(upc, coupon) %>% summarize(av_price=mean(dollar_sales/units))
## `summarise()` regrouping output by 'upc' (override with `.groups` argument)
It is possible that not all products had a coupon. If a product had a coupon, it would be in our data set twice; if it did not, it would be there only once. So we can use mutate() to calculate how many times each upc code is in the data and filter out those that appear only once.
prods <- prods %>%
group_by(upc) %>%
mutate(count=n()) %>%
filter(count==2) %>%
select(-count)
head(prods, n=4)
## # A tibble: 4 x 3
## # Groups: upc [2]
## upc coupon av_price
## <chr> <dbl> <dbl>
## 1 1160000004 0 2.79
## 2 1160000004 1 2.05
## 3 1330052501 0 0.690
## 4 1330052501 1 0.328
Right now we have a data set where each product appears twice, once with the average price when a coupon was used (coupon=1) and once with the average price when a coupon was not used (coupon=0). Suppose that we want to plot average prices with a coupon against average prices with no coupon. To do this we would need to reshape the data so that each product appears in the data only once, but with two new columns: one that shows the no-coupon price and one that shows the coupon price.
prods <- prods %>% pivot_wider(names_from = coupon, values_from = av_price)
prods <- rename(prods, coupon_price = `1`, nocoupon_price = `0`)
head(prods, n=4)
## # A tibble: 4 x 3
## # Groups: upc [4]
## upc nocoupon_price coupon_price
## <chr> <dbl> <dbl>
## 1 1160000004 2.79 2.05
## 2 1330052501 0.690 0.328
## 3 1480071124 2.27 1.55
## 4 1510000001 0.686 0.235
Now we are ready to plot the data.
ggplot(prods) + geom_point(aes(x=nocoupon_price, y=coupon_price)) +
geom_line(aes(y=coupon_price,x=coupon_price), color="red") #add a 45-degree reference line (y = x)
As expected, in the vast majority of cases coupon prices are lower than no-coupon prices. Interestingly, coupons appear to lower the price by a bit less than a dollar. This is true for products priced in the one-to-two-dollar range as well as for more expensive ones.
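We can put a rough number on this discount (a minimal sketch, assuming the prods data frame created above; avg_discount is a hypothetical name):
#average gap between the no-coupon and coupon prices across products
prods %>%
  ungroup() %>%
  summarize(avg_discount = mean(nocoupon_price - coupon_price, na.rm = TRUE))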
Suppose you had the following data on population and GDP for two countries in two years. Reshape this data so that each country's population and each country's GDP are in separate columns, and each year is in a separate row. Give the new variables informative names.
data <- data.frame(year=c(1,1,2,2), country=c("a","b","a","b"), pop=c(1,2,3,4), gdp=c(5,6,7,8))
data
## year country pop gdp
## 1 1 a 1 5
## 2 1 b 2 6
## 3 2 a 3 7
## 4 2 b 4 8
Since we have two sets of values that we would like to spread, pop and gdp, we can specify that the values come from two columns.
data_wide <- data %>% pivot_wider(names_from=country, values_from =c("pop", "gdp"))
data_wide
## # A tibble: 2 x 5
## year pop_a pop_b gdp_a gdp_b
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 2 5 6
## 2 2 3 4 7 8
If we want the names to be informative, we can use the names_glue= option.
data_wide <- data %>% pivot_wider(names_from = country, values_from = c("pop", "gdp"),
names_glue="{.value}_of_country_{country}")
data_wide
## # A tibble: 2 x 5
## year pop_of_country_a pop_of_country_b gdp_of_country_a gdp_of_country_b
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 2 5 6
## 2 2 3 4 7 8
The tidycensus package makes accessing Census data very easy. You can specify which variables you need, for which years, and for what geographies (zip codes, congressional districts, counties, states, etc.). You need to request your own API key from the Census Bureau (https://api.census.gov/data/key_signup.html).
library(tidycensus)
census_api_key("your Census API key")
#you need to know which Census program you want to use
#American Community Survey (ACS) has the most comprehensive demographic, social and economic statistics
#the load_variables() function lets us browse available variables; let's use the ACS 5-year estimates
vars <- load_variables(2018, "acs5/profile")
#another way to browse available variables is to look through the data profile variables
#https://api.census.gov/data/2017/acs/acs5/profile/groups.html
#function get_acs() actually gets the data
acs <- get_acs(geography = "county",
variables = c("DP05_0001", "DP05_0009", "DP05_0037","DP03_0062"),
year = 2018)
head(acs)
As we can see, get_acs() gives us each variable value on a separate row. In most analyses we probably need each variable in a separate column, with one row for each county.
acs_reshaped <- acs %>%
select(NAME,variable, estimate) %>%
pivot_wider(values_from = estimate,names_from = variable) %>%
rename(population=DP05_0001,
pop20_24=DP05_0009,
pop_white =DP05_0037,
med_income = DP03_0062)
head(acs_reshaped)
Reshape the data below so that each country's population is in a separate column.
data <- data.frame(year=c(1,1,2,2), country=c("a","b","a","b"), pop=c(1,2,3,4))
data
## year country pop
## 1 1 a 1
## 2 1 b 2
## 3 2 a 3
## 4 2 b 4
Convert your result from above back to a long data frame, i.e. back to its original shape.
Run the code below. Would you say that the resulting data frame is wide or long?
#put tickers of a bunch of stocks in a vector
stocks <- c("FB","AMZN","AAPL", "GE") #facebook, google, apple, united airlines
data <- data.frame() #create an empty data frame
#make a loop that grabs the tickers from the 'stocks' vector and gets the data from yahoo finance
for (i in seq_along(stocks)) {
temp <- read_csv(paste("https://www.quandl.com/api/v3/datasets/WIKI/",stocks[i],".csv", sep=""))
temp$stock <- stocks[i] #create a key that is the stock ticker
temp <- temp %>% rename(close=`Adj. Close`) %>% #rename adj. close
select(Date, stock, close) #keep only Date, stock and Adj.Close
#append temp to data, this will gradually stack data for each stock on top of each other
data <- bind_rows(data,temp)
}
#an alternative online source of stock data
#temp <- read_csv(paste0("https://query1.finance.yahoo.com/v7/finance/download/",stocks[i],"?period1=1&period2=2000000000&interval=1d&events=history"),
# col_types=cols(Date=col_date(),
# `Adj Close`=col_double()))
#data <- filter(data, Date>as.Date("2003-01-01")) #trim the time frame
Use the above data frame to create a time plot of stock prices of all four companies in one graph. Use a log scale on the y-axis. (Hint: You can map the color aesthetic to stock.)
Turn your data from long to wide with stock prices for each company as separate columns. Do you have any missing values? Why?
Calculate the correlation coefficients among the four stock prices. (Hint 1: Use the function cor(), which takes as an argument a data frame of numerical variables. Hint 2: Use the option use="pairwise.complete.obs" to tell the function to ignore missing values in each pair of variables.) Which pair of stock prices has the lowest correlation coefficient? Which has the highest? Does that make sense to you?
Use the corrplot package to visualize the correlations. See the documentation here. (Hint: You need to put the output of cor() into a matrix object, like this: correlations <- cor(...).)
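The call pattern might look like this (a minimal sketch; stocks_wide is an assumed name for your wide data frame from the earlier question, with a Date column and one price column per ticker):
library(corrplot)
#correlation matrix of the price columns, ignoring missing values pairwise
correlations <- cor(select(stocks_wide, -Date), use = "pairwise.complete.obs")
corrplot(correlations) #visualize the correlation matrix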