This document illustrates data manipulation and regression estimation in R. It accompanies Sample Annotated Paper in Econometrics published by the Journal of Economic Education in 2007. The sample annotated paper illustrated the structure of an econometrics paper: the sections, presentation of descriptive statistics, results etc. In contrast, this document illustrates how the data for an empirical paper can be put together. It shows the process through which empirical researchers retrieve, manipulate and analyze data. For more on using R Markdown in teaching econometrics see this paper.
We need data two pieces of information: pay inequality and team performance. The USA Today has data on salaries of individual players in the MLB. Baseball-reference.com has data on team performance. All of this data is also compiled in Lahman database which has standardized player and team names, and provides the data in a convenient .csv
format. We will use two of the Laham files, Salaries.csv
and Teams.csv
, as our sources.
library(tidyverse)
#salaries <- read_csv("https://github.com/chadwickbureau/baseballdatabank/blob/master/core/Salaries.csv?raw=True")
#teams <- read_csv("https://github.com/chadwickbureau/baseballdatabank/blob/master/core/Teams.csv?raw=True")
salaries <- read_csv("Salaries.csv")
teams <- read_csv("Teams.csv")
The teams data set has information on each team’s performance for each season. It includes variables on the number of games played, G
and the number of games won, W
. Let’s create a new variable winpct
which is the percentage of games won.
teams <- mutate(teams, winpct = W/G*100)
The teams data set goes back to the 19th century. Let’s limit our data to after 2010.
teams <- filter(teams, yearID>2010)
To simplify our data set, let’s only keep the variables that we will use in this study.
teams <- select(teams, yearID, teamID, winpct)
We have data on individual players’ salaries, but we need to calculate the share of the top 20% in the total payroll. This requires us to identify to which quintile a player’s salary belongs to within each team and each season. We can use function ntile()
which creates a new variable indicating which n-tile the observation falls into. For example, ntile(x,10)
would tell us which decile each observation in x belongs to. In addition, we need to calculate total payroll (by summing up salaries) for each team and season. Let’s also divide payroll by one million so that it is measured in millions.
salaries <- salaries %>%
group_by(yearID, teamID) %>%
mutate(payroll=sum(salary)/1000000, pctile=ntile(salary,5))
head(salaries, n=3)
## # A tibble: 3 x 7
## # Groups: yearID, teamID [1]
## yearID teamID lgID playerID salary payroll pctile
## <int> <chr> <chr> <chr> <int> <dbl> <int>
## 1 1985 ATL NL barkele01 870000 14.8 5
## 2 1985 ATL NL bedrost01 550000 14.8 3
## 3 1985 ATL NL benedbr01 545000 14.8 2
Let’s now focus on players in the top (the 5th) quintile and add up their salaries. We can then divide their combined salary by total payroll to obtain top20_share
.
salaries_by_team <- salaries %>%
filter(pctile == 5) %>%
group_by(yearID, teamID, payroll) %>%
summarize(top20 = sum(salary)/1000000) %>%
mutate(top20_share = top20/payroll*100)
head(salaries_by_team, n=3)
## # A tibble: 3 x 5
## # Groups: yearID, teamID [3]
## yearID teamID payroll top20 top20_share
## <int> <chr> <dbl> <dbl> <dbl>
## 1 1985 ATL 14.8 5.35 36.1
## 2 1985 BAL 11.6 4.09 35.4
## 3 1985 BOS 10.9 4.57 41.9
Now we have two team/season level data that need to be joined together using yearID
and teamID
as keys to match the relevant observations. We will use inner_join
to focus on teams and seasons that exist in both data sets.
teams <- inner_join(teams, salaries_by_team, by=c("yearID", "teamID"))
head(teams, n=3)
## # A tibble: 3 x 6
## yearID teamID winpct payroll top20 top20_share
## <int> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2011 ARI 58.0 53.6 25.7 47.8
## 2 2011 ATL 54.9 87.0 54.1 62.2
## 3 2011 BAL 42.6 85.3 41.9 49.1
Let’s create a descriptive statistics table of the relevant variables.
library(stargazer)
stargazer(as.data.frame(select(teams,winpct, payroll,top20_share)), type="text", median = TRUE, digits=2)
##
## ===================================================
## Statistic N Mean St. Dev. Min Median Max
## ---------------------------------------------------
## winpct 180 50.00 6.75 31.48 50.00 63.58
## payroll 180 106.71 43.70 17.89 97.63 231.98
## top20_share 180 58.28 7.41 35.94 57.24 81.37
## ---------------------------------------------------
As part of data exploration, let’s create a scatter plot of winning percentage against top 20% share.
ggplot(data=teams, aes(x=top20_share,y=winpct, label=teamID, color=as.factor(yearID))) + geom_text(size=3) +
ggtitle("Team inequality and performance of MLB teams, 2011-2016") +
xlab("Share of payroll earned by top 20%") + ylab("Percent of games won") +
scale_color_discrete(name="year") + theme_minimal()
Finally, let’s run three regression models linking performance to pay inequality and total payroll.
m1 <- lm(winpct ~ top20_share, data=teams)
m2 <- lm(winpct ~ top20_share + payroll, data=teams)
m3 <- lm(winpct ~ top20_share + log(payroll) , data=teams)
#creates a regression table
stargazer(m1,m2,m3, type="text")
##
## ==========================================================================================
## Dependent variable:
## ----------------------------------------------------------------------
## winpct
## (1) (2) (3)
## ------------------------------------------------------------------------------------------
## top20_share -0.178*** -0.141** -0.154**
## (0.067) (0.064) (0.063)
##
## payroll 0.049***
## (0.011)
##
## log(payroll) 5.498***
## (1.113)
##
## Constant 60.350*** 52.924*** 33.747***
## (3.931) (4.071) (6.531)
##
## ------------------------------------------------------------------------------------------
## Observations 180 180 180
## R2 0.038 0.139 0.155
## Adjusted R2 0.033 0.129 0.145
## Residual Std. Error 6.636 (df = 178) 6.297 (df = 177) 6.238 (df = 177)
## F Statistic 7.051*** (df = 1; 178) 14.259*** (df = 2; 177) 16.193*** (df = 2; 177)
## ==========================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
As in the original sample paper, pay inequality is associated with lower team performance even after controlling for total payroll. The magnitude of the effect is that a one percentage point increase in top 20%’s share in payroll is associated with about 0.15 percentage point decrease in winning percentage. This seems a pretty modest effect - on average, it takes about a full percentage point change in the winning percentage to change a team’s standing.