Does pay inequality affect team performance?

This document illustrates data manipulation and regression estimation in R. It accompanies Sample Annotated Paper in Econometrics published by the Journal of Economic Education in 2007. The sample annotated paper illustrated the structure of an econometrics paper: the sections, presentation of descriptive statistics, results etc. In contrast, this document illustrates how the data for an empirical paper can be put together. It shows the process through which empirical researchers retrieve, manipulate and analyze data. For more on using R Markdown in teaching econometrics see this paper.

1. Reading or retrieving data

We need data two pieces of information: pay inequality and team performance. The USA Today has data on salaries of individual players in the MLB. Baseball-reference.com has data on team performance. All of this data is also compiled in Lahman database which has standardized player and team names, and provides the data in a convenient .csv format. We will use two of the Laham files, Salaries.csv and Teams.csv, as our sources.

library(tidyverse)
#salaries <- read_csv("https://github.com/chadwickbureau/baseballdatabank/blob/master/core/Salaries.csv?raw=True")
#teams <- read_csv("https://github.com/chadwickbureau/baseballdatabank/blob/master/core/Teams.csv?raw=True")
salaries <- read_csv("Salaries.csv")
teams <- read_csv("Teams.csv")

2. Creating new variables

The teams data set has information on each team’s performance for each season. It includes variables on the number of games played, G and the number of games won, W. Let’s create a new variable winpct which is the percentage of games won.

teams <- mutate(teams, winpct = W/G*100)

3. Selecting observations

The teams data set goes back to the 19th century. Let’s limit our data to after 2010.

teams <- filter(teams, yearID>2010)

4. Selecting variables

To simplify our data set, let’s only keep the variables that we will use in this study.

teams <- select(teams, yearID, teamID, winpct)

5. Aggregating by groups

We have data on individual players’ salaries, but we need to calculate the share of the top 20% in the total payroll. This requires us to identify to which quintile a player’s salary belongs to within each team and each season. We can use function ntile() which creates a new variable indicating which n-tile the observation falls into. For example, ntile(x,10) would tell us which decile each observation in x belongs to. In addition, we need to calculate total payroll (by summing up salaries) for each team and season. Let’s also divide payroll by one million so that it is measured in millions.

salaries <- salaries %>%
  group_by(yearID, teamID) %>%
  mutate(payroll=sum(salary)/1000000, pctile=ntile(salary,5))
head(salaries, n=3)
## # A tibble: 3 x 7
## # Groups: yearID, teamID [1]
##   yearID teamID lgID  playerID  salary payroll pctile
##    <int> <chr>  <chr> <chr>      <int>   <dbl>  <int>
## 1   1985 ATL    NL    barkele01 870000    14.8      5
## 2   1985 ATL    NL    bedrost01 550000    14.8      3
## 3   1985 ATL    NL    benedbr01 545000    14.8      2

Let’s now focus on players in the top (the 5th) quintile and add up their salaries. We can then divide their combined salary by total payroll to obtain top20_share.

salaries_by_team  <- salaries %>%
  filter(pctile == 5) %>%
  group_by(yearID, teamID, payroll) %>%
  summarize(top20 = sum(salary)/1000000) %>%
  mutate(top20_share = top20/payroll*100)
head(salaries_by_team, n=3)
## # A tibble: 3 x 5
## # Groups: yearID, teamID [3]
##   yearID teamID payroll top20 top20_share
##    <int> <chr>    <dbl> <dbl>       <dbl>
## 1   1985 ATL       14.8  5.35        36.1
## 2   1985 BAL       11.6  4.09        35.4
## 3   1985 BOS       10.9  4.57        41.9

6. Joining data

Now we have two team/season level data that need to be joined together using yearID and teamID as keys to match the relevant observations. We will use inner_join to focus on teams and seasons that exist in both data sets.

teams <- inner_join(teams, salaries_by_team, by=c("yearID", "teamID"))
head(teams, n=3)
## # A tibble: 3 x 6
##   yearID teamID winpct payroll top20 top20_share
##    <int> <chr>   <dbl>   <dbl> <dbl>       <dbl>
## 1   2011 ARI      58.0    53.6  25.7        47.8
## 2   2011 ATL      54.9    87.0  54.1        62.2
## 3   2011 BAL      42.6    85.3  41.9        49.1

7. Creating descriptive statistics table

Let’s create a descriptive statistics table of the relevant variables.

library(stargazer)
stargazer(as.data.frame(select(teams,winpct, payroll,top20_share)), type="text", median = TRUE, digits=2)
## 
## ===================================================
## Statistic    N   Mean  St. Dev.  Min  Median  Max  
## ---------------------------------------------------
## winpct      180 50.00    6.75   31.48 50.00  63.58 
## payroll     180 106.71  43.70   17.89 97.63  231.98
## top20_share 180 58.28    7.41   35.94 57.24  81.37 
## ---------------------------------------------------

8. Graphing data

As part of data exploration, let’s create a scatter plot of winning percentage against top 20% share.

ggplot(data=teams, aes(x=top20_share,y=winpct, label=teamID, color=as.factor(yearID))) + geom_text(size=3) +
  ggtitle("Team inequality and performance of MLB teams, 2011-2016") + 
  xlab("Share of payroll earned by top 20%") + ylab("Percent of games won") +
  scale_color_discrete(name="year") + theme_minimal()

9. Regression analysis

Finally, let’s run three regression models linking performance to pay inequality and total payroll.

m1 <- lm(winpct ~ top20_share, data=teams)
m2 <- lm(winpct ~ top20_share + payroll, data=teams)
m3 <- lm(winpct ~ top20_share + log(payroll) , data=teams)
#creates a regression table
stargazer(m1,m2,m3, type="text")
## 
## ==========================================================================================
##                                              Dependent variable:                          
##                     ----------------------------------------------------------------------
##                                                     winpct                                
##                              (1)                     (2)                     (3)          
## ------------------------------------------------------------------------------------------
## top20_share               -0.178***               -0.141**                -0.154**        
##                            (0.067)                 (0.064)                 (0.063)        
##                                                                                           
## payroll                                           0.049***                                
##                                                    (0.011)                                
##                                                                                           
## log(payroll)                                                              5.498***        
##                                                                            (1.113)        
##                                                                                           
## Constant                  60.350***               52.924***               33.747***       
##                            (3.931)                 (4.071)                 (6.531)        
##                                                                                           
## ------------------------------------------------------------------------------------------
## Observations                 180                     180                     180          
## R2                          0.038                   0.139                   0.155         
## Adjusted R2                 0.033                   0.129                   0.145         
## Residual Std. Error    6.636 (df = 178)       6.297 (df = 177)        6.238 (df = 177)    
## F Statistic         7.051*** (df = 1; 178) 14.259*** (df = 2; 177) 16.193*** (df = 2; 177)
## ==========================================================================================
## Note:                                                          *p<0.1; **p<0.05; ***p<0.01

9. Conclusion

As in the original sample paper, pay inequality is associated with lower team performance even after controlling for total payroll. The magnitude of the effect is that a one percentage point increase in top 20%’s share in payroll is associated with about 0.15 percentage point decrease in winning percentage. This seems a pretty modest effect - on average, it takes about a full percentage point change in the winning percentage to change a team’s standing.