Introduction

The purpose of this note is to illustrate the use statistical code to do empirical analysis from beginning to end: from database retrieval, through cleaning and manipulating data, to analysis and display of empirical results. This note can be used as an introductory exercise of reproducible research.

While reproducible analysis can be done by keeping code and write-up of the analysis separate, new tools now exist that combine the two. One such tool is RMarkdown which I use to create this note. There is a companion document that performs the same analysis and creates a similar document using dynamic documents (dyndoc) available in the latest version of Stata (Stata 15).

The empirical content of this note is motivated by the work of Herndon et al (2014) who famously discovered a “spreadsheet error” in the work by Reinhart and Rogoff (2010). Herndon et al (2014) overturn Reinhart and Rogoff results that economic growth declines as public debt reaches levels above 90% of GDP. The sources of the different result include a miss-typed formula in an Excel spreadsheet and arbitrary exclusions of certain observations.

The “spreadsheet error” episode highlights the importance of careful data manipulation and thorough documentation of empirical analysis. Using Stata code to retrieve, manipulate and analyze data provides complete documentation of every step in the empirical analysis. (This is the so-called “soup-to-nuts” approach advocated by Project TIER.) Moreover, any researcher wishing to test the sensitivity of the results to alternative manipulations can do so by merely modifying the code.

Retrieving data

I use publicly available data from World Development Indicators (WDI). The advantage of using WDI is that the data is collected using consistent methodologies. The drawback is that WDI public debt data begins only in 1990 for most countries. Other researchers have put together much longer series by splicing data from a number of different sources. Some of this historical data is now available here.

Data from WDI can be retrieved directly using WDI package. Function WDI accesses the internet and retrieves the series listed in the indicator option. The names of series can be found here.

Running the WDI takes a bit of time. Therefore, after I retrieve the data once, I save it in a local directory. I then use the local file rather than retrieving the same data each time I run the code. Saving the retrieved data locally preserves the state of the database at the time of the retrieval.

library(tidyverse)
library(WDI)
library(stargazer)

#wdi <- WDI(country = "all", start=1960, end=2017, extra="TRUE",
#           indicator=c("NY.GDP.MKTP.KD.ZG","GC.DOD.TOTL.GD.ZS" , "NY.GDP.PCAP.KD", "SP.POP.TOTL"))
#write_csv(wdi,"wdi08142017.csv") #save the retrieved data
wdi <- read_csv("wdi08142017.csv")

Cleaning data

The WDI contains data on ‘aggregates’ (e.g. High Income Countries). I drop these observations from the data. I also drop countries that at any point in the data population of less than 100 thousand (countries like San Marino, Marshall Islands, etc.).

#rename the variables more recognizable names
wdi <- rename(wdi, gdppc = NY.GDP.PCAP.KD, debttogdp = GC.DOD.TOTL.GD.ZS, gdpgrowth = NY.GDP.MKTP.KD.ZG, pop = SP.POP.TOTL)
#delete the 'Aggregates' so that we only have countries
wdi <- wdi[wdi$region != "Aggregates",]
#keep only the variables we're going to use
wdi <- select(wdi, debttogdp, gdpgrowth, gdppc, year, country, pop)
#the WDI data on public debt is very sparse, so let's delete any data prior to 1990
wdi <- filter(wdi, year>=1990)
#drop countries that at any point population below 500K
#find min population for each country
wdi <- wdi %>% group_by(country) %>% mutate(minpop=min(pop)) %>% filter(minpop>=500000) %>% select(-minpop,-pop)
#keep only observations for which we have no missing values
wdi <- wdi[!is.na(wdi$debttogdp), ]
wdi <- wdi[!is.na(wdi$gdpgrowth), ]
wdi <- wdi[!is.na(wdi$gdppc), ]
#create a log of GDP per capita in case we need it later int he analysis
wdi$loggdppc <- log(wdi$gdppc)

#create debt categories
wdi$debtcat <- ifelse(wdi$debttogdp <= 30, "0-30%",
                       ifelse(wdi$debttogdp <= 60, "30-60%",
                              ifelse(wdi$debttogdp <= 90 , "60-90%", "Above 90%")))

Analyzing data

Is debt associated with low growth?

We will create a box plot of real GDP growth by debt category (similar to Herndon et al’s Figure 3).

ggplot(wdi,aes(x = factor(debtcat),y = gdpgrowth)) + geom_boxplot() +
  labs(y="Real GDP growth (in %)", x="Debt (as % of GDP)",
       title="Contemporaneous relationship between debt and growth")

Despite using only recent data, our results are quite similar to those in Herndon et al (2014), i.e. they show very little relationship between debt levels and economic growth.

Does debt predict low growth?

The initial work by Reihart and Rogoff (2010) and Herndon et al (2014) was followed by a slew of papers that extended the analysis further. For example, Eberhardt and Presbitero (2015) look at the long-run effects of high debt, and Panizza and Presbitero (2014) control for endogeneity in the relationship between public debt and economic growth. In the spirit of that work, this section examines whether indebtedness predicts GDP growth over the next four years. This requires a bit more data manipulation.

Our strategy is to create five-year periods for each country. The first year in that period is the initial year and we will measure GDP growth over the next four years. Since each country enters the data set at different times, the five-year periods could begin in different years for different countries.

#find when the country enters the data set 
wdi <- wdi %>% group_by(country) %>% mutate(minyear=min(year))
#create a number for each five year period (0 for first five-year period, 1 for the second five-year period etc.)
wdi <- mutate(wdi,fiveyearid=floor((year-minyear)/5))  
#create a number (0,1,2,4 or 5)for each year in each five-year period 
wdi <- mutate(wdi,yearinfive=year-minyear-fiveyearid*5)  

Let’s now split the data set into two: one that has the first year for every country and every five year period; and one that has the average GDP growth for the subsequent four years for every country and every five-year period. Then we merge the data sets back together.

firstyear <- wdi %>% filter(yearinfive==0) %>% select(debttogdp,debtcat,country,fiveyearid, gdppc, loggdppc)
sbsqntyear <- wdi %>% filter(yearinfive!=0) %>% select(gdpgrowth,country,fiveyearid) 
#average across the four subsequent years
sbsqntyear <- sbsqntyear %>% group_by(country, fiveyearid) %>% summarize(avggrowth=mean(gdpgrowth), n=n())
sbsqntyear <- filter(sbsqntyear,n>2)
#merge first and subsequent years together using inner_join since
#some observations will not match because may have first year without any subsequent years 
#or since there are gaps in the data, we could have subsequent without initial
wdi5 <- inner_join(firstyear,sbsqntyear, by=c("country","fiveyearid"))

#wdi needs to be dataframe for stargazer to work
wdi5 <- data.frame(wdi5)

Now we are ready to analyze the data. Let’s plot initial debt levels against subsequent growth.

ggplot(wdi5, aes(x=debtcat,y=avggrowth)) + geom_boxplot() +
  labs(y="Average real GDP over subsequent four years", x="Initial debt (as % of GDP)",
       title="Initial debt level and subsequent growth")

It appears that there is no significant relationship between initial debt and subsequent GDP growth. Let’s examine this more systematically using a regression. First, the descriptive statistics:

stargazer(wdi5[c("avggrowth", "debttogdp", "gdppc")], type = "text" , digits=1)
## 
## ==============================================
## Statistic  N    Mean   St. Dev.  Min    Max   
## ----------------------------------------------
## avggrowth 261   3.2      2.8    -9.1    13.1  
## debttogdp 261   56.4     39.8    0.3   289.8  
## gdppc     261 18,795.7 19,994.9 182.7 88,519.6
## ----------------------------------------------

Let’s estimate some regressions.

r1 <- lm(avggrowth ~ debttogdp, data = wdi5)
r2 <- lm(avggrowth ~ debttogdp + gdppc, data = wdi5)
r3 <- lm(avggrowth ~ debttogdp + loggdppc, data = wdi5)

And show the results in a nice table:

stargazer(r1, r2,r3, type = "html", keep.stat = c("n","rsq"), intercept.bottom = FALSE)
Dependent variable:
avggrowth
(1) (2) (3)
Constant 3.828*** 4.766*** 8.505***
(0.303) (0.330) (1.071)
debttogdp -0.012*** -0.013*** -0.014***
(0.004) (0.004) (0.004)
gdppc -0.00005***
(0.00001)
loggdppc -0.506***
(0.111)
Observations 261 261 261
R2 0.026 0.136 0.098
Note: p<0.1; p<0.05; p<0.01

The regression results show statistically significant negative relationship between debt and subsequent growth. This is somewhat contradictory to the box plot we examined above, but consistent with the findings of Eberhardt and Presbitero (2015).

Conclusion

As described in Gentzkow and Shapiro (2014), doing empirical work involves writing a lot of code. Code makes analysis reproducible, less prone to errors, and easily extendable. This note introduced the concepts of retrieving, manipulating and analyzing data using Stata code. Although, the empirical results are just the first pass, they are meaningful, consistent with existing finding, and, importantly, can be used as a starting point for further analysis.

Exercises

  1. How sensitive are the regression results to including certain countries? Create a scatter plot of initial debt-to-GDP ratio against subsequent growth. Is there a particular country that seems to drive the negative relationship? Re-estimate the regression without that country.

  2. Are the results sensitive to the number of subsequent years required to be included in the analysis?

  3. Does it matter if the small-country filter is included before or after we drop the prior 1990 data?

  4. Does average debt over two years predict GDP over the next three years?

  5. Does average debt over five years predict GDP over the next five years?

References

Eberhardt, Markus, and Andrea F Presbitero. 2015. “Public Debt and Growth: Heterogeneity and Non-Linearity.” Journal of International Economics 97 (1). Elsevier: 45–58.

Gentzkow, Matthew, and Jesse M Shapiro. 2014. “Code and Data for the Social Sciences: A Practitioner’s Guide.” University of Chicago Mimeo. Last Updated January.

Herndon, Thomas, Michael Ash, and Robert Pollin. 2014. “Does High Public Debt Consistently Stifle Economic Growth? A Critique of Reinhart and Rogoff.” Cambridge Journal of Economics 38 (2). Oxford University Press: 257–79.

Panizza, Ugo, and Andrea F Presbitero. 2014. “Public Debt and Economic Growth: Is There a Causal Effect?” Journal of Macroeconomics 41. Elsevier: 21–41.

Reinhart, Carmen M., and Kenneth S. Rogoff. 2010. “Growth in a Time of Debt.” American Economic Review 100 (2): 573–78. doi:10.1257/aer.100.2.573.