Coding Empricial Analysis from Beginning to End


The purpose of this note is to illustate the use statistical code to do empirical analysis from beginning to end: from database retrieval, through cleaning and manipulating data, to analysis and display of empirical results. This note can be used as an introductory excercise of reproducible research.

While reproducible analysis can be done by keeping code and write-up of the analysis separate, new tools now exist that combine the two. RMarkdown for R users has been around for a while. The latest version of Stata (Stata 15) includes this functionality as well. It is called dynamic documents (dyndoc). I use Stata’s dyndoc to produce this note.

The empirical content of this note is motivated by the work of Herdon et al (2014) who famously discovered a “spreadsheet error” in the work by Reinhart and Rogoff (2010). Herdon et al (2014) overturn Reinhart and Roggof results that economic growth declines as public debt reaches levels above 90% of GDP. The sources of the different result include a miss-typed formula in an Excel spreadsheet and arbitrary exclusions of certain observations.

The “spreadsheet error” episode highlights the importance of careful data manipulation and thorough documentation of empirical analysis. Using Stata code to retrieve, manipulate and analyze data provides complete documentation of every step in the empirical analysis. (This is the so-called “soup-to-nuts” approach advocated by Project TIER.) Moreover, any researcher wishing to test the sensitivity of the results to alternative manipulations can do so by merely modifying the code.

Data retrieval

I use publicly available data from World Development Indicators (WDI). The advantage of using WDI is that the data is collected using consistent methodologies. The drawback is that WDI public debt data begins only in 1990 for most countries. Other researchers have put together much longer series by splicing data from a number of different sources. Some of this historical data is now available here.

Data from WDI can be retrieved directly using Stata command wbopendata. The command accesses the internet and retrieves the series listed in the indicator option. The names of series can be found here.

Running the ‘wbopendata’ takes a bit of time. Therefore, after I retrive the data once, I save it in a local directory. I then use the local file rather than retrieving the same data each time I run the code. Saving the retrieved data locally preserves the state of the database at the time of the retrieval.

. *clear
. *wbopendata, indicator(NY.GDP.MKTP.KD.ZG ; GC.DOD.TOTL.GD.ZS; NY.GDP.PCAP.KD; SP.POP.TOTL) clear long 
. *save wdidata08102017, replace

Cleaning data

The WDI contains data on ‘aggregates’ (e.g. High Income Countries). I drop these observations from the data. Prior to 1990, the WDI data on public debt is very sparse, so let’s delete any data prior to 1990. I also drop countries that at any point in the data (since 1990) had population of less than 500 thousand (countries like San Marino, Marshall Islands, etc.).

. use wdidata08102017, clear

. drop if region=="Aggregates" | region==""

. keep if year>=1990

. *find the minimum population for each country
. egen popmin=min(sp_pop_totl), by(countrycode)

. drop if popmin<500000

. drop popmin sp_pop_totl

Let’s give the variables more recognizable names.

. rename gc_dod_totl_gd_zs debttogdp

. rename ny_gdp_mktp_kd_zg gdpgrowth

. rename ny_gdp_pcap_kd gdppc

Let’s keep only observations for which none of the three variables are missing. Let’s also create log of GDP per capita - we will need it later in the analysis. Finally, let’s create a debt_category variable that would put levels of debt in different buckets.

. keep if debttogdp~=. & gdpgrowth~=. & gdppc~=.

. g loggdppc = log(gdppc)

. g str15 debt_cat="0-30%" if debttogdp<=30

. replace debt_cat="30-60%" if debttogdp>30 & debttogdp<=60

. replace debt_cat="60-90%" if debttogdp>60 & debttogdp<=90

. replace debt_cat="Above 90%" if debttogdp>90 & debttogdp~=.

Examining data I: Is debt associated with low growth?

We will create a box plot of real GDP growth by debt category (similar to Herdon et al’s Figure 3).

. graph box gdpgrowth , over(debt_cat) ytitle("Real GDP Growth (in %)") title("Contemporaneous relationship between debt and growth")

Despite using only recent data, our results are quite similar to those in Herdon et al (2014), i.e. they show very little relationship between debt levels and economic growth.

Examining data II: Does debt predict low growth?

The initial work by Reihart and Rogoff (2010) and Herdon et al (2014) was followed by a slew of papers that extended the analysis further. For example, Eberhardt and Presbitero (2015) look at the long-run effects of high debt, and Panizza and Presbitero (2014) control for endogeneity in the relationship between public debt and economic growth. In the spririt of that work, this section examines whether indebtedness predicts GDP growth over the next four years. This requires a bit more data manipulation.

Our strategy is to create five-year periods for each country. The first year in that period is the inital year and we will measure GDP growth over the next four years. Since each country enters the data set at different times, the five-year periods could begin in different years for different countries.

. *find when the country enters the data set 
. egen minyear=min(year), by(countrycode)

. *create a number for each five year period (0 for first five-year period, 1 for the second five-year period etc.)
. g fiveyearid = floor((year-minyear)/5)

. *create a number (0,1,2,4 or 5)for each year in each five-year period 
. g year_in_five = year-minyear-fiveyearid*5 

. save temp/data, replace
file temp/data.dta saved

Note that we put the ‘data’ file into a ‘temp’ folder. This is because we need ‘data’ need temporarily - it is not the final output of the analysis, or even the final output of this program.

Let’s now split the dataset into two: one that has the first year for every country and every five year period; and one that has the average GDP growth for the subsequent four years for every country and every five-year period. Then we merge the data sets back together.

. *keep only the first years in each five year period
. keep if year_in_five==0 

. keep countryname gdppc loggdppc debttogdp debt_cat fiveyearid

. save temp/first_years, replace

. use temp/data

. *keep only the subsequent years
. drop if year_in_five==0 

. keep countryname gdpgrowth fiveyearid

. *average across the four subsequent years
. collapse (mean) gdpgrowth (count) n=gdpgrowth, by(countryname fiveyearid)

. *merge back the inital debt
. merge 1:1 countryname fiveyearid using temp/first_years 

. *some observations will not match because may have first year without any subsequent years 
. *or since there are gaps in the data, we could have subsequent without initial
. drop if _merge~=3

. *also drop observations with fewer than two subsequent periods 
. drop if n<3

Now we are ready to analyze the data. Let’s plot inital debt levels against subsequent growth.

. graph box gdpgrowth ,over(debt_cat) ytitle("Average real GDP growth over next four years") title("Initial debt level and subsequent growth")

It appears that there is no significant relationship between initial debt and subsequent GDP growth. Let’s examine this more systematically using a regression. First, the descriptive statistics:

. *create descriptive statistics table
. tabstat gdpgrowth debttogdp gdppc ,statistics(mean median sd min max) columns(statistics) format(%9.1f)

    variable |      mean       p50        sd       min       max
   gdpgrowth |       3.2       3.1       2.8      -9.1      13.1
   debttogdp |      56.4      49.6      39.8       0.3     289.8
       gdppc |   18795.7    8872.8   19994.9     182.7   88519.6

Now, let’s display the results of the regressions in a table. (dyndoc does not work well with outreg so we use estout.)

. reg gdpgrowth debttogdp

. estimates store m1 

. reg gdpgrowth debttogdp gdppc

. estimates store m2  

. reg gdpgrowth debttogdp loggdppc

. estimates store m3

. estout m1 m2 m3, cells(b(star fmt(3)) se(par fmt(3))) legend stats(r2 N)

                       m1              m2              m3   
                     b/se            b/se            b/se   
debttogdp          -0.012**        -0.013**        -0.014** 
                  (0.004)         (0.004)         (0.004)   
gdppc                              -0.000***                
loggdppc                                           -0.506***
_cons               3.828***        4.766***        8.505***
                  (0.303)         (0.330)         (1.071)   
r2                  0.026           0.136           0.098   
N                 261.000         261.000         261.000   
* p<0.05, ** p<0.01, *** p<0.001

The regression results show statistically significant negative relationship between debt and subsequent growth. This is somewhat contradictory to the box plot we examined above, but consistent with the findings of Eberhardt and Presbitero (2015).


As described in Gentzkow and Shapiro (2014), doing empirical work involves writing a lot of code. Code makes analysis reproducible, less prone to errors, and easily extendable. This note introduced the concepts of retrieving, manipulating and analyzing data using Stata code. Although, the empirical results are just the first pass, they are meaningful, consistent with existing finding, and, importantly, can be used as a starting point for further analysis.


Eberhardt, M. & Presbitero, A. F. (2015), ‘Public debt and growth: Heterogeneity and non-linearity’, Journal of International Economics 97(1), 45–58.

Gentzkow, M. & Shapiro, J. M. (2014), ‘Code and data for the social sciences: A practitioner’s guide’, University of Chicago mimeo. Last updated January.

Herndon, T.; Ash, M. & Pollin, R. (2014), ‘Does high public debt consistently stifle economic growth? A critique of Reinhart and Rogoff’, Cambridge journal of economics 38(2), 257–279.

Panizza, U. & Presbitero, A. F. (2014), ‘Public debt and economic growth: is there a causal effect?’, Journal of Macroeconomics 41, 21–41.

Reinhart, C. M. & Rogoff, K. S. (2010), ‘Growth in a Time of Debt’, American Economic Review 100(2), 573-78.