{
"cells": [
{
"cell_type": "markdown",
"id": "3eac964f",
"metadata": {},
"source": [
"# Machine Learning in Economics\n",
"\n",
"**Author**\n",
"\n",
"> - [Paul Schrimpf *UBC*](https://economics.ubc.ca/faculty-and-staff/paul-schrimpf/) \n",
"\n",
"\n",
"\n",
"**Prerequisites**\n",
"\n",
"- [Regression](https://datascience.quantecon.org/regression.html) \n",
"\n",
"\n",
"**Outcomes**\n",
"\n",
"- Understand how economists use machine learning in\n",
" academic research "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "02b7502d",
"metadata": {
"hide-output": false
},
"outputs": [],
"source": [
"# Uncomment following line to install on colab\n",
"#! pip install fiona geopandas xgboost gensim folium pyLDAvis descartes"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f8b6fd2b",
"metadata": {
"hide-output": false
},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"\n",
"%matplotlib inline\n",
"#plt.style.use('tableau-colorblind10')\n",
"#plt.style.use('Solarize_Light2')\n",
"plt.style.use('bmh')"
]
},
{
"cell_type": "markdown",
"id": "c2334629",
"metadata": {},
"source": [
"## Introduction\n",
"\n",
"Machine learning is increasingly being utilized in economic\n",
"research. Here, we discuss three main ways that economists are\n",
"currently using machine learning methods."
]
},
{
"cell_type": "markdown",
"id": "1941a1c0",
"metadata": {},
"source": [
"## Prediction Policy\n",
"\n",
"Most empirical economic research focuses on questions of\n",
"causality. However, machine learning methods can actually be used\n",
"in economic research or policy making when the goal is prediction.\n",
"\n",
"[[mlKLMO15](#id13)] is a short paper which makes this point.\n",
"\n",
"> Consider two toy examples. One policymaker facing a drought must\n",
"decide whether to invest in a rain dance to increase the chance\n",
"of rain. Another seeing clouds must decide whether to take an\n",
"umbrella to work to avoid getting wet on the way home. Both\n",
"decisions could benefit from an empirical study of rain. But each\n",
"has differ-ent requirements of the estimator. One requires\n",
"causality: Do rain dances cause rain? The other does not, needing\n",
"only prediction: Is the chance of rain high enough to merit an\n",
"umbrella? We often focus on rain dance–like policy problems. But\n",
"there are also many umbrella-like policy problems. Not only are\n",
"these prediction problems neglected, machine learning can help\n",
"us solve them more effectively. [[mlKLMO15](#id13)].\n",
"\n",
"\n",
"One of their examples is the allocation of joint replacements for\n",
"osteoarthritis in elderly patients. Joint replacements are costly,\n",
"both monetarily and in terms of potentially painful recovery from\n",
"surgery. Joint replacements may not be worthwhile for patients who do\n",
"not live long enough afterward to enjoy the\n",
"benefits. [[mlKLMO15](#id13)] uses machine learning methods to\n",
"predict mortality and argues that avoiding joint replacements\n",
"for people with the highest predicted mortality risk could lead to\n",
"sizable benefits.\n",
"\n",
"Other situations where improved prediction could improve economic\n",
"policy include:\n",
"\n",
"- Targeting safety or health inspections. \n",
"- Predicting highest risk youth for targeting interventions. \n",
"- Improved risk scoring in insurance markets to reduce adverse\n",
" selection. \n",
"- Improved credit scoring to better allocate credit. \n",
"- Predicting the risk someone accused of a crime does not show up for\n",
" trial to help decide whether to offer bail [[mlKLL+17](#id12)]. \n",
"\n",
"\n",
"We investigated one such prediction policy problem in\n",
"[recidivism](https://datascience.quantecon.org/recidivism.html)."
]
},
{
"cell_type": "markdown",
"id": "9cf60b7a",
"metadata": {},
"source": [
"## Estimation of Nuisance Functions\n",
"\n",
"Most empirical economic studies are interested in a single low\n",
"dimensional parameter, but determining that parameter may require estimating additional\n",
"“nuisance” parameters to estimate this coefficient consistently and avoid omitted variables\n",
"bias. However, the choice of which other variables to include and\n",
"their functional forms is often somewhat arbitrary.\n",
"One promising idea is to use machine learning methods to let the data\n",
"decide what control variables to include and how. Care must be\n",
"taken when doing so, though, because machine learning’s flexibility and complexity\n",
"– what make it so good at prediction – also pose\n",
"challenges for inference."
]
},
{
"cell_type": "markdown",
"id": "f258d836",
"metadata": {},
"source": [
"### Partially Linear Regression\n",
"\n",
"To be more concrete, consider a regression model. We have some\n",
"regressor of interest, $ d $, and we want to estimate the effect of $ d $\n",
"on $ y $. We have a rich enough set of controls $ x $ that we are willing to\n",
"believe that $ E[\\epsilon|d,x] = 0 $ . $ d_i $ and $ y_i $ are scalars, while\n",
"$ x_i $ is a vector. We are not interested in $ x $ per se, but we need to\n",
"include it to avoid omitted variable bias. Suppose the true model\n",
"generating the data is:\n",
"\n",
"$$\n",
"y = \\theta d + f(x) + \\epsilon\n",
"$$\n",
"\n",
"where $ f(x) $ is some unknown function. This is called a\n",
"partially linear model: linear in $ d $, but not in\n",
"$ x $ .\n",
"\n",
"A typical applied econometric approach for this model would\n",
"be to choose some transform of $ x $, say $ X = T(x) $, where $ X $\n",
"could be some subset of $ x $ , perhaps along with interactions, powers, and\n",
"so on. Then, we estimate a linear regression,\n",
"\n",
"$$\n",
"y = \\theta d + X'\\beta + e\n",
"$$\n",
"\n",
"and then perhaps also report results for a handful of different\n",
"choices of $ T(x) $ .\n",
"\n",
"Some downsides to the typical applied econometric practice\n",
"include:\n",
"\n",
"- The choice of $ T $ is arbitrary, which opens the door to specification\n",
" searching and p-hacking. \n",
"- If $ x $ is high dimensional and $ X $ is low dimensional, a poor\n",
" choice will lead to omitted variable bias. Even if $ x $ is low\n",
" dimensional, omitted variable bias occurs if $ f(x) $ is poorly approximated by $ X'\\beta $. \n",
"\n",
"\n",
"In some sense, machine learning can be thought of as a way to\n",
"choose $ T $ in an automated and data-driven way. Choosing which machine learning method\n",
"to use and tuning parameters specifically for that method are still potentially arbitrary\n",
"decisions, but these decisions may have less impact.\n",
"\n",
"Economic researchers typically don’t just want an estimate of\n",
"$ \\theta $, $ \\hat{\\theta} $. Instead, they want to know that\n",
"$ \\hat{\\theta} $ has good statistical properties (it should at\n",
"least be consistent), and they want some way to quantify how uncertain is\n",
"$ \\hat{\\theta} $ (i.e. they want a standard error). The complexity\n",
"of machine learning methods makes their statistical properties\n",
"difficult to understand. If we want $ \\hat{\\theta} $ to have\n",
"known and good statistical properties, we must make sure we use machine\n",
"learning methods correctly. A procedure to estimate\n",
"$ \\theta $ in the partially linear model is as follows:\n",
"\n",
"1. Predict $ y $ and $ d $ from $ x $ using any machine\n",
" learning method with “cross-fitting”. \n",
" - Partition the data in $ k $ subsets. \n",
" - For the $ j $ th subset, train models to predict $ y $ and $ d $\n",
" using the other $ k-1 $ subsets. Denote the predictions from\n",
" these models as $ p^y_{-j}(x) $ and $ p^d_{-j}(x) $. \n",
" - For $ y_i $ in the $ j $ -th subset use the other\n",
" $ k-1 $ subsets to predict $ \\hat{y}_i = p^y_{-j(i)}(x_i) $ \n",
"1. Partial out $ x $ : let $ \\tilde{y}_i = y_i - \\hat{y}_i $\n",
" and $ \\tilde{d}_i = d_i - \\hat{d}_i $. \n",
"1. Regress $ \\tilde{y}_i $ on $ \\tilde{d}_i $, let\n",
" $ \\hat{\\theta} $ be the estimated coefficient for\n",
" $ \\tilde{d}_i $ . $ \\hat{\\theta} $ is consistent,\n",
" asymptotically normal, and has the usual standard error (i.e. the\n",
" standard error given by statsmodels is correct). \n",
"\n",
"\n",
"Some remarks:\n",
"\n",
"- This procedure gives a $ \\hat{\\theta} $ that has the same\n",
" asymptotic distribution as what we would get if we knew the true\n",
" $ f(x) $ . In statistics, we call this an oracle property,\n",
" because it is as if an all knowing oracle told us $ f(x) $. \n",
"- This procedure requires some technical conditions on the data-generating\n",
" process and machine learning estimator, but we will not worry about them here. See\n",
" [[mlCCD+18](#id14)] for details. \n",
"\n",
"\n",
"Here is code implementing the above idea."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "93be6840",
"metadata": {
"hide-output": false
},
"outputs": [],
"source": [
"from sklearn.model_selection import cross_val_predict\n",
"from sklearn import linear_model\n",
"import statsmodels as sm\n",
"\n",
"def partial_linear(y, d, X, yestimator, destimator, folds=3):\n",
" \"\"\"Estimate the partially linear model y = d*C + f(x) + e\n",
"\n",
" Parameters\n",
" ----------\n",
" y : array_like\n",
" vector of outcomes\n",
" d : array_like\n",
" vector or matrix of regressors of interest\n",
" X : array_like\n",
" matrix of controls\n",
" mlestimate : Estimator object for partialling out X. Must have ‘fit’\n",
" and ‘predict’ methods.\n",
" folds : int\n",
" Number of folds for cross-fitting\n",
"\n",
" Returns\n",
" -------\n",
" ols : statsmodels regression results containing estimate of coefficient on d.\n",
" yhat : cross-fitted predictions of y\n",
" dhat : cross-fitted predictions of d\n",
" \"\"\"\n",
"\n",
" # we want predicted probabilities if y or d is discrete\n",
" ymethod = \"predict\" if False==getattr(yestimator, \"predict_proba\",False) else \"predict_proba\"\n",
" dmethod = \"predict\" if False==getattr(destimator, \"predict_proba\",False) else \"predict_proba\"\n",
" # get the predictions\n",
" yhat = cross_val_predict(yestimator,X,y,cv=folds,method=ymethod)\n",
" dhat = cross_val_predict(destimator,X,d,cv=folds,method=dmethod)\n",
" ey = np.array(y - yhat)\n",
" ed = np.array(d - dhat)\n",
" ols = sm.regression.linear_model.OLS(ey,ed).fit(cov_type='HC0')\n",
"\n",
" return(ols, yhat, dhat)"
]
},
{
"cell_type": "markdown",
"id": "babdbf65",
"metadata": {},
"source": [
"### Application: Gender Wage Gap\n",
"\n",
"Okay, enough theory. Let’s look at an application. Policy makers have\n",
"long been concerned with the gender wage gap. We will examine the\n",
"gender wage gap using data from the 2018 Current Population Survey (CPS) in\n",
"the US. In particular, we will use the version of the [CPS provided by\n",
"the NBER](https://www.nber.org/cps/)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bd1d644d",
"metadata": {
"hide-output": false
},
"outputs": [],
"source": [
"import statsmodels.formula.api as smf\n",
"from statsmodels.iolib.summary2 import summary_col\n",
"# Download CPS data\n",
"cpsall = pd.read_stata(\"https://www.nber.org/morg/annual/morg18.dta\")\n",
"\n",
"# take subset of data just to reduce computation time\n",
"cps = cpsall.sample(30000, replace=False, random_state=0)\n",
"display(cps.head())\n",
"cps.describe()"
]
},
{
"cell_type": "markdown",
"id": "3a0d9812",
"metadata": {},
"source": [
"The variable “earnwke” records weekly earnings. Two\n",
"variables detail the hours of work. “uhours” is usual hours worked per\n",
"week, and “hourslw” is hours worked last week. We will try using each\n",
"measure of hours to construct the wage. Let’s estimate the\n",
"unconditional gender earnings and wage gaps."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "84ae86ce",
"metadata": {
"hide-output": false
},
"outputs": [],
"source": [
"cps[\"female\"] = (cps.sex==2)\n",
"cps[\"log_earn\"] = np.log(cps.earnwke)\n",
"cps[\"log_earn\"][np.isinf(cps.log_earn)] = np.nan\n",
"cps[\"log_uhours\"] = np.log(cps.uhourse)\n",
"cps[\"log_uhours\"][np.isinf(cps.log_uhours)] = np.nan\n",
"cps[\"log_hourslw\"] = np.log(cps.hourslw)\n",
"cps[\"log_hourslw\"][np.isinf(cps.log_hourslw)] = np.nan\n",
"cps[\"log_wageu\"] = cps.log_earn - cps.log_uhours\n",
"cps[\"log_wagelw\"] = cps.log_earn - cps.log_hourslw\n",
"\n",
"\n",
"lm = list()\n",
"lm.append(smf.ols(formula=\"log_earn ~ female\", data=cps,\n",
" missing=\"drop\").fit(cov_type='HC0'))\n",
"lm.append( smf.ols(formula=\"log_wageu ~ female\", data=cps,\n",
" missing=\"drop\").fit(cov_type='HC0'))\n",
"lm.append(smf.ols(formula=\"log_wagelw ~ female\", data=cps,\n",
" missing=\"drop\").fit(cov_type='HC0'))\n",
"lm.append(smf.ols(formula=\"log_earn ~ female + log_hourslw + log_uhours\", data=cps,\n",
" missing=\"drop\").fit(cov_type='HC0'))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0dc7c61c",
"metadata": {
"hide-output": false
},
"outputs": [],
"source": [
"summary_col(lm, stars=True)"
]
},
{
"cell_type": "markdown",
"id": "2342b830",
"metadata": {},
"source": [
"The unconditional gender gap in log earnings is about -0.3. Women earn\n",
"about 30% less than men. The unconditional gender wage gap is about\n",
"18%. The last column gives the gender earnings gap conditional on\n",
"hours. This could differ from the wage gap if, for example, full time\n",
"workers are paid higher wages than part-time. Some evidence\n",
"has suggested this, and the gender earnings gap conditional on hours is about\n",
"15%."
]
},
{
"cell_type": "markdown",
"id": "25f2c524",
"metadata": {},
"source": [
"#### Equal Pay for Equal Work?\n",
"\n",
"A common slogan is equal pay for equal work. One way to interpret this\n",
"is that for employees with similar worker and job characteristics, no gender wage gap should exist.\n",
"\n",
"Eliminating this wage gap across similar worker and job characteristics is one necessary\n",
"(though insufficient) condition for equality. Some of the differences (like hours in the 4th column of the above table)\n",
"might actually be a result of societal norms and/or\n",
"discrimination, so we don’t want to overgeneralize.\n",
"\n",
"Nonetheless, let’s examine whether there is a gender wage gap\n",
"conditional on all worker and job characteristics. We want to ensure\n",
"that we control for those characteristics as flexibly as\n",
"possible, so we will use the partially linear model described above."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f73a728f",
"metadata": {
"hide-output": false
},
"outputs": [],
"source": [
"from patsy import dmatrices\n",
"# Prepare data\n",
"fmla = \"log_earn + female ~ log_uhours + log_hourslw + age + I(age**2) + C(race) + C(cbsafips) + C(smsastat) + C(grade92) + C(unionmme) + C(unioncov) + C(ind02) + C(occ2012)\"\n",
"yd, X = dmatrices(fmla,cps)\n",
"female = yd[:,1]\n",
"logearn = yd[:,2]"
]
},
{
"cell_type": "markdown",
"id": "478f8f35",
"metadata": {},
"source": [
"The choice of regularization parameter is somewhat\n",
"tricky. Cross-validation is a good way to choose the best\n",
"regularization parameter when our goal is prediction. However,\n",
"our goal here is not prediction. Instead, we want to get a well-behaved\n",
"estimate of the gender wage gap. To achieve this, we generally need a\n",
"smaller regularization parameter than what would minimize\n",
"MSE. The following code picks such a regularization parameter. Note\n",
"however, that the details of this code might not meet the technical\n",
"theoretical conditions needed for our ultimate estimate of the gender\n",
"wage gap to be consistent and asymptotically normal. The R package,\n",
"“HDM” [[mlCHS16](#id21)] , chooses the regularization parameter in way that\n",
"is known to be correct."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "99d44c24",
"metadata": {
"hide-output": false
},
"outputs": [],
"source": [
"# select regularization parameter\n",
"alphas = np.exp(np.linspace(-2, -12, 25))\n",
"lassoy = linear_model.LassoCV(cv=6, alphas=alphas, max_iter=5000).fit(X,logearn)\n",
"lassod = linear_model.LassoCV(cv=6, alphas=alphas, max_iter=5000).fit(X,female)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3a4ac969",
"metadata": {
"hide-output": false
},
"outputs": [],
"source": [
"fig, ax = plt.subplots(1,2)\n",
"\n",
"def plotlassocv(l, ax) :\n",
" alphas = l.alphas_\n",
" mse = l.mse_path_.mean(axis=1)\n",
" std_error = l.mse_path_.std(axis=1)\n",
" ax.plot(alphas,mse)\n",
" ax.fill_between(alphas, mse + std_error, mse - std_error, alpha=0.2)\n",
"\n",
" ax.set_ylabel('MSE +/- std error')\n",
" ax.set_xlabel('alpha')\n",
" ax.set_xlim([alphas[0], alphas[-1]])\n",
" ax.set_xscale(\"log\")\n",
" return(ax)\n",
"\n",
"ax[0] = plotlassocv(lassoy,ax[0])\n",
"ax[0].set_title(\"MSE for log(earn)\")\n",
"ax[1] = plotlassocv(lassod,ax[1])\n",
"ax[1].set_title(\"MSE for female\")\n",
"fig.tight_layout()\n",
"\n",
"# there are theoretical reasons to choose a smaller regularization\n",
"# than the one that minimizes cv. BUT THIS WAY OF CHOOSING IS ARBITRARY AND MIGHT BE WRONG\n",
"def pickalpha(lassocv) :\n",
" imin = np.argmin(lassocv.mse_path_.mean(axis=1))\n",
" msemin = lassocv.mse_path_.mean(axis=1)[imin]\n",
" se = lassocv.mse_path_.std(axis=1)[imin]\n",
" alpha= min([alpha for (alpha, mse) in zip(lassocv.alphas_, lassocv.mse_path_.mean(axis=1)) if mse\n",
"\\[mlAI16\\] Susan Athey and Guido Imbens. Recursive partitioning for heterogeneous causal effects. *Proceedings of the National Academy of Sciences*, 113(27):7353–7360, 2016. URL: [http://www.pnas.org/content/113/27/7353](http://www.pnas.org/content/113/27/7353), [arXiv:http://www.pnas.org/content/113/27/7353.full.pdf](https://arxiv.org/abs/http://www.pnas.org/content/113/27/7353.full.pdf), [doi:10.1073/pnas.1510489113](https://doi.org/10.1073/pnas.1510489113).\n",
"\n",
"\n",
"\\[mlCCD+18\\] Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. Double/debiased machine learning for treatment and structural parameters. *The Econometrics Journal*, 21(1):C1–C68, 2018. URL: [https://onlinelibrary.wiley.com/doi/abs/10.1111/ectj.12097](https://onlinelibrary.wiley.com/doi/abs/10.1111/ectj.12097), [arXiv:https://onlinelibrary.wiley.com/doi/pdf/10.1111/ectj.12097](https://arxiv.org/abs/https://onlinelibrary.wiley.com/doi/pdf/10.1111/ectj.12097), [doi:10.1111/ectj.12097](https://doi.org/10.1111/ectj.12097).\n",
"\n",
"\n",
"\\[mlCDDFV18\\] Victor Chernozhukov, Mert Demirer, Esther Duflo, and Iván Fernández-Val. Generic machine learning inference on heterogenous treatment effects in randomized experimentsxo. Working Paper 24678, National Bureau of Economic Research, June 2018. URL: [http://www.nber.org/papers/w24678](http://www.nber.org/papers/w24678), [doi:10.3386/w24678](https://doi.org/10.3386/w24678).\n",
"\n",
"\n",
"\\[mlCHS16\\] Victor Chernozhukov, Chris Hansen, and Martin Spindler. hdm: high-dimensional metrics. *R Journal*, 8(2):185–199, 2016. URL: [https://journal.r-project.org/archive/2016/RJ-2016-040/index.html](https://journal.r-project.org/archive/2016/RJ-2016-040/index.html).\n",
"\n",
"\n",
"\\[mlKLL+17\\] Jon Kleinberg, Himabindu Lakkaraju, Jure Leskovec, Jens Ludwig, and Sendhil Mullainathan. Human Decisions and Machine Predictions*. *The Quarterly Journal of Economics*, 133(1):237–293, 08 2017. URL: [https://dx.doi.org/10.1093/qje/qjx032](https://dx.doi.org/10.1093/qje/qjx032), [arXiv:http://oup.prod.sis.lan/qje/article-pdf/133/1/237/24246094/qjx032.pdf](https://arxiv.org/abs/http://oup.prod.sis.lan/qje/article-pdf/133/1/237/24246094/qjx032.pdf), [doi:10.1093/qje/qjx032](https://doi.org/10.1093/qje/qjx032).\n",
"\n",
"\n",
"\\[mlKLMO15\\] Jon Kleinberg, Jens Ludwig, Sendhil Mullainathan, and Ziad Obermeyer. Prediction policy problems. *American Economic Review*, 105(5):491–95, May 2015. URL: [http://www.aeaweb.org/articles?id=10.1257/aer.p20151023](http://www.aeaweb.org/articles?id=10.1257/aer.p20151023), [doi:10.1257/aer.p20151023](https://doi.org/10.1257/aer.p20151023).\n",
"\n",
"\n",
"\\[mlWA18\\] Stefan Wager and Susan Athey. Estimation and inference of heterogeneous treatment effects using random forests. *Journal of the American Statistical Association*, 0(0):1–15, 2018. URL: [https://doi.org/10.1080/01621459.2017.1319839](https://doi.org/10.1080/01621459.2017.1319839), [arXiv:https://doi.org/10.1080/01621459.2017.1319839](https://arxiv.org/abs/https://doi.org/10.1080/01621459.2017.1319839), [doi:10.1080/01621459.2017.1319839](https://doi.org/10.1080/01621459.2017.1319839)."
]
}
],
"metadata": {
"date": 1662089819.479312,
"filename": "ml_in_economics.md",
"kernelspec": {
"display_name": "Python",
"language": "python3",
"name": "python3"
},
"title": "Machine Learning in Economics"
},
"nbformat": 4,
"nbformat_minor": 5
}