C26 | Macro Paper Warehouse

A Robust Test for Weak Instruments for 2SLS with Multiple Endogenous Regressors

This paper develops a test for instrument strength based on the bias of two-stage least squares (2SLS) that: (1) generalizes the Stock-Yogo (2005) and Sanderson-Windmeijer (2016) tests to be robust to heteroskedasticity and autocorrelation (HAC), and (2) extends the Montiel Olea-Pflueger (2013) robust test from models with a single endogenous regressor to models with multiple endogenous regressors—the important remaining gap identified by Andrews et al. (2019). The test is based on a weighted quadratic loss in the asymptotic bias of 2SLS and can use either the Stock-Yogo absolute bias criterion or the 2SLS bias relative to Montiel Olea-Pflueger’s worst-case benchmark. Extensions are developed to test whether instruments are weak for individual 2SLS coefficients. In simulations, the test controls size and is powerful, and the authors provide efficient code packages. The test is applied to state-dependent fiscal multipliers (Ramey-Zubairy 2018).

Summary of a forthcoming paper, AI-assisted and human-reviewed. See the linked original for the authoritative claims and full conditions.

In depth

Q1. What is the key gap in the existing weak instrument testing literature that this paper fills?

The key gap is the absence of a test for weak instruments that is both HAC robust and applicable to models with multiple endogenous regressors. Stock-Yogo (2005) requires conditionally homoskedastic and serially uncorrelated (CHSU) errors. Montiel Olea-Pflueger (2013) introduced a HAC-robust effective F-statistic for a single endogenous regressor but their test does not extend to multiple regressors. Sanderson-Windmeijer (2016) addressed multiple endogenous regressors but retained the CHSU assumption. This paper combines HAC robustness with multiple-regressor generality, filling the gap Andrews et al. (2019) identify as the most important remaining open problem in the literature.
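For intuition about the single-regressor statistic that this paper generalizes, the sketch below computes an effective first-stage F in the spirit of Montiel Olea and Pflueger (2013), using the commonly cited form F_eff = π̂'Z'Zπ̂ / tr(V̂ Z'Z). It is an illustration under stated assumptions, not the paper's multi-regressor test: the function name `effective_f` is hypothetical, the data are assumed to be already partialled out of exogenous controls, and the robust variance is heteroskedasticity-robust (HC0) only, with no autocorrelation correction.

```python
import numpy as np

def effective_f(x, Z, robust=True):
    """Effective first-stage F for ONE endogenous regressor x and instruments Z.

    Assumption: x and Z are already demeaned / partialled out of controls.
    robust=True uses an HC0 variance (heteroskedasticity only, no HAC kernel).
    """
    n, k = Z.shape
    ZtZ = Z.T @ Z
    ZtZ_inv = np.linalg.inv(ZtZ)
    pi_hat = ZtZ_inv @ Z.T @ x            # first-stage coefficients
    resid = x - Z @ pi_hat
    if robust:
        meat = (Z * resid[:, None] ** 2).T @ Z   # sum_i e_i^2 z_i z_i'
        V = ZtZ_inv @ meat @ ZtZ_inv             # sandwich variance of pi_hat
    else:
        V = (resid @ resid / (n - k)) * ZtZ_inv  # homoskedastic variance
    return float(pi_hat @ ZtZ @ pi_hat / np.trace(V @ ZtZ))

# Simulated data with heteroskedastic first-stage errors (illustrative only).
rng = np.random.default_rng(0)
n, k = 500, 3
Z = rng.normal(size=(n, k))
Z -= Z.mean(axis=0)
x = Z @ np.array([0.2, 0.1, 0.0]) + rng.normal(size=n) * (1 + np.abs(Z[:, 0]))
x -= x.mean()

f_robust = effective_f(x, Z, robust=True)
f_classic = effective_f(x, Z, robust=False)
print(f_robust, f_classic)
```

With the homoskedastic variance plugged in, the trace in the denominator collapses to k times the error variance, so the statistic reduces to the usual non-robust first-stage F. The two versions differ only through the variance estimate.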

Q2. What is the test statistic and what are its two bias criteria?

The test statistic is based on a weighted quadratic loss in the asymptotic bias of the 2SLS estimates when first-stage coefficients are close to zero. It supports two criteria: (i) the absolute bias criterion of Stock-Yogo (2005), which measures the 2SLS bias relative to the maximum OLS bias; and (ii) the 2SLS bias relative to Montiel Olea-Pflueger’s (2013) worst-case benchmark. The test accommodates both the Stock-Yogo setting, in which instruments are weak because the first-stage coefficient matrix is near rank zero, and the Sanderson-Windmeijer setting, in which the matrix is close to a rank reduction of one rather than close to rank zero.
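The difference between the two weak-instrument settings can be seen in the singular values of the first-stage coefficient matrix. The sketch below uses two made-up matrices (3 instruments, 2 endogenous regressors); the names and numbers are hypothetical and serve only to illustrate the rank language, not to reproduce the paper's test.

```python
import numpy as np

# Hypothetical first-stage coefficient matrices:
# rows = instruments, columns = endogenous regressors.
pi_near_zero = 0.01 * np.array([[1.0, 0.0],
                                [0.0, 1.0],
                                [1.0, 1.0]])
pi_near_rank_one = np.array([[1.0, 1.0],
                             [2.0, 2.0],
                             [3.0, 3.01]])

# Singular values are returned in descending order.
sv_near_zero = np.linalg.svd(pi_near_zero, compute_uv=False)
sv_near_rank_one = np.linalg.svd(pi_near_rank_one, compute_uv=False)

print("near rank zero:     ", sv_near_zero)
print("near rank reduction:", sv_near_rank_one)
```

In the first matrix every singular value is small, which corresponds to the Stock-Yogo notion of weakness. In the second, one singular value is large and only the smallest is close to zero, which corresponds to the Sanderson-Windmeijer notion: the instruments are strong for one linear combination of the regressors but nearly uninformative for another.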

Q3. What extensions are provided for individual coefficient testing?

Extensions are developed to test whether instruments are weak for individual 2SLS coefficients, by applying the test to a transformed regression that isolates the coefficient of interest, accommodating the Sanderson-Windmeijer (2016) setting in which one regressor is locally under-identified while others may not be. This is important in practice because researchers with multiple endogenous regressors often care about whether instruments are weak for each coefficient separately, not just for the system as a whole; the extension provides a formal basis for this common applied practice.

Q4. What does the empirical application show?

The paper demonstrates the testing procedures in the context of estimating state-dependent fiscal multipliers as in Ramey and Zubairy (2018), where the two endogenous regressors are lagged spending interacted with a state variable (recession/expansion indicator), illustrating both the implementation of the test and how inference differs from relying on CHSU-based critical values. In simulations, the test controls size accurately and is powerful against alternatives where instruments are strong, providing a reliable and practically useful tool with efficient code packages distributed for applied researchers.

Key concepts

Weak instruments test: a test assessing whether the first-stage regression is sufficiently strong to make 2SLS inference reliable; based on the maximum bias of 2SLS relative to a benchmark; weak instruments cause 2SLS to inherit the bias of OLS.

HAC robustness: robustness to heteroskedasticity and autocorrelation; absent from Stock-Yogo (2005), meaning researchers who use their critical values while allowing for HAC errors in second-stage inference apply mismatched validity assumptions.

Effective F-statistic: the statistic introduced by Montiel Olea and Pflueger (2013) for HAC-robust weak instruments testing with a single endogenous regressor; generalized in this paper to the multiple-regressor setting.

Absolute bias criterion: the criterion that the 2SLS relative bias (standardized absolute bias) is below a threshold; equivalently, the 2SLS bias as a proportion of the maximum OLS bias; defined by Stock-Yogo (2005) and generalized here to the HAC-robust multi-instrument setting.

When is TSLS Actually LATE?

This paper asks: when does two-stage least squares (TSLS) with covariates actually estimate a local average treatment effect (LATE) — a non-negatively weighted average of causal effects for compliers only? The authors show that the answer is: almost never in practice.

The paper’s central theoretical result (Theorem 1) is that a linear IV estimand is weakly causal — meaning it cannot have the wrong sign relative to all underlying treatment effects — if and only if the IV specification has “rich covariates,” defined as the condition that the linear projection of the instrument onto the covariates, L[Z|X], equals the true conditional mean E[Z|X] at every covariate value. Saturated specifications (nonparametric covariate control) always satisfy rich covariates. Outside of two special cases — saturated covariates or an instrument that is mean-independent of covariates — rich covariates is an implicit parametric assumption that can fail.

When rich covariates fails, the TSLS estimand is “level dependent”: it depends not only on treatment effects for compliers but also on the levels of potential outcomes for always-takers and never-takers, some of which receive negative weight. The problem arises mechanically because the numerator of the IV estimand, E[Y Z̃], contains a term E[E[Y|X] E[Z̃|X]] that reflects untreated-outcome levels rather than causal contrasts. This term vanishes only when E[Z̃|X] = E[Z|X] − L[Z|X] = 0, i.e., rich covariates.
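The source of that term can be made explicit with the law of iterated expectations. The following is a sketch of the algebra, writing Z̃ = Z − L[Z|X] and T for the treatment; the ratio form of the estimand is the standard linear IV expression and is assumed here rather than quoted from the paper.

```latex
\begin{aligned}
\beta_{iv} &= \frac{E[Y\tilde{Z}]}{E[T\tilde{Z}]}, \qquad \tilde{Z} = Z - L[Z \mid X], \\
E[Y\tilde{Z}] &= E\big[\operatorname{Cov}(Y, Z \mid X)\big]
               + E\big[E[Y \mid X]\, E[\tilde{Z} \mid X]\big], \\
E[\tilde{Z} \mid X] &= E[Z \mid X] - L[Z \mid X].
\end{aligned}
```

The first term in the numerator involves only within-covariate-cell covariation between outcome and instrument. The second term multiplies outcome levels by the projection error and drops out exactly when that error is zero at every covariate value.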

To document how common this failure is in practice, the authors surveyed 122 empirical IV papers published in five top economics journals (JPE, AER, QJE, ReStud, Econometrica) between January 2000 and October 2018. Of the 99 papers using TSLS with covariates, only 5 used a saturated specification at any point and only 1 (Chamberlain and Imbens 2004) used saturated specifications exclusively. Nearly a third of TSLS-with-covariates papers explicitly invoked the LATE interpretation; none reported a test of rich covariates.

The paper applies these findings to thirteen empirical studies. In Card (1995), the original IV estimate of returns to education is 0.132. The Ramsey RESET test overwhelmingly rejects rich covariates, and a double/debiased machine learning (DDML) estimate of the weakly causal quantity β_rich is modestly smaller: the relative specification bias is roughly 8%, and the gap between β_iv and β_rich is about 21% of the OLS–IV gap. In Nunn and Wantchekon (2011), the IV estimate of the slave trade’s effect on trust is nearly four times as large as the DDML estimate; after reestimation, the null of no effect would not be rejected at conventional significance levels. In Dube and Harish (2020), the DDML estimate of β_rich is about 20% smaller than the original IV estimate (roughly 40% of the OLS–IV gap) and is no longer significantly different from zero at conventional levels.

The paper also shows that Abadie’s (2003) kappa-weighting approach fails under the same necessary condition: it is weakly causal if and only if rich covariates holds, at which point it is numerically identical to standard IV — leaving no reason to use it. Monte Carlo simulations calibrated to Card (1995) show that saturated specifications can exhibit substantial finite-sample bias when the covariate support is large relative to the sample, while DDML partially linear IV (PLIV) converges to β_rich with decreasing bias as sample size grows.

The authors conclude that two conditions are jointly necessary for TSLS to be interpretable as a non-negatively weighted average of LATEs: (i) rich covariates, and (ii) a first-stage flexible enough to capture any covariate-varying direction of monotonicity. Both conditions fail routinely in published work. The recommended alternatives are: DDML PLIV for estimating β_rich (a weakly causal weighted average of conditional LATEs), or instrument propensity score weighting / Abadie kappa with correctly estimated E[Z|X] for estimating the unconditional ACR/LATE. The Ramsey RESET test is offered as a practical diagnostic for rich covariates violations, and it detected sizable discrepancies in each of the thirteen applications examined.

Q: What is the paper’s central theoretical result? A: Theorem 1 establishes that, given conditional exogeneity and monotonicity, the linear IV estimand β_iv is weakly causal (i.e., cannot systematically misrepresent the sign of treatment effects) if and only if the IV specification has rich covariates (L[Z|X] = E[Z|X] at every covariate value x). Rich covariates is therefore both sufficient and necessary; the sufficiency direction was a special case of Kolesár (2013), while the necessity direction is novel to this paper.

Q: What does “rich covariates” mean and when is it satisfied? A: Rich covariates means that the linear projection of the instrument onto the included covariates exactly reproduces the instrument’s true conditional mean at every point in the covariate support. It is automatically satisfied in two cases: when covariates are specified saturatedly (with an indicator for each covariate cell), or when the instrument is mean-independent of all covariates so E[Z|X] is a constant. Outside these cases, rich covariates is an implicit parametric functional form assumption.
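A small population example may help make the definition concrete. The sketch below uses a made-up covariate with three support points and a binary instrument whose conditional mean is nonlinear in the covariate; all names and numbers are hypothetical. It compares the projection L[Z|X] from a specification that is linear in X with the projection from a saturated specification.

```python
import numpy as np

# Hypothetical population: X takes three values with equal probability,
# and the binary instrument has a conditional mean that is nonlinear in X.
x_vals = np.array([0.0, 1.0, 2.0])
p_x = np.array([1 / 3, 1 / 3, 1 / 3])
e_z_given_x = np.array([0.1, 0.2, 0.9])   # E[Z | X = x]

def linear_projection(design):
    """Population projection L[Z|X] evaluated at each support point.

    Projecting Z on the design is equivalent to a probability-weighted
    projection of E[Z|X] on the same design.
    """
    w = np.diag(p_x)
    coef = np.linalg.solve(design.T @ w @ design, design.T @ w @ e_z_given_x)
    return design @ coef

proj_linear = linear_projection(np.column_stack([np.ones(3), x_vals]))
proj_saturated = linear_projection(np.eye(3))   # one indicator per cell

gap_linear = e_z_given_x - proj_linear          # E[Z~ | X] under linear controls
gap_saturated = e_z_given_x - proj_saturated    # zero by construction

print("linear controls:   ", proj_linear, gap_linear)
print("saturated controls:", proj_saturated, gap_saturated)
```

The saturated projection reproduces the conditional mean cell by cell, so rich covariates holds. The linear-in-X projection cannot match a nonlinear conditional mean on three support points, so the gap is nonzero and rich covariates fails.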

Q: What goes wrong when rich covariates fails? A: When L[Z|X] ≠ E[Z|X], the IV estimand becomes “level dependent”: it depends not only on treatment effects (causal contrasts) but also on the levels of potential outcomes for always-takers and never-takers. Because always-takers always receive Y(1) and never-takers always receive Y(0), the estimand picks up these levels through the term E[E[Y|X] E[Z̃|X]], which is nonzero whenever E[Z̃|X] = E[Z|X] − L[Z|X] ≠ 0. This can cause β_iv to be negative even when all complier and always-taker treatment effects are positive.

Q: How is the paper’s critique different from the two-way fixed effects (TWFE) literature? A: The TWFE literature (Goodman-Bacon 2021; Sun and Abraham 2021) identifies negative-weight problems arising from heterogeneous treatment effects due to cohort timing, but those estimands are not level dependent. By contrast, the TSLS problems identified here involve level dependence and persist even under constant, homogeneous treatment effects (Proposition 5), making the critique more fundamental and harder to dismiss by assuming effect homogeneity.

Q: Does the problem disappear if treatment effects are constant? A: No. Proposition 5 shows that rich covariates remains necessary for β_iv to be weakly causal even under Assumption CLE (constant, linear treatment effects). Level dependence occurs whenever E[Z̃|X] ≠ 0, regardless of effect heterogeneity. The only additional assumption that can substitute is Assumption LIN (linear potential outcome means), which together with constant effects implies β_iv = Δ exactly (Proposition 6), but this combination is a strong parametric restriction.
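The following toy calculation illustrates the point about constant effects. It is a constructed example, not one taken from the paper: everyone complies (T = Z), the treatment effect is a constant `delta = 1`, and the untreated outcome level is a function g(X) of a three-valued covariate. All names and numbers are hypothetical. The population IV estimand is computed exactly from E[Y Z̃] / E[T Z̃].

```python
import numpy as np

x_vals = np.array([0.0, 1.0, 2.0])
p_x = np.array([1 / 3, 1 / 3, 1 / 3])
e_z = np.array([0.1, 0.2, 0.9])   # E[Z | X], binary instrument
delta = 1.0                        # constant treatment effect, T = Z

def beta_iv(g, design):
    """Population IV estimand for Y = delta * T + g(X) with T = Z."""
    w = np.diag(p_x)
    coef = np.linalg.solve(design.T @ w @ design, design.T @ w @ e_z)
    proj = design @ coef                      # L[Z | X]
    m = e_z - proj                            # E[Z~ | X]
    denom = p_x @ (e_z * (1 - proj))          # E[T Z~], using Z^2 = Z
    numer = delta * denom + p_x @ (g * m)     # E[Y Z~] = delta*E[T Z~] + E[g(X) E[Z~|X]]
    return numer / denom

linear_design = np.column_stack([np.ones(3), x_vals])
saturated_design = np.eye(3)
g_nonlinear = np.array([0.0, 4.0, 0.0])   # outcome level nonlinear in X
g_linear = 4.0 * x_vals                   # outcome level linear in X

b_linear_spec = beta_iv(g_nonlinear, linear_design)
b_saturated_spec = beta_iv(g_nonlinear, saturated_design)
b_linear_levels = beta_iv(g_linear, linear_design)
print(b_linear_spec, b_saturated_spec, b_linear_levels)
```

In this example the linear specification yields an estimand of about −1 even though the effect is +1 for everyone, because the level term E[g(X) E[Z̃|X]] dominates. Saturating the covariates, or making the outcome level linear in X (in the spirit of Assumption LIN), removes the level term and recovers the constant effect.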

Q: What does the survey of empirical papers find? A: Of 122 IV papers in top journals from 2000–2018, 112 used TSLS, and 99 of those included covariates. Of the 99, only 5 (about 5%) used any saturated specification, and only 1 used saturated specifications exclusively. About a third of TSLS-with-covariates papers explicitly invoked the LATE interpretation. No papers reported a test of rich covariates such as the Ramsey RESET test.

Q: What happens in the Card (1995) returns-to-education application? A: The original linear IV estimate of the return to education is 0.132. The RESET test overwhelmingly rejects the null of rich covariates. The DDML estimate of β_rich is modestly smaller, with a relative specification bias of about 0.076 (roughly 8%). The gap between β_iv and β_rich represents about 21% of the OLS–IV gap, which the authors characterize as a sizable fraction of the “selection bias” corrected by IV. The DDML estimate of the unconditional ACR/LATE (β_acr) is roughly half the size of β_rich.

Q: What happens in Nunn and Wantchekon (2011)? A: The RESET test overwhelmingly rejects rich covariates. The IV estimate of the slave trade effect on trust is nearly four times as large as the DDML estimate of β_rich. After reestimation, the null hypothesis that the slave trade had no impact on trust levels would not be rejected at conventional significance levels, calling the original study’s central finding into question.

Q: What happens in Dube and Harish (2020)? A: The RESET test overwhelmingly rejects rich covariates. The DDML estimate of β_rich is about 20% smaller than the original IV estimate, representing roughly 40% of the OLS–IV gap. While estimated with similar precision, the DDML estimate is no longer significantly different from zero at conventional significance levels.

Q: Does Abadie’s (2003) kappa-weighting approach solve the problem? A: No. Proposition 7 shows that the kappa-weighted estimand β_abadie is weakly causal if and only if rich covariates holds. Moreover, when rich covariates holds, β_abadie is numerically identical to β_iv, so kappa weighting provides no additional benefit. When rich covariates fails, kappa weighting is not weakly causal for the same reason as standard IV.

Q: What does the Monte Carlo simulation show about practical alternatives? A: The simulation, calibrated to Card (1995) data with covariates (experience, region indicators), shows three things. First, a linear IV specification without rich covariates converges to β_iv = 0.660, decomposed as +0.391 from positively-weighted compliers, +0.614 from positively-weighted always-takers, and −0.345 from negatively-weighted always-takers, whereas the true weakly causal quantity is β_rich = 0.430. Second, saturated specifications converge to β_rich but exhibit substantial bias when the sample is small relative to the covariate support. Third, DDML PLIV converges to β_rich with bias decreasing in sample size, making it the recommended practical estimator.
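As an arithmetic check on the reported decomposition, the three components sum to the linear IV limit, and the gap to the weakly causal target follows directly:

```latex
\beta_{iv} = 0.391 + 0.614 - 0.345 = 0.660,
\qquad
\beta_{iv} - \beta_{rich} = 0.660 - 0.430 = 0.230 .
```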

Q: What is the relationship between this paper and Sloczynski (2020, 2024)? A: Sloczynski (2020, 2024) maintains rich covariates as an assumption and shows that TSLS can still fail to be weakly causal if monotonicity direction varies with covariates and the first stage omits instrument-covariate interactions. This paper focuses on the necessity of rich covariates itself, under strong (unconditional) monotonicity. Taken together, the two papers establish that both rich covariates and a sufficiently flexible first stage are jointly necessary for TSLS to be interpretable as a non-negatively weighted average of LATEs.

Q: What practical recommendations do the authors offer? A: The authors recommend: (1) always running the Ramsey RESET test to check rich covariates, implementable in Stata or R; (2) if using a binary instrument, checking that fitted values L[Z|X] lie in [0,1], necessary for rich covariates; (3) using DDML PLIV to estimate the weakly causal β_rich nonparametrically; and (4) for binary instrument/treatment, using instrument propensity score weighting (e.g., Sloczynski et al. 2024) or Abadie kappa with correctly estimated E[Z|X] to target the unconditional ACR/LATE. All recommended methods are available in mature Stata or R packages.
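The first two checks can be sketched in a few lines. The code below is a hand-rolled RESET-style F statistic written with NumPy only, on the assumption that the diagnostic is applied to the regression of the instrument on the covariates; the function name `reset_f_stat`, the simulated data, and the choice of powers are hypothetical, and a packaged implementation in Stata or R would normally be preferred in applied work.

```python
import numpy as np

def reset_f_stat(z, X, powers=(2, 3)):
    """RESET-style F statistic for the regression of z on X.

    X must include a constant column. The augmented regression adds
    powers of the fitted values and tests them jointly.
    """
    n = len(z)
    coef_r, *_ = np.linalg.lstsq(X, z, rcond=None)
    fitted = X @ coef_r                                  # L[Z | X] in sample
    rss_r = np.sum((z - fitted) ** 2)
    X_u = np.column_stack([X] + [fitted ** p for p in powers])
    coef_u, *_ = np.linalg.lstsq(X_u, z, rcond=None)
    rss_u = np.sum((z - X_u @ coef_u) ** 2)
    q = len(powers)
    f_stat = ((rss_r - rss_u) / q) / (rss_u / (n - X_u.shape[1]))
    return f_stat, fitted

# Simulated binary instrument whose conditional mean is nonlinear in x.
rng = np.random.default_rng(1)
n = 2000
x = rng.uniform(-2, 2, size=n)
p = 1 / (1 + np.exp(-3 * x))
z = (rng.uniform(size=n) < p).astype(float)
X = np.column_stack([np.ones(n), x])

f_stat, fitted = reset_f_stat(z, X)
share_outside = np.mean((fitted < 0) | (fitted > 1))     # check (2) for binary Z
print(f_stat, share_outside)
```

A large F statistic signals that powers of the fitted values help predict the instrument, which is evidence against L[Z|X] = E[Z|X]. A positive share of fitted values outside [0,1] is a second, cruder warning sign for a binary instrument, since a true conditional probability cannot leave that interval.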

Rich covariates: The condition that the linear projection of the instrument Z onto the included covariates X, denoted L[Z|X], exactly equals the true nonparametric conditional mean E[Z|X] at every point in the covariate support. This is both necessary and sufficient for the linear IV estimand to be weakly causal under exogeneity and monotonicity. It is automatically satisfied by saturated covariate specifications or when the instrument is mean-independent of covariates; otherwise it is an implicit parametric assumption.

Weakly causal estimand: An estimand β is weakly causal if, whenever all subgroup- and covariate-specific treatment effects have the same sign, β has that sign too. This is an intentionally minimal requirement — it merely asks that the estimand not be systematically misleading about the direction of causal effects. An estimand can be weakly causal and still be difficult to interpret as a specific population parameter.

Level dependence: The phenomenon in which a linear IV estimand depends not only on treatment effects (causal contrasts μ_j(g,x) − μ_{j−1}(g,x)) but also on the levels of potential outcomes (the baseline μ_0(g,x) terms). Level dependence arises when E[Z̃|X] = E[Z|X] − L[Z|X] ≠ 0, causing the always-taker and never-taker potential outcome levels to enter the estimand and potentially reverse its sign.

Local average treatment effect (LATE): The average treatment effect for the subpopulation of compliers — those whose treatment status is changed by the instrument. In the binary treatment, binary instrument case, LATE = E[Y(1) − Y(0) | T(1) > T(0)]. LATE has a concrete counterfactual interpretation and is non-negatively weighted by construction; the paper asks under what conditions TSLS actually estimates a weighted average of LATEs.

Partially linear IV (PLIV) / DDML: A modification of classical linear IV in which the linear function of covariates is replaced by an unknown nonparametric function, estimated using machine learning methods (random forests, gradient boosted trees, neural networks) with cross-fitting, as in Chernozhukov et al. (2018). The coefficient on treatment in the PLIV model equals β_rich, the weakly causal IV estimand that would result if rich covariates were exactly satisfied.

Unconditional average causal response (ACR): When the instrument is binary, ACR = E[Y(T(1)) − Y(T(0)) | T(1) > T(0)], which reduces to the unconditional LATE when treatment is also binary. ACR differs from β_rich because β_rich places extra weight on covariate values with more instrument variation, while ACR weights compliers equally regardless of covariate-specific instrument variance. The paper documents that DDML estimates of β_acr can be roughly half the size of β_rich.
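The weighting difference can be illustrated with two covariate cells. The sketch below assumes a binary instrument and binary treatment, and uses the standard result that the rich-covariates estimand weights each cell's LATE by P(X = x) · Var(Z|X = x) · (complier share), while the unconditional LATE weights by P(X = x) · (complier share). The formulas are stated here as assumptions rather than quoted from the paper, and all numbers are made up.

```python
import numpy as np

p_x = np.array([0.5, 0.5])             # P(X = x) for two covariate cells
p_z = np.array([0.5, 0.1])             # P(Z = 1 | X = x)
complier_share = np.array([0.4, 0.4])  # first stage within each cell
late_x = np.array([1.0, 3.0])          # cell-specific LATE

var_z = p_z * (1 - p_z)                # Var(Z | X = x) for binary Z

w_rich = p_x * var_z * complier_share  # extra weight where Z varies more
w_acr = p_x * complier_share           # compliers weighted equally

beta_rich = np.sum(w_rich * late_x) / np.sum(w_rich)
beta_acr = np.sum(w_acr * late_x) / np.sum(w_acr)
print(beta_rich, beta_acr)
```

Here the cell with the larger effect has little instrument variation, so β_rich downweights it and comes out below the unconditional LATE. Reversing which cell has more instrument variation would flip the ordering, so neither quantity is larger in general.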

Saturate and weight (SW) specification: The TSLS specification proposed by Angrist and Pischke (2009, Theorem 4.5.1), in which the covariates are fully saturated and the saturated instrument-covariate interactions serve as excluded instruments in the first stage. SW is guaranteed to satisfy rich covariates and, under weak monotonicity allowing direction to vary with covariates, produces a non-negatively weighted average of covariate-specific LATEs. It was used by only one paper (Chamberlain and Imbens 2004) among the 99 TSLS-with-covariates papers in the authors’ survey of 122 empirical IV papers.