Macro Paper Warehouse Forthcoming macro & monetary research
Forthcoming [Review of Economic Studies] doi:10.1093/restud/rdag029

When is TSLS Actually LATE?

Christine Blandhol

John Bonney

Magne Mogstad

Alexander Torgovitsky

What this paper finds — and why it matters

This paper asks: when does two-stage least squares (TSLS) with covariates actually estimate a local average treatment effect (LATE) — a non-negatively weighted average of causal effects for compliers only? The authors show that the answer is: almost never in practice.

The paper’s central theoretical result (Theorem 1) is that, under exogeneity and monotonicity, a linear IV estimand is weakly causal (meaning it takes the sign of the underlying treatment effects whenever those effects all share the same sign) if and only if the IV specification has “rich covariates,” defined as the condition that the linear projection of the instrument onto the covariates, L[Z|X], equals the true conditional mean E[Z|X] at every covariate value. Saturated specifications (nonparametric covariate control) always satisfy rich covariates. Outside of two special cases, saturated covariates or an instrument that is mean-independent of covariates, rich covariates is an implicit parametric assumption that can fail.

When rich covariates fails, the TSLS estimand is “level dependent”: it depends not only on treatment effects for compliers but also on the levels of potential outcomes for always-takers and never-takers, some of which receive negative weight. The problem arises mechanically because the numerator of the IV estimand, E[Y Z̃], contains a term E[E[Y|X] E[Z̃|X]] that reflects untreated-outcome levels rather than causal contrasts. This term vanishes only when E[Z̃|X] = E[Z|X] − L[Z|X] = 0, i.e., rich covariates.
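A small population calculation makes the mechanism concrete. The sketch below is illustrative only: the covariate distribution `P_X`, the instrument propensities `P_Z`, the group shares, and the untreated outcome levels `MU0` are hypothetical numbers chosen for this example, not values from the paper. Every unit has the same treatment effect of +1, yet the linear specification produces a negative estimand because E[Z|X] is not linear in X, while the saturated specification recovers +1.

```python
from itertools import product

# Hypothetical population (all numbers are assumptions for illustration).
P_X = {0: 1 / 3, 1: 1 / 3, 2: 1 / 3}          # distribution of the covariate X
P_Z = {0: 0.2, 1: 0.8, 2: 0.2}                # E[Z | X = x], nonlinear in x
SHARES = {"complier": 0.5, "always": 0.25, "never": 0.25}
MU0 = {0: 0.0, 1: -5.0, 2: 0.0}               # untreated outcome level by x
DELTA = 1.0                                   # constant treatment effect


def cells():
    """Enumerate (probability, x, z, t, y) over the whole population."""
    for x, z, g in product(P_X, (0, 1), SHARES):
        prob = P_X[x] * (P_Z[x] if z == 1 else 1 - P_Z[x]) * SHARES[g]
        t = {"complier": z, "always": 1, "never": 0}[g]
        yield prob, x, z, t, MU0[x] + DELTA * t


def expect(f):
    """Population expectation of f(x, z, t, y)."""
    return sum(prob * f(x, z, t, y) for prob, x, z, t, y in cells())


def iv_estimand(fitted):
    """E[Y * Ztilde] / E[T * Ztilde], with Ztilde = Z - fitted(X)."""
    num = expect(lambda x, z, t, y: y * (z - fitted(x)))
    den = expect(lambda x, z, t, y: t * (z - fitted(x)))
    return num / den


# Linear projection L[Z|X] = a + b*X, computed from population moments.
ex = expect(lambda x, z, t, y: x)
ez = expect(lambda x, z, t, y: z)
b = (expect(lambda x, z, t, y: x * z) - ex * ez) / (
    expect(lambda x, z, t, y: x * x) - ex * ex
)
a = ez - b * ex

beta_linear = iv_estimand(lambda x: a + b * x)   # rich covariates fails
beta_saturated = iv_estimand(lambda x: P_Z[x])   # rich covariates holds

print(round(beta_linear, 3), round(beta_saturated, 3))
```

In this constructed case the linear specification gives roughly -4.56 and the saturated one gives 1. The only difference between the two calls is whether the residualized instrument has mean zero within each covariate cell.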

To document how common this failure is in practice, the authors surveyed 122 empirical IV papers published in five top economics journals (JPE, AER, QJE, ReStud, Econometrica) between January 2000 and October 2018. Of the 99 papers using TSLS with covariates, only 5 used a saturated specification at any point and only 1 (Chamberlain and Imbens 2004) used saturated specifications exclusively. Nearly a third of TSLS-with-covariates papers explicitly invoked the LATE interpretation; none reported a test of rich covariates.

The paper applies these findings to thirteen empirical studies. In Card (1995), the original IV estimate of returns to education is 0.132; the Ramsey RESET test overwhelmingly rejects rich covariates, and a DDML estimate of the weakly causal quantity β_rich is modestly smaller, with a relative specification bias of roughly 8% and the gap between β_iv and β_rich representing about 21% of the OLS–IV gap. In Nunn and Wantchekon (2011), the IV estimate of the slave trade’s effect on trust is nearly four times as large as the DDML estimate; after reestimation, the null of no effect would not be rejected at conventional significance levels. In Dube and Harish (2020), the DDML estimate of β_rich is about 20% smaller than the original IV estimate (roughly 40% of the OLS–IV gap) and is no longer significantly different from zero at conventional levels.

The paper also shows that Abadie’s (2003) kappa-weighting approach is subject to the same necessary condition: it is weakly causal if and only if rich covariates holds, at which point it is numerically identical to standard IV, leaving no reason to use it. Monte Carlo simulations calibrated to Card (1995) show that saturated specifications can exhibit substantial finite-sample bias when the covariate support is large relative to the sample, while DDML partially linear IV (PLIV) converges to β_rich with decreasing bias as sample size grows.

The authors conclude that two conditions are jointly necessary for TSLS to be interpretable as a non-negatively weighted average of LATEs: (i) rich covariates, and (ii) a first stage flexible enough to capture any covariate-varying direction of monotonicity. Both conditions fail routinely in published work. The recommended alternatives are: DDML PLIV for estimating β_rich (a weakly causal weighted average of conditional LATEs), or instrument propensity score weighting / Abadie kappa with correctly estimated E[Z|X] for estimating the unconditional ACR/LATE. The Ramsey RESET test is offered as a practical diagnostic for rich covariates violations, and it detected sizable discrepancies in each of the thirteen applications examined.

Q: What is the paper’s central theoretical result? A: Theorem 1 establishes that, given conditional exogeneity and monotonicity, the linear IV estimand β_iv is weakly causal — i.e., cannot systematically misrepresent the sign of treatment effects — if and only if the IV specification has rich covariates (L[Z|X] = E[Z|X] for every covariate value x). Rich covariates is therefore simultaneously sufficient and necessary; the sufficient direction was a special case of Kolesar (2013), while the necessary direction is novel to this paper.

Q: What does “rich covariates” mean and when is it satisfied? A: Rich covariates means that the linear projection of the instrument onto the included covariates exactly reproduces the instrument’s true conditional mean at every point in the covariate support. It is automatically satisfied in two cases: when the covariate specification is fully saturated (with an indicator for each covariate cell), or when the instrument is mean-independent of all covariates so E[Z|X] is a constant. Outside these cases, rich covariates is an implicit parametric functional form assumption.
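The condition can be checked cell by cell when the covariate is discrete. The sketch below is a minimal illustration with a single covariate on a hypothetical three-point support; `projection_gap` and the example propensities are names and numbers introduced here, not taken from the paper. It returns E[Z|X=x] − L[Z|X=x] for each cell, which is zero everywhere exactly when rich covariates holds for the specification that controls for X linearly.

```python
def projection_gap(p_x, p_z):
    """Return {x: E[Z|X=x] - L[Z|X=x]} for a linear projection of Z on (1, X).

    p_x maps each covariate value to its probability.
    p_z maps each covariate value to E[Z | X = x].
    """
    ex = sum(p * x for x, p in p_x.items())
    ez = sum(p_x[x] * p_z[x] for x in p_x)
    var_x = sum(p * (x - ex) ** 2 for x, p in p_x.items())
    # Cov(X, Z) equals Cov(X, E[Z|X]) by iterated expectations.
    cov_xz = sum(p_x[x] * (x - ex) * (p_z[x] - ez) for x in p_x)
    slope = cov_xz / var_x
    intercept = ez - slope * ex
    return {x: p_z[x] - (intercept + slope * x) for x in p_x}


uniform = {0: 1 / 3, 1: 1 / 3, 2: 1 / 3}

gap_constant = projection_gap(uniform, {0: 0.3, 1: 0.3, 2: 0.3})   # mean-independent Z
gap_linear = projection_gap(uniform, {0: 0.2, 1: 0.4, 2: 0.6})     # E[Z|X] linear in X
gap_nonlinear = projection_gap(uniform, {0: 0.2, 1: 0.8, 2: 0.2})  # E[Z|X] nonlinear

for name, gap in [("constant", gap_constant), ("linear", gap_linear),
                  ("nonlinear", gap_nonlinear)]:
    print(name, {x: round(g, 3) for x, g in gap.items()})
```

The first two cases have zero gaps in every cell; the third does not. A saturated specification would replace the linear fit with one indicator per cell and reproduce E[Z|X] by construction.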

Q: What goes wrong when rich covariates fails? A: When L[Z|X] ≠ E[Z|X], the IV estimand becomes “level dependent”: it depends not only on treatment effects (causal contrasts) but also on the levels of potential outcomes for always-takers and never-takers. Because always-takers always receive Y(1) and never-takers always receive Y(0), the estimand picks up these levels through the term E[E[Y|X] E[Z̃|X]], which is nonzero whenever E[Z̃|X] = E[Z|X] − L[Z|X] ≠ 0. This can cause β_iv to be negative even when all complier and always-taker treatment effects are positive.

Q: How is the paper’s critique different from the two-way fixed effects (TWFE) literature? A: The TWFE literature (Goodman-Bacon 2021; Sun and Abraham 2021) identifies negative-weight problems arising from heterogeneous treatment effects due to cohort timing, but those estimands are not level dependent. By contrast, the TSLS problems identified here involve level dependence and persist even under constant, homogeneous treatment effects (Proposition 5), making the critique more fundamental and harder to dismiss by assuming effect homogeneity.

Q: Does the problem disappear if treatment effects are constant? A: No. Proposition 5 shows that rich covariates remains necessary for β_iv to be weakly causal even under Assumption CLE (constant, linear treatment effects). Level dependence occurs whenever E[Z̃|X] ≠ 0, regardless of effect heterogeneity. The only additional assumption that can substitute is Assumption LIN (linear potential outcome means), which together with constant effects implies β_iv = Δ exactly (Proposition 6), but this combination is a strong parametric restriction.

Q: What does the survey of empirical papers find? A: Of 122 IV papers in top journals from 2000–2018, 112 used TSLS, and 99 of those included covariates. Of the 99, only 5 (about 5%) used any saturated specification, and only 1 used saturated specifications exclusively. About a third of TSLS-with-covariates papers explicitly invoked the LATE interpretation. No papers reported a test of rich covariates such as the Ramsey RESET test.

Q: What happens in the Card (1995) returns-to-education application? A: The original linear IV estimate of the return to education is 0.132. The RESET test overwhelmingly rejects the null of rich covariates. The DDML estimate of β_rich is modestly smaller, with a relative specification bias of about 0.076 (roughly 8%). The gap between β_iv and β_rich represents about 21% of the OLS–IV gap, which the authors characterize as a sizable fraction of the “selection bias” corrected by IV. The DDML estimate of the unconditional ACR/LATE (β_acr) is roughly half the size of β_rich.

Q: What happens in Nunn and Wantchekon (2011)? A: The RESET test overwhelmingly rejects rich covariates. The IV estimate of the slave trade effect on trust is nearly four times as large as the DDML estimate of β_rich. After reestimation, the null hypothesis that the slave trade had no impact on trust levels would not be rejected at conventional significance levels, reversing the paper’s central finding.

Q: What happens in Dube and Harish (2020)? A: The RESET test overwhelmingly rejects rich covariates. The DDML estimate of β_rich is about 20% smaller than the original IV estimate, a difference equal to roughly 40% of the OLS–IV gap. While estimated with similar precision, the DDML estimate is no longer significantly different from zero at conventional significance levels.

Q: Does Abadie’s (2003) kappa-weighting approach solve the problem? A: No. Proposition 7 shows that the kappa-weighted estimand β_abadie is weakly causal if and only if rich covariates holds. Moreover, when rich covariates holds, β_abadie is numerically identical to β_iv, so kappa weighting provides no additional benefit. When rich covariates fails, kappa weighting is not weakly causal for the same reason as standard IV.

Q: What does the Monte Carlo simulation show about practical alternatives? A: The simulation, calibrated to Card (1995) data with covariates (experience, region indicators), shows that: a linear IV specification without rich covariates converges to β_iv = 0.660, decomposed as +0.391 from positively-weighted compliers, +0.614 from positively-weighted always-takers, and −0.345 from negatively-weighted always-takers — when the true weakly causal quantity β_rich = 0.430. Saturated specifications converge to β_rich but exhibit substantial bias at small sample sizes relative to covariate support. DDML PLIV converges to β_rich with bias decreasing in sample size, making it the recommended practical estimator.
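The reported decomposition can be verified with simple arithmetic. The snippet below uses only the figures quoted above; the dictionary labels are shorthand introduced here.

```python
# Components of the linear IV estimand as reported in the simulation summary.
components = {
    "compliers (positive weight)": 0.391,
    "always-takers (positive weight)": 0.614,
    "always-takers (negative weight)": -0.345,
}

beta_iv = sum(components.values())
beta_rich = 0.430                      # weakly causal target reported above
gap = beta_iv - beta_rich              # how far linear IV sits from beta_rich

print(round(beta_iv, 3), round(gap, 3))
```

The components sum to the reported 0.660, which leaves a gap of 0.230 relative to β_rich.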

Q: What is the relationship between this paper and Sloczynski (2020, 2024)? A: Sloczynski (2020, 2024) maintains rich covariates as an assumption and shows that TSLS can still fail to be weakly causal if monotonicity direction varies with covariates and the first stage omits instrument-covariate interactions. This paper focuses on the necessity of rich covariates itself, under strong (unconditional) monotonicity. Taken together, the two papers establish that both rich covariates and a sufficiently flexible first stage are jointly necessary for TSLS to be interpretable as a non-negatively weighted average of LATEs.

Q: What practical recommendations do the authors offer? A: The authors recommend: (1) always running the Ramsey RESET test to check rich covariates, implementable in Stata or R; (2) if using a binary instrument, checking that fitted values L[Z|X] lie in [0,1], necessary for rich covariates; (3) using DDML PLIV to estimate the weakly causal β_rich nonparametrically; and (4) for binary instrument/treatment, using instrument propensity score weighting (e.g., Sloczynski et al. 2024) or Abadie kappa with correctly estimated E[Z|X] to target the unconditional ACR/LATE. All recommended methods are available in mature Stata or R packages.
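The first two diagnostics can be sketched in a few lines. This is a hedged illustration, not the authors' implementation: it uses a single covariate, adds only the squared fitted value to the instrument regression, and reports a homoskedastic F statistic without a p-value. The helper names (`ols`, `reset_f_stat`, `out_of_range`) and the toy data are hypothetical; applied work would normally rely on the packaged Stata or R routines.

```python
def ols(columns, y):
    """OLS of y on the given columns via the normal equations.

    Returns (coefficients, fitted values, residual sum of squares).
    """
    n, k = len(y), len(columns)
    # Augmented normal-equation matrix [X'X | X'y].
    m = [
        [sum(ci[r] * cj[r] for r in range(n)) for cj in columns]
        + [sum(ci[r] * y[r] for r in range(n))]
        for ci in columns
    ]
    # Gaussian elimination with partial pivoting, then back substitution.
    for i in range(k):
        piv = max(range(i, k), key=lambda r: abs(m[r][i]))
        m[i], m[piv] = m[piv], m[i]
        for r in range(i + 1, k):
            f = m[r][i] / m[i][i]
            for c in range(i, k + 1):
                m[r][c] -= f * m[i][c]
    coef = [0.0] * k
    for i in reversed(range(k)):
        tail = sum(m[i][j] * coef[j] for j in range(i + 1, k))
        coef[i] = (m[i][k] - tail) / m[i][i]
    fitted = [sum(coef[j] * columns[j][r] for j in range(k)) for r in range(n)]
    rss = sum((y[r] - fitted[r]) ** 2 for r in range(n))
    return coef, fitted, rss


def reset_f_stat(x, z):
    """F statistic for adding squared fitted values to the regression of z on (1, x)."""
    n = len(z)
    ones = [1.0] * n
    _, fitted, rss_restricted = ols([ones, x], z)
    _, _, rss_unrestricted = ols([ones, x, [f * f for f in fitted]], z)
    # One added regressor, three parameters in the unrestricted model.
    return (rss_restricted - rss_unrestricted) / (rss_unrestricted / (n - 3))


def out_of_range(x, z):
    """Covariate values whose linear fitted value for a binary z leaves [0, 1]."""
    ones = [1.0] * len(z)
    _, fitted, _ = ols([ones, x], z)
    return sorted({xi for xi, f in zip(x, fitted) if f < 0.0 or f > 1.0})


# Hypothetical data: the instrument switches on once x reaches 3.
x = [float(v) for v in range(11)]
z = [0.0, 0.0, 0.0] + [1.0] * 8

print(round(reset_f_stat(x, z), 2), out_of_range(x, z))
```

On this toy data the squared term improves the fit noticeably and the linear fit exceeds 1 at the three largest covariate values, so both checks point away from rich covariates.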

Rich covariates: The condition that the linear projection of the instrument Z onto the included covariates X, denoted L[Z|X], exactly equals the true nonparametric conditional mean E[Z|X] at every point in the covariate support. This is both necessary and sufficient for the linear IV estimand to be weakly causal under exogeneity and monotonicity. It is automatically satisfied by saturated covariate specifications or when the instrument is mean-independent of covariates; otherwise it is an implicit parametric assumption.

Weakly causal estimand: An estimand β is weakly causal if, whenever all subgroup- and covariate-specific treatment effects have the same sign, β has that sign too. This is an intentionally minimal requirement — it merely asks that the estimand not be systematically misleading about the direction of causal effects. An estimand can be weakly causal and still be difficult to interpret as a specific population parameter.

Level dependence: The phenomenon in which a linear IV estimand depends not only on treatment effects (causal contrasts μ_j(g,x) − μ_{j−1}(g,x)) but also on the levels of potential outcomes (the baseline μ_0(g,x) terms). Level dependence arises when E[Z̃|X] = E[Z|X] − L[Z|X] ≠ 0, causing the always-taker and never-taker potential outcome levels to enter the estimand and potentially reverse its sign.

Local average treatment effect (LATE): The average treatment effect for the subpopulation of compliers — those whose treatment status is changed by the instrument. In the binary treatment, binary instrument case, LATE = E[Y(1) − Y(0) | T(1) > T(0)]. LATE has a concrete counterfactual interpretation and is non-negatively weighted by construction; the paper asks under what conditions TSLS actually estimates a weighted average of LATEs.
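As a worked illustration of the binary case, the sketch below builds a hypothetical population of compliers, always-takers, and never-takers and confirms that the Wald ratio of reduced form to first stage equals the complier average effect. The shares and potential outcomes are invented for the example, and the instrument is assumed to be independent of type.

```python
# Hypothetical types: (population share, Y(0), Y(1)). Illustrative numbers only.
TYPES = {
    "complier": (0.4, 1.0, 3.0),
    "always": (0.3, 3.0, 8.0),
    "never": (0.3, 0.0, -1.0),
}


def treated(group, z):
    """Treatment status T(z) for each type."""
    return {"complier": z, "always": 1, "never": 0}[group]


def mean_given_z(z, what):
    """E[T | Z = z] if what == 'T', else E[Y | Z = z]."""
    total = 0.0
    for group, (share, y0, y1) in TYPES.items():
        t = treated(group, z)
        total += share * (t if what == "T" else (y1 if t else y0))
    return total


reduced_form = mean_given_z(1, "Y") - mean_given_z(0, "Y")
first_stage = mean_given_z(1, "T") - mean_given_z(0, "T")
wald = reduced_form / first_stage

late = TYPES["complier"][2] - TYPES["complier"][1]   # E[Y(1) - Y(0) | complier]

print(round(wald, 6), late)
```

The always-taker and never-taker outcomes cancel in the reduced form because their treatment status does not respond to the instrument, which is why their (different) effects do not enter the ratio.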

Partially linear IV (PLIV) / DDML: A modification of classical linear IV in which the linear function of covariates is replaced by an unknown nonparametric function, estimated using machine learning methods (random forests, gradient boosted trees, neural networks) with cross-fitting, as in Chernozhukov et al. (2018). The coefficient on treatment in the PLIV model equals β_rich, the weakly causal IV estimand that would result if rich covariates were exactly satisfied.
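The cross-fitting structure can be sketched without any machine learning library. In the illustration below, within-cell means of a discrete covariate stand in for the machine learner, which is an assumption made to keep the example self-contained; the function names, the simulated data, and the constant effect of 2.0 are all hypothetical. Each fold's nuisance functions are estimated on the other folds, and the coefficient is the ratio of residual covariances.

```python
import random


def cell_means(rows, key):
    """Stand-in 'learner': mean of rows[key] within each covariate cell."""
    sums, counts = {}, {}
    for r in rows:
        sums[r["x"]] = sums.get(r["x"], 0.0) + r[key]
        counts[r["x"]] = counts.get(r["x"], 0) + 1
    return {x: sums[x] / counts[x] for x in sums}


def pliv_crossfit(rows, n_folds=5):
    """Cross-fitted partialling-out estimate of the PLIV coefficient on t."""
    num = den = 0.0
    for k in range(n_folds):
        train = [r for i, r in enumerate(rows) if i % n_folds != k]
        held_out = [r for i, r in enumerate(rows) if i % n_folds == k]
        # Nuisance functions E[Y|X], E[T|X], E[Z|X] fitted on the other folds.
        m_y, m_t, m_z = (cell_means(train, key) for key in ("y", "t", "z"))
        for r in held_out:
            z_res = r["z"] - m_z[r["x"]]
            num += (r["y"] - m_y[r["x"]]) * z_res
            den += (r["t"] - m_t[r["x"]]) * z_res
    return num / den


# Simulated data (assumed design): nonlinear E[Z|X], nonlinear outcome levels,
# constant effect of 2.0, and no outcome noise.
rng = random.Random(0)
P_Z = {0: 0.2, 1: 0.8, 2: 0.2}
MU0 = {0: 0.0, 1: -5.0, 2: 0.0}
rows = []
for _ in range(3000):
    x = rng.choice([0, 1, 2])
    z = 1 if rng.random() < P_Z[x] else 0
    u = rng.random()                       # type draw: complier / always / never
    t = z if u < 0.5 else (1 if u < 0.75 else 0)
    rows.append({"x": x, "z": z, "t": t, "y": MU0[x] + 2.0 * t})

beta = pliv_crossfit(rows)
print(round(beta, 6))
```

Because the simulated outcomes are noise-free and the effect is constant, the estimate recovers 2.0 up to floating-point error. With real data a flexible learner would replace `cell_means`, and sampling error would remain.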

Unconditional average causal response (ACR): When the instrument is binary, ACR = E[Y(T(1)) − Y(T(0)) | T(1) > T(0)], which reduces to the unconditional LATE when treatment is also binary. ACR differs from β_rich because β_rich places extra weight on covariate values with more instrument variation, while ACR weights compliers equally regardless of covariate-specific instrument variance. The paper documents that DDML estimates of β_acr can be roughly half the size of β_rich.
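The difference in weighting can be seen in a two-cell example with a binary instrument and binary treatment. The sketch assumes the standard representation in which β_rich weights each conditional LATE by P(X=x) × Var(Z|X=x) × complier share, while the unconditional LATE weights by P(X=x) × complier share; the cell values are hypothetical.

```python
# Hypothetical covariate cells:
# (P(X = x), P(Z = 1 | x), complier share at x, conditional LATE at x).
CELLS = [
    (0.5, 0.5, 0.5, 1.0),   # instrument varies a lot in this cell
    (0.5, 0.1, 0.5, 3.0),   # instrument varies little in this cell
]


def weighted_average(weights, values):
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


lates = [late for _, _, _, late in CELLS]

# beta_rich: extra weight on cells with more instrument variance p(1 - p).
w_rich = [px * pz * (1 - pz) * share for px, pz, share, _ in CELLS]
# Unconditional LATE: compliers weighted equally across cells.
w_late = [px * share for px, _, share, _ in CELLS]

beta_rich = weighted_average(w_rich, lates)
beta_acr = weighted_average(w_late, lates)

print(round(beta_rich, 4), round(beta_acr, 4))
```

Both quantities are non-negatively weighted averages of the same conditional LATEs, so both are weakly causal; they differ only because the first cell has more instrument variation and therefore more weight in β_rich.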

Saturate and weight (SW) specification: The TSLS specification proposed by Angrist and Pischke (2009, Theorem 4.5.1), in which both covariates and instrument-covariate interactions are fully saturated as excluded variables in the first stage. SW is guaranteed to satisfy rich covariates and, under weak monotonicity allowing direction to vary with covariates, produces a non-negatively weighted average of covariate-specific LATEs. It was used by only one paper (Chamberlain and Imbens 2004) among the 99 TSLS-with-covariates papers in the authors’ survey.

How this summary was made. Bibliographic fields are pulled from Crossref and OpenAlex and are not model-generated. The summary was drafted from the open-access manuscript, checked by a claim-grounding and calibration review pass, and approved before publishing. Corrections are welcome, especially from the authors.