Testing Mechanisms
What this paper finds — and why it matters
Kwon and Roth develop econometric tests for the “sharp null of full mediation”: the hypothesis that a treatment D affects an outcome Y only through a specified mechanism (or set of mechanisms) M, with no direct pathway. Rather than attempting the more demanding task of identifying average direct and indirect effects — which typically requires strong assumptions about how M is assigned — the paper asks whether full mediation is consistent with the data at all, and if not, how large the alternative mechanisms are.
The key theoretical observation is that under the sharp null of full mediation, together with independence of D and monotonicity of M in D, the treatment D satisfies the conditions for a valid instrumental variable for the local average treatment effect (LATE) of M on Y. This equivalence means that existing tools for testing IV validity with binary endogenous treatment can be applied off-the-shelf when both D and M are binary. The paper then extends this framework to the general case where M is a p-dimensional vector with finite support, and where the researcher can impose arbitrary restrictions on the distribution of compliance types θ_{lk} = P(M(0)=m_l, M(1)=m_k) — including monotonicity, relaxations allowing a bounded share of defiers, elementwise monotonicity for multidimensional M, or no restrictions.
The testable implications of the sharp null require that there exist type shares θ̃ in the identified set Θ_I such that sup_A Δ_k(A) ≤ Σ_{l≠k} θ̃_{lk} for all k, where Δ_k(A) is the treatment-control difference in the probability of the compound outcome {Y∈A, M=m_k}. The intuition is that any positive treatment effect on this compound outcome can only be driven by compliers switching into m_k, not by k-always-takers, who under the sharp null have both fixed M and fixed Y. Because Θ_I is characterized by linear constraints whenever R is, verifying the testable implications reduces to a linear program. The paper proves these implications are sharp: if they are satisfied, there exists a joint distribution of potential outcomes consistent with the data and the sharp null. The paper also derives sharp lower bounds on ν_k = P(Y(1,m_k) ≠ Y(0,m_k) | M(1)=M(0)=m_k), the fraction of k-always-takers whose outcome is affected despite having the same mediator value under both arms.
For inference, the testable implications are reformulated as moment inequalities, and the Cox and Shi (2022) test (CS) is recommended on the basis of Monte Carlo simulations calibrated to the empirical applications. These simulations find close-to-nominal size across nearly all designs (null rejection probability no larger than 9% for a 5% test). The exception is settings with only 40 clusters, where CS is over-sized with a null rejection probability of 0.15; approximate size control is restored with 80 clusters.
The methodology is illustrated in two RCT applications. In Bursztyn, González, and Yanagizawa-Drott (2020), an information treatment about other men’s beliefs is randomized in Saudi Arabia and the outcome is wives’ job applications. The sharp null that effects operate only through job-search service sign-up is rejected (p=0.02, CS test). The estimated lower bound implies that at least 11% of never-takers are affected despite no change in sign-up (for comparison, the overall ATE is 0.12), and the bound remains positive for defier shares up to 7%. In Baranov et al. (2020), cognitive behavioral therapy for new mothers is randomized and the outcome is financial empowerment at seven-year follow-up. The sharp null is rejected for grandmother presence alone (p=0.02, lower bound ≥19% of never-takers affected) and for relationship quality alone (p=0.03, lower bound ≥10% of always-takers affected). When both mechanisms are considered jointly, however, the sharp null cannot be rejected at conventional levels (p=0.65), indicating that the data are statistically consistent with the combination of these two mechanisms fully explaining the treatment effect.
Scope conditions: the main results assume D is randomly assigned (extended in Section 5 to IV, conditional unconfoundedness, and distributional difference-in-differences settings) and M has finite support. An R package, TestMechs, accompanies the paper.
Q1: What is the sharp null of full mediation and how does it differ from standard mediation analysis objectives? The sharp null posits that Y(d,m) depends only on m and not on d — that is, Y(0,m) = Y(1,m) almost surely for all m — meaning the treatment affects the outcome exclusively through its effect on M. Standard mediation analysis seeks to decompose the average treatment effect into average direct and indirect components, which requires identifying the causal effect of M on Y and thus typically imposes sequential unconfoundedness or an instrument for M. The sharp null test asks only whether any direct effect exists for any individual, which is answerable without identifying the causal effect of M on Y and therefore under substantially weaker assumptions.
Q2: What is the core identification insight connecting mediation testing to IV validity testing? Under the sharp null of full mediation, combined with independence of D and monotonicity of M(d) in d, the treatment D satisfies exactly the LATE assumptions as an instrument for the effect of M on Y. Consequently, testable implications of the LATE assumptions — developed in Kitagawa (2015), Huber and Mellace (2015), and Mourifié and Wan (2017) — translate directly into testable implications of the sharp null when both D and M are binary. This equivalence allows researchers to apply off-the-shelf IV validity tests for mechanism testing with no additional methodological development in the binary-binary case.
Q3: What are the sharp testable implications of the sharp null in the general multi-valued, multi-dimensional M case? The sharp testable implications require that there exists a vector of type shares θ̃ in the identified set Θ_I (consistent with observed marginal distributions of M|D and the researcher’s restrictions R) such that sup_A Δ_k(A) ≤ Σ_{l≠k} θ̃_{lk} for all k, where Δ_k(A) = P(Y∈A, M=m_k|D=1) − P(Y∈A, M=m_k|D=0). The intuition is that any positive treatment effect on the compound outcome 1{Y∈A, M=m_k} can only be driven by compliers transitioning into state k; always-takers have fixed M=m_k and under the sharp null also have fixed Y, so they contribute zero. The testable implications are proved to be sharp: if they hold, there exists a joint distribution of potential outcomes consistent with the data and the sharp null.
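The inequality can be traced to a decomposition of Δ_k(A) by compliance type. The following is a sketch in the notation above, assuming D is randomly assigned; the shorthand q_{lk}(A) and r_{kl}(A) for type-specific outcome probabilities is introduced here for illustration and is not the paper's notation.

```latex
\begin{align*}
P(Y \in A,\, M = m_k \mid D = 1) &= \sum_{l} \theta_{lk}\, q_{lk}(A),
  & q_{lk}(A) &= P\big(Y(1, m_k) \in A \mid M(0) = m_l,\, M(1) = m_k\big), \\
P(Y \in A,\, M = m_k \mid D = 0) &= \sum_{l} \theta_{kl}\, r_{kl}(A),
  & r_{kl}(A) &= P\big(Y(0, m_k) \in A \mid M(0) = m_k,\, M(1) = m_l\big).
\end{align*}
% Under the sharp null, Y(1, m_k) = Y(0, m_k), so q_{kk}(A) = r_{kk}(A)
% and the l = k (always-taker) terms cancel in the difference:
\begin{align*}
\Delta_k(A)
  = \sum_{l \neq k} \theta_{lk}\, q_{lk}(A) - \sum_{l \neq k} \theta_{kl}\, r_{kl}(A)
  \;\le\; \sum_{l \neq k} \theta_{lk}.
\end{align*}
```

The final step uses only q_{lk}(A) ≤ 1 and r_{kl}(A) ≥ 0, which is why the bound holds for every set A and therefore for the supremum.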
Q4: How does the paper quantify the magnitude of violation when the sharp null is rejected? The paper derives sharp lower bounds on ν_k = P(Y(1,m_k) ≠ Y(0,m_k) | M(1)=M(0)=m_k), the fraction of k-always-takers whose outcome is affected by the treatment despite having the same mediator value under both arms. The lower bound is θ_{kk}·ν_k ≥ (sup_A Δ_k(A) − Σ_{l≠k} θ_{lk})₊, which is sharp in the sense that there exists a distribution of potential outcomes achieving equality. Appendix B.1 additionally derives bounds on ADE_k = E[Y(1,m_k)−Y(0,m_k)|M(1)=M(0)=m_k], the average direct effect for k-always-takers.
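As an illustration of how the bound could be evaluated, here is a minimal sketch for binary M under monotonicity with a discrete outcome. It works with population cell probabilities and ignores sampling uncertainty. The function names and all numbers are hypothetical and are not taken from the paper or from the TestMechs package. For discrete Y, sup_A Δ_k(A) is attained by the set of outcome values where the treated cell probability exceeds the control one, so it equals the sum of positive pointwise differences.

```python
from fractions import Fraction as F


def sup_delta(p_treated, p_control):
    """sup_A Delta_k(A) for discrete Y: sum of the positive pointwise differences.

    Both arguments map each outcome value y to P(Y=y, M=m_k | D=d).
    """
    support = set(p_treated) | set(p_control)
    return sum(
        max(F(0), p_treated.get(y, F(0)) - p_control.get(y, F(0)))
        for y in support
    )


def nu_lower_bound(p_treated, p_control, inflow_share, stayer_share):
    """Lower bound on nu_k: (sup_A Delta_k(A) - sum_{l != k} theta_lk)_+ / theta_kk."""
    excess = max(F(0), sup_delta(p_treated, p_control) - inflow_share)
    return excess / stayer_share


# Hypothetical cell probabilities P(Y=y, M=m | D=d), indexed as cells[m][y].
control = {0: {0: F(60, 100), 1: F(10, 100)}, 1: {0: F(20, 100), 1: F(10, 100)}}
treated = {0: {0: F(40, 100), 1: F(20, 100)}, 1: {0: F(20, 100), 1: F(20, 100)}}

# Type shares for binary M under monotonicity (no defiers, theta_10 = 0).
theta_11 = sum(control[1].values())             # always-takers: P(M=1 | D=0)
theta_00 = sum(treated[0].values())             # never-takers:  P(M=0 | D=1)
theta_01 = sum(treated[1].values()) - theta_11  # compliers

# Never-takers (k=0): nobody flows into M=0 under monotonicity.
nu_never = nu_lower_bound(treated[0], control[0], F(0), theta_00)
# Always-takers (k=1): compliers flow into M=1.
nu_always = nu_lower_bound(treated[1], control[1], theta_01, theta_11)
print(nu_never, nu_always)
```

In this toy example the excess mass at M=0 cannot be attributed to compliers, so the never-taker bound is strictly positive, while the excess mass at M=1 is fully absorbed by the complier share and the always-taker bound is zero.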
Q5: How is inference conducted and which test is recommended? Because the test statistic involves the solution to a linear program whose constraints depend on the data, and sup_A Δ_k(A) can be non-differentiable in the data-generating process — making standard bootstrap methods invalid — the paper reformulates the testable implications as moment inequalities of the form H₀: ∃ω s.t. C₁ω − C₂p ≥ 0, where C₁ and C₂ are known matrices and p collects observable conditional probabilities. Methods from the moment inequality literature (Andrews, Roth, and Pakes, 2023; Cox and Shi, 2022; Fang, Santos, Shaikh, and Torgovitsky, 2023) are then directly applicable. Cox and Shi (2022) is recommended as a default based on Monte Carlo evidence.
Q6: What do the Monte Carlo simulations reveal about size and power? Across nearly all simulation designs calibrated to the two empirical applications, the ARP (Andrews, Roth, and Pakes, 2023), CS (Cox and Shi, 2022), and K (Kitagawa, 2015) tests achieve close-to-nominal size, with null rejection probabilities no larger than 9% for a nominal 5% test. The notable exception is settings with only 40 independent clusters, where CS is over-sized with a null rejection probability of 0.15; doubling to 80 clusters restores approximate size control. For power, CS performs similarly to or better than ARP across all designs, with a substantial advantage in some cases, particularly with multi-valued M. The FSST test (Fang, Santos, Shaikh, and Torgovitsky, 2023) can be substantially over-sized in settings with small or moderate numbers of clusters.
Q7: What does the Bursztyn et al. (2020) application find? The treatment is random assignment of information about other men’s beliefs about women working outside the home in Saudi Arabia; the mediator is job-search service sign-up (binary); the outcome is whether the wife applies for jobs three to five months later. The sharp null is rejected with p=0.02 (CS test), establishing that the information treatment affects long-run labor market outcomes through pathways other than mechanical service sign-up. The lower bound on the fraction of never-takers affected despite no change in sign-up is at least 11%; the estimated average direct effect for these never-takers ranges from 0.11 to 0.18, compared to an overall ATE of 0.12. The lower bound remains positive for defier shares up to 7% of the population (0.33 defiers per complier), providing robustness to violations of monotonicity.
Q8: What does the Baranov et al. (2020) application find? The treatment is cognitive behavioral therapy (CBT) for pregnant women and new mothers, assigned in an RCT; the outcome is an index of financial empowerment at seven-year follow-up. For the binary mechanism of grandmother presence in the household, the sharp null is rejected (CS p=0.02), with a lower bound of 19% on the fraction of never-takers affected. For relationship quality with the husband (1-5 scale, under the monotonicity assumption that CBT improves the relationship), the sharp null is rejected (CS p=0.03), with a pooled lower bound of 10% on the fraction of always-takers affected. When both mechanisms are considered jointly as a vector M, the sharp null cannot be rejected (CS p=0.65) and the lower bound on the fraction of always-takers affected is 7%, indicating that the data are statistically consistent with the combination of these two mechanisms fully explaining the CBT effect on financial empowerment.
Q9: How does the framework accommodate relaxations of monotonicity? The paper allows the researcher to specify an arbitrary closed, non-empty subset R of the simplex Δ as the restriction on type shares θ. Monotonicity in the binary case corresponds to R = {θ∈Δ: θ_{10}=0}, which rules out defiers. A relaxation allows up to a fraction d̄ of the population to be defiers (θ_{10} ≤ d̄). In the Bursztyn et al. (2020) application, the estimated lower bound on ν_k remains positive for d̄ up to 0.07. One can also remove monotonicity entirely by setting R = Δ, though this yields less informative bounds. For multidimensional M, elementwise monotonicity imposes that each dimension of M(d) is weakly increasing in d.
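To see how a defier allowance weakens the bound, consider the following sketch for never-takers with binary M. It is derived from the bound θ_{kk}·ν_k ≥ (sup_A Δ_k(A) − Σ_{l≠k} θ_{lk})₊ stated in Q4, and it assumes that the reported bound is the worst case over admissible defier shares and that P(M=1|D=1) ≥ P(M=1|D=0), so a zero defier share is admissible. With defier share d = θ_{10}, the inflow into M=0 is d and the never-taker share is P(M=0|D=1) − d, so the bound is weakly decreasing in d and the worst case sits at the largest admissible d. The function name and inputs are hypothetical.

```python
def never_taker_bound(sup_delta_0, p_m0_treated, p_m1_control, max_defier_share):
    """Worst-case lower bound on nu_0 when at most max_defier_share are defiers.

    sup_delta_0      : sup_A Delta_0(A), excess mass at M=0 under treatment
    p_m0_treated     : P(M=0 | D=1) = theta_00 + theta_10
    p_m1_control     : P(M=1 | D=0) = theta_11 + theta_10
    """
    # Largest defier share that keeps every type share non-negative.
    d = min(max_defier_share, p_m0_treated, p_m1_control)
    never_taker_share = p_m0_treated - d  # theta_00 at the worst-case d
    if never_taker_share <= 0:
        # No never-takers need exist, so nu_0 cannot be bounded away from zero.
        return 0.0
    return max(0.0, sup_delta_0 - d) / never_taker_share


# Hypothetical inputs, not taken from either application.
for dbar in (0.0, 0.05, 0.10, 0.15):
    print(dbar, never_taker_bound(0.10, 0.60, 0.30, dbar))
```

In this sketch the bound stays positive as long as the largest admissible defier share is below sup_A Δ_0(A), which is the kind of breakdown point the robustness statement above describes.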
Q10: How does the paper extend to non-experimental settings? Section 5 shows that the results extend whenever the distributions of (Y^tot(d), M(d)) are identified through strategies other than direct randomization of D, where Y^tot(d) = Y(d, M(d)) denotes the outcome under treatment d with the mediator at its induced value. Under a standard IV setup with a binary instrument Z for D, the LATEs of D on Y and of D on M are identified for instrument-compliers, and the same testable implications apply within this subpopulation. Under conditional unconfoundedness, D ⊥ (Y(·,·), M(·)) | X with overlap, the distributions are identified via propensity-score reweighting. Under distributional difference-in-differences (Athey and Imbens, 2006; Callaway and Li, 2019; Roth and Sant’Anna, 2023), counterfactual distributions of Y and M for treated units are identified, enabling the same testing approach.
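For the conditional unconfoundedness case, the cell probabilities that enter Δ_k(A) could be estimated by reweighting. The sketch below uses a normalized (Hajek-style) inverse-propensity-weighted estimator, which is a standard choice rather than necessarily the estimator used in the paper. It assumes the propensity score e = P(D=1|X) is already known or estimated elsewhere, that overlap holds, and that each arm is non-empty. The function name and data are hypothetical.

```python
def ipw_cell_probability(sample, arm, mediator_value, outcome_set):
    """Normalized IPW estimate of P(Y^tot(d) in A, M(d) = m) for arm d.

    sample: iterable of (y, m, d, e), where e = P(D=1 | X) for that unit.
    """
    numerator = 0.0
    denominator = 0.0
    for y, m, d, e in sample:
        if d != arm:
            continue
        # Treated units are weighted by 1/e, control units by 1/(1-e).
        weight = 1.0 / e if arm == 1 else 1.0 / (1.0 - e)
        denominator += weight
        if m == mediator_value and y in outcome_set:
            numerator += weight
    return numerator / denominator


# Hypothetical observations (y, m, d, propensity score).
sample = [
    (1, 0, 1, 0.50),
    (0, 0, 1, 0.50),
    (1, 1, 1, 0.25),
    (0, 0, 0, 0.50),
    (1, 0, 0, 0.75),
]
p_treated = ipw_cell_probability(sample, arm=1, mediator_value=0, outcome_set={1})
p_control = ipw_cell_probability(sample, arm=0, mediator_value=0, outcome_set={1})
print(p_treated - p_control)  # estimate of Delta_0({1}) in this toy sample
```

With a constant propensity score the weights cancel and the estimator reduces to the within-arm sample frequency, which matches the randomized case.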
Q11: What is the paper’s relationship to the principal stratification literature? The k-always-takers — those with M(1)=M(0)=m_k — correspond directly to principal strata (Frangakis and Rubin, 2002). The bounds on ADE_k derived in Appendix B.1 match those of Lee (2009), Flores and Flores-Lagunes (2010), and Zhang and Rubin (2003) in the special case of binary M under monotonicity, and extend them to non-binary M and relaxations of monotonicity. The primary focus of the present paper is the sharp (Fisherian) null that ν_k = 0 for all k — that is, no always-taker is affected — which is strictly stronger than the weak null of zero average direct effect studied in the principal stratification literature.
Q12: What are the limitations and directions for future work identified by the authors? The analysis is restricted to discrete M; while M can be discretized under assumptions described in Remark 3, testing the sharp null directly for continuous M remains an open question for future work. The framework does not impose restrictions on the magnitude of M’s effect on Y or on the degree of endogeneity of M, and incorporating such restrictions could yield sharper testable implications. Extension to non-binary treatments D is also identified as a direction for future research.
Sharp null of full mediation: The hypothesis that Y(0,m) = Y(1,m) almost surely for all m in the support of M — i.e., the treatment D affects the outcome Y exclusively through its effect on M, with no direct effect on any individual’s outcome. This is a Fisherian sharp null, strictly stronger than a zero average direct effect.
k-always-takers: Individuals for whom M(1)=M(0)=m_k — those whose mediator value equals m_k regardless of treatment assignment. Under the sharp null, these individuals’ outcomes must be unaffected by the treatment. They constitute the principal stratum with fixed mediator value m_k and generalize the always-taker and never-taker concepts from the binary LATE framework.
ν_k (fraction of always-takers affected): ν_k = P(Y(1,m_k) ≠ Y(0,m_k) | M(1)=M(0)=m_k), the fraction of k-always-takers whose outcome is affected by the treatment despite having the same mediator value under both arms. Under the sharp null ν_k = 0 for all k; a large ν_k indicates strong alternative mechanisms operating outside of M for always-takers with mediator value m_k.
Type shares θ_{lk}: The fractions of the population of each compliance type, θ_{lk} = P(M(0)=m_l, M(1)=m_k). These generalize the LATE compliance categories (always-takers, never-takers, compliers, defiers) to the multi-valued mediator setting. The vector θ may be only partially identified when M is non-binary, with the identified set Θ_I characterized by linear constraints matching observed marginal distributions of M|D.
Δ_k(A): The treatment-control difference in the probability of the compound outcome {Y∈A, M=m_k}: Δ_k(A) = P(Y∈A, M=m_k|D=1) − P(Y∈A, M=m_k|D=0). The supremum of Δ_k(A) over all sets A is the key estimable quantity that appears in both the testable implications and the lower bounds on ν_k.
Identified set Θ_I: The set of type-share vectors θ̃ consistent with the observed marginal distributions of M|D=0 and M|D=1, and with the researcher’s restrictions on compliance types R. When R is characterized by linear constraints (as in all main examples), Θ_I is a polytope and optimization over it — required for implementing the testable implications — is a linear program.
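Written out from the definitions above, and assuming D is randomly assigned, the constraints defining Θ_I are the requirements that the rows and columns of the type-share matrix reproduce the two observed mediator distributions. This is a sketch of the structure; the paper's own statement may be organized differently.

```latex
\Theta_I = \Big\{ \tilde\theta \in R \;:\;
  \sum_{k} \tilde\theta_{lk} = P(M = m_l \mid D = 0) \ \text{for all } l, \quad
  \sum_{l} \tilde\theta_{lk} = P(M = m_k \mid D = 1) \ \text{for all } k \Big\}
% Checking the testable implications is then a linear feasibility problem in \tilde\theta:
% find \tilde\theta \in \Theta_I such that
% \sum_{l \neq k} \tilde\theta_{lk} \ge \sup_A \Delta_k(A) for all k.
```

Because R is a subset of the simplex, non-negativity and the sum-to-one condition on θ̃ are already built in, and every remaining constraint is linear in θ̃ whenever R is.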
TestMechs R package: The accompanying software implementation of the inference methods and lower bound estimators developed in the paper, designed to facilitate empirical application of the tests.