Forthcoming [Review of Economic Studies] doi:10.1093/restud/rdaf087

The Surrogate Index: Combining Short-Term Proxies to Estimate Long-Term Treatment Effects More Rapidly and Precisely

Susan Athey

Raj Chetty

Guido W Imbens

Hyunseung Kang

What this paper finds — and why it matters

This paper addresses a fundamental challenge in program evaluation: primary outcomes of interest — such as lifetime earnings or long-term employment — are often observed only with lengthy delays, forcing researchers to rely on short-term outcomes when making timely policy decisions. The authors develop a formal framework for combining multiple short-term proxy outcomes (surrogates) into a single “surrogate index” that, under stated assumptions, identifies the average treatment effect on the long-run primary outcome.

The methodological contribution rests on three key assumptions. First, Unconfoundedness: treatment assignment in the experimental sample is ignorable conditional on pre-treatment variables. Second, Surrogacy (Prentice 1989): the long-term primary outcome is independent of the treatment conditional on the surrogates — formally, Wi ⊥⊥ Yi | Si, Xi, Pi=E — meaning the entire causal path from treatment to primary outcome runs through the surrogates. Third, Comparability: the conditional distribution of the primary outcome given surrogates and pre-treatment variables is identical across the experimental and observational samples. This last assumption is novel relative to the prior surrogacy literature, which implicitly relied on it without formal statement.

The paper operates with two distinct samples. The experimental sample contains treatment assignment and surrogate outcomes but not the long-term primary outcome. The observational sample contains surrogates and primary outcomes but not treatment assignment. The surrogate index is defined as the conditional expectation of the primary outcome given surrogates and pre-treatment variables estimated in the observational sample, µ(s,x,O) = E[Yi|Si=s, Xi=x, Pi=O]. Under all three assumptions, the average treatment effect on this index equals the average treatment effect on the primary outcome. Under a linear specification, the estimator reduces to multiplying the vector of treatment effects on surrogates (from the experimental sample) by the regression coefficients predicting the primary outcome from surrogates (from the observational sample).
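The linear case described above can be sketched in a few lines. This is a minimal illustration on simulated data, not the authors' code: the function name `linear_surrogate_index_ate`, the simulated coefficients, and the omission of pre-treatment covariates (reasonable only under simple randomization) are all assumptions made for the example.

```python
import numpy as np

def linear_surrogate_index_ate(S_exp, W_exp, S_obs, Y_obs):
    """Linear surrogate-index estimate of the long-run treatment effect.

    S_exp : (n_e, k) surrogates in the experimental sample
    W_exp : (n_e,)   binary treatment indicator in the experimental sample
    S_obs : (n_o, k) surrogates in the observational sample
    Y_obs : (n_o,)   long-run primary outcome in the observational sample
    """
    # Observational sample: regress Y on an intercept and the surrogates.
    Z_obs = np.column_stack([np.ones(len(S_obs)), S_obs])
    coef, *_ = np.linalg.lstsq(Z_obs, Y_obs, rcond=None)
    beta = coef[1:]  # slopes on the surrogates

    # Experimental sample: treated-minus-control difference in mean surrogates.
    delta = S_exp[W_exp == 1].mean(axis=0) - S_exp[W_exp == 0].mean(axis=0)

    # Estimated effect on the index: effects on surrogates times the slopes.
    return float(delta @ beta)

# Simulated data (hypothetical values chosen for illustration only).
rng = np.random.default_rng(0)
n_e, n_o = 4000, 8000
beta_true = np.array([2.0, -1.0])      # surrogate-to-outcome slopes
effect_on_S = np.array([0.5, 0.2])     # treatment effects on the two surrogates

W = rng.integers(0, 2, n_e)
S_e = rng.normal(size=(n_e, 2)) + W[:, None] * effect_on_S
S_o = rng.normal(size=(n_o, 2))
Y_o = 1.0 + S_o @ beta_true + rng.normal(size=n_o)

tau_hat = linear_surrogate_index_ate(S_e, W, S_o, Y_o)
print(f"estimated long-run effect: {tau_hat:.3f}")
```

In this simulation the implied long-run effect is 0.5 × 2.0 + 0.2 × (−1.0) = 0.8, so the printed estimate should land near that value, up to sampling noise.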

The paper derives semiparametric efficiency bounds showing that exploiting the Surrogacy assumption, by replacing the realized outcome Yi with the predicted surrogate index µ(Si,Xi,O), yields a strictly lower variance bound than a standard randomized experiment that observes the primary outcome directly. The precision gain is an inverse-propensity-weighted expectation of the squared residual Yi − µ(Si,Xi,O), that is, of the outcome variation the surrogates do not predict.
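A small simulation can make the source of the precision gain concrete. The sketch below is illustrative only: it assumes a known assignment probability of 0.5, no pre-treatment covariates, Surrogacy holding by construction, and a surrogate index that is known exactly (the limiting case of a very large observational sample). All numerical values are made up.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20_000, 0.5                      # sample size, known assignment probability
W = (rng.random(n) < p).astype(int)
S = 0.5 * W + rng.normal(size=n)        # treatment shifts the surrogate
index = 2.0 * S                         # surrogate index, treated as known
Y = index + 2.0 * rng.normal(size=n)    # W affects Y only through S

def diff_in_means_var(v, w):
    """Estimated sampling variance of the treated-minus-control mean difference."""
    v1, v0 = v[w == 1], v[w == 0]
    return v1.var(ddof=1) / len(v1) + v0.var(ddof=1) / len(v0)

var_direct = diff_in_means_var(Y, W)      # uses the realized long-run outcome
var_index = diff_in_means_var(index, W)   # uses the surrogate index instead

# Sample analog of the weighted squared-residual expression for the gain.
resid = Y - index
gain_formula = np.mean((1 - W) * resid**2 / (1 - p) ** 2 + W * resid**2 / p**2)
gain_observed = n * (var_direct - var_index)

print(f"variance (direct outcome): {var_direct:.6f}")
print(f"variance (surrogate index): {var_index:.6f}")
print(f"scaled gain: observed {gain_observed:.2f}, formula {gain_formula:.2f}")
```

Because the residual here has variance 4 and the assignment probability is 0.5, both versions of the scaled gain should be close to 16 in this setup.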

The authors also characterize bias when Surrogacy or Comparability fail. Crucially, even without these assumptions, the estimators consistently estimate a well-defined causal quantity — the average treatment effect on the surrogate index — providing a principled aggregation of intermediate outcomes. Formal bounds on the extent of bias are derived; without bounded outcomes, these bounds are uninformative, but with binary outcomes or bounded violations, sharp intervals are available.

The empirical application uses the Greater Avenues to Independence (GAIN) job training program, a randomized trial in California. The experimental sample is Riverside (N_{E,T} = 4,405 treated, N_{E,C} = 1,040 control), with 36 quarters of post-assignment outcomes. The observational sample pools three other counties (Alameda, Los Angeles, San Diego; N_O = 13,725). Long-run benchmarks are a 6.4 percentage point (s.e. 1.2 pp) increase in mean quarterly employment rates and a $249 (s.e. $83) increase in mean quarterly earnings, each averaged over 36 quarters. All three surrogate-based estimators (surrogate index, surrogate score, influence function) fall within two standard errors of these benchmarks when surrogates include as few as 5 quarters of employment, earnings, and aid outcomes. By 6 quarters, the surrogate index estimate for employment is 0.061 (s.e. 0.006) versus the 0.064 benchmark. The “naive” estimator, which simply uses the treatment effect on short-run outcomes directly, requires more than 25 quarters before falling within two standard errors of the benchmark. The surrogate index achieves a 35% reduction in standard errors relative to waiting to observe the 9-year outcome directly.

Q: What is the surrogate index, precisely? A: The surrogate index is the conditional expectation of the primary outcome given surrogate outcomes and pre-treatment variables, estimated in the observational sample: µ(s,x,O) = E[Yi | Si=s, Xi=x, Pi=O]. It aggregates multiple short-term proxy variables into a scalar index, namely their predicted value of the long-run outcome. Under Unconfoundedness, Surrogacy, and Comparability together, the average treatment effect on this index equals the average treatment effect on the primary outcome.

Q: What is the Prentice Surrogacy assumption, and why is it demanding? A: Surrogacy requires Wi ⊥⊥ Yi | Si, Xi, Pi=E — the long-run outcome is independent of the treatment conditional on the surrogates and pre-treatment variables. This means the surrogates must fully capture all causal pathways from treatment to outcome; any direct effect of the treatment on the primary outcome that does not pass through the measured surrogates violates the assumption. The authors note this is not testable in the two-sample setup because Yi and Wi are never jointly observed.

Q: What is the Comparability assumption, and why is it novel? A: Comparability requires Pi ⊥⊥ Yi | Si, Xi — the distribution of primary outcomes given surrogates and pre-treatment variables is identical across the experimental and observational samples. It formalizes the implicit condition under which the observational sample can be used to estimate the surrogate-to-outcome relationship that is then applied to the experimental sample. The authors state this assumption was not previously articulated in the surrogacy literature despite being implicitly relied upon.

Q: How does the paper handle violations of Surrogacy and Comparability? A: Theorem 4 shows that even without Surrogacy or Comparability (but maintaining Unconfoundedness), the estimators converge to a valid causal quantity: E[µ(Si(1),Xi,O) − µ(Si(0),Xi,O) | Pi=E], the average treatment effect on the surrogate index. The surrogacy-bias equals E[(µ(Si,1,Xi,E) − µ(Si,0,Xi,E)) · ρ(Si,Xi)(1−ρ(Si,Xi)) / (ρ(Xi)(1−ρ(Xi))) | Pi=E], which is small when the treatment explains little variation in Yi conditional on surrogates, or when the surrogate score is near zero or one. The comparability-bias depends on the product of the cross-sample discrepancy in the surrogate index and the deviation of the surrogate score from the propensity score.
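The weight that multiplies the direct effect in the surrogacy-bias expression can be tabulated directly. The helper below is a hypothetical illustration (the name `surrogacy_bias_weight` is not from the paper); it simply evaluates ρ(s,x)(1−ρ(s,x)) / (ρ(x)(1−ρ(x))) at a few values.

```python
def surrogacy_bias_weight(surrogate_score, propensity_score):
    """Weight rho(s,x)(1 - rho(s,x)) / (rho(x)(1 - rho(x))) from the bias formula."""
    numerator = surrogate_score * (1.0 - surrogate_score)
    denominator = propensity_score * (1.0 - propensity_score)
    return numerator / denominator

# With a propensity score of 0.5, see how the weight changes as the
# surrogates become more informative about treatment status.
for rho_s in (0.5, 0.7, 0.9, 0.99):
    print(rho_s, round(surrogacy_bias_weight(rho_s, 0.5), 4))
```

With a propensity score of 0.5 the weight is 1 when the surrogate score is also 0.5, falls to 0.36 at a surrogate score of 0.9, and is about 0.04 at 0.99, which is the sense in which the bias shrinks as the surrogate score approaches zero or one.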

Q: What are the efficiency gains from using surrogates? A: Theorem 2(ii) shows that in the limit as the observational sample grows large relative to the experimental sample, the efficiency bound using surrogates is strictly smaller than the Hahn (1998) bound for a direct randomized experiment. The gain equals E[(1−Wi)(Yi−µ(Si,Xi,O))²/(1−ρ(Xi))² + Wi(Yi−µ(Si,Xi,O))²/ρ(Xi)² | Pi=E], an inverse-propensity-weighted expectation of the squared residual from predicting Yi with the surrogate index. Theorem 3 also characterizes the efficiency gain within a single sample from imposing the Surrogacy assumption itself, which equals E[σ²(Si,Xi,E) · ρ(Si,Xi)(1−ρ(Si,Xi)) / (ρ(Xi)²(1−ρ(Xi))²)].

Q: Why do multiple surrogates improve on a single surrogate? A: Multiple surrogates make the Surrogacy assumption more plausible, analogously to how multiple pre-treatment covariates make Unconfoundedness more plausible. If a treatment affects the primary outcome through several distinct causal channels (e.g., math skills, language skills, social skills), any single surrogate capturing only one channel leaves remaining pathways uncontrolled, producing bias. With multiple noisy measures of underlying mediators, even if no single observable fully satisfies Surrogacy, their combination removes more bias than any individual measure. The authors also illustrate via Figure 1.D that multiple surrogates reduce the “teaching to the test” problem, where improving a single measured surrogate does not translate to improvements in the primary outcome.

Q: What is the double matching estimator? A: For a treated unit i with covariates Xi and surrogates Si, the estimator first finds a control match j in the experimental sample based on Xi alone (so Xj ≈ Xi). It then finds, for each of units i and j, the nearest neighbor in the observational sample using both Xi and Si jointly, yielding observed outcomes Yi′ and Yj′. The estimated individual treatment effect is Yi′ − Yj′, and the estimator averages these across the experimental sample. This mirrors standard matching under unconfoundedness but requires two layers of matching: within the experimental sample on pre-treatment variables, and into the observational sample on both pre-treatment variables and surrogates.
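The two layers of matching can be sketched as follows. This is a simplified illustration on simulated data, not the paper's implementation: it uses single nearest-neighbor matching with Euclidean distance, matches with replacement, and averages over treated units only, all of which are assumptions made here. The function names and data-generating values are hypothetical.

```python
import numpy as np

def nearest(query, pool):
    """Index of the row of `pool` closest to `query` in Euclidean distance."""
    return int(np.argmin(((pool - query) ** 2).sum(axis=1)))

def double_matching_att(X_e, S_e, W_e, X_o, S_o, Y_o):
    """Average of matched outcome differences over treated experimental units."""
    XS_o = np.column_stack([X_o, S_o])
    controls = np.flatnonzero(W_e == 0)
    X_controls = X_e[controls]
    effects = []
    for i in np.flatnonzero(W_e == 1):
        # Layer 1: control match j inside the experimental sample, on X only.
        j = controls[nearest(X_e[i], X_controls)]
        # Layer 2: match i and j into the observational sample on (X, S).
        i_obs = nearest(np.concatenate([X_e[i], S_e[i]]), XS_o)
        j_obs = nearest(np.concatenate([X_e[j], S_e[j]]), XS_o)
        effects.append(Y_o[i_obs] - Y_o[j_obs])
    return float(np.mean(effects))

# Simulated data: the treatment raises the surrogate by 1, and the long-run
# outcome is 2 * S + X plus noise, so the implied long-run effect is 2.
rng = np.random.default_rng(3)
n_e, n_o = 600, 3000
X_e = rng.normal(size=(n_e, 1))
W_e = rng.integers(0, 2, n_e)
S_e = X_e + W_e[:, None] * 1.0 + 0.5 * rng.normal(size=(n_e, 1))

X_o = rng.normal(size=(n_o, 1))
S_o = X_o + rng.uniform(-0.5, 1.5, size=(n_o, 1)) + 0.5 * rng.normal(size=(n_o, 1))
Y_o = 2.0 * S_o[:, 0] + X_o[:, 0] + 0.1 * rng.normal(size=n_o)

att_hat = double_matching_att(X_e, S_e, W_e, X_o, S_o, Y_o)
print(f"double-matching estimate: {att_hat:.3f}")
```

The observational sample is simulated so that its surrogate values overlap those of both treated and control units; without that overlap the second layer of matching would pair units with poor neighbors.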

Q: What do the GAIN empirical results show quantitatively? A: The experimental benchmark for Riverside is a 6.4 pp (s.e. 1.2 pp) increase in mean quarterly employment and a $249 (s.e. $83) increase in mean quarterly earnings, each averaged over 36 quarters. The surrogate index estimator using 6 quarters yields estimates of 0.061 (s.e. 0.006) for employment and $238.8 (s.e. $31.5) for earnings — both within one standard error of the benchmark. All three surrogate-based estimators are within two standard errors of the benchmark at 5 quarters. The naive estimator (direct short-run effect) requires more than 25 quarters to come within two standard errors. The surrogate approach achieves a 35% reduction in standard errors relative to waiting for 9-year outcomes.
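The "within one standard error" comparison for the 6-quarter estimates can be checked with the figures quoted above. The helper `within_k_se` is a hypothetical name, and measuring distance in units of the benchmark's standard error (rather than a combined standard error) is an assumption of this sketch.

```python
def within_k_se(estimate, benchmark, benchmark_se, k):
    """True if the estimate lies within k benchmark standard errors of the benchmark."""
    return abs(estimate - benchmark) <= k * benchmark_se

# Employment: 0.061 versus the 0.064 benchmark (s.e. 0.012, i.e. 1.2 pp).
emp_ok = within_k_se(0.061, 0.064, 0.012, k=1)
# Earnings: $238.8 versus the $249 benchmark (s.e. $83).
earn_ok = within_k_se(238.8, 249.0, 83.0, k=1)

print(emp_ok, earn_ok)
```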

Q: How do the authors validate the Surrogacy and Comparability assumptions empirically? A: To test Surrogacy, they regress the primary outcome on pre-treatment variables, surrogates up to quarter t, and the treatment indicator in the Riverside experimental sample, where the long-run outcome is in fact observed; a statistically significant treatment coefficient indicates a violation. Point estimates are large and significant for t ≤ 3 quarters; for t ≥ 4 most t-statistics fall below 2, though some remain slightly above 2 with small coefficient magnitudes. To test Comparability, they pool the experimental and observational samples and include an indicator for the experimental sample; a significant coefficient on this indicator signals that the surrogate-to-outcome relationship differs across samples. The coefficient on this indicator remains statistically significant even with many surrogate periods, which leaves some residual concern about Comparability.
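The Surrogacy check described above amounts to an ordinary regression with the treatment indicator added. A minimal sketch on simulated data follows; the function names are hypothetical, and the classical homoskedastic standard errors used here are an assumption, since the summary does not say which standard errors the authors report.

```python
import numpy as np

def ols_t_stats(Z, y):
    """OLS coefficients and classical (homoskedastic) t-statistics."""
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ coef
    sigma2 = resid @ resid / (len(y) - Z.shape[1])
    cov = sigma2 * np.linalg.inv(Z.T @ Z)
    return coef, coef / np.sqrt(np.diag(cov))

def surrogacy_test(Y, W, S, X):
    """Coefficient and t-statistic on W after controlling for X and S."""
    Z = np.column_stack([np.ones(len(Y)), X, S, W])
    coef, t = ols_t_stats(Z, Y)
    return coef[-1], t[-1]

# Simulated experimental sample in which the long-run outcome is observed.
rng = np.random.default_rng(2)
n = 2000
X = rng.normal(size=(n, 1))
W = rng.integers(0, 2, n)
S = X[:, 0] + W + rng.normal(size=n)

Y_ok = X[:, 0] + 2.0 * S + rng.normal(size=n)   # W affects Y only through S
Y_bad = Y_ok + 1.0 * W                          # adds a direct effect of W

coef_ok, t_ok = surrogacy_test(Y_ok, W, S, X)
coef_bad, t_bad = surrogacy_test(Y_bad, W, S, X)
print(f"Surrogacy holds:  coef {coef_ok:.3f}, t {t_ok:.2f}")
print(f"Surrogacy fails:  coef {coef_bad:.3f}, t {t_bad:.2f}")
```

A small t-statistic in the first case does not establish Surrogacy; it only fails to reject it, which is why the summary's discussion of coefficient magnitudes matters alongside significance.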

Q: How does the paper relate Surrogacy to the mediation and instrumental variables literatures? A: In mediation, all three variables (treatment, mediator, outcome) are observed in the same sample, and the goal is to decompose the total effect into direct and indirect components; Surrogacy corresponds to the case where the direct effect is zero by assumption. In the IV framework, the surrogate corresponds to the endogenous treatment, but an unobserved confounder between surrogate and outcome violates Surrogacy. The IV exclusion restriction (no direct effect of the instrument on the outcome) is the analog of Surrogacy’s requirement of no direct treatment effect on the primary outcome. The paper formalizes these analogies through directed acyclic graphs.

Q: What is the missing data interpretation of the key assumptions? A: The joint conditional independence Yi ⊥⊥ (Wi, Pi) | Si, Xi implies both Surrogacy and Comparability simultaneously. This is closely related to the Missing at Random (MAR) assumption: the missingness of Yi in the experimental sample and of Wi in the observational sample is determined entirely by the observed surrogates and pre-treatment variables. This “data fusion” interpretation allows insights from the missing data literature, including semiparametric efficiency results, to apply directly.
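Reading the joint condition as independence of Yi from the pair (Wi, Pi) given (Si, Xi), which is an interpretation of the notation above rather than a quotation from the paper, both assumptions follow from standard properties of conditional independence (decomposition and weak union):

```latex
\begin{align*}
Y_i \perp\!\!\!\perp (W_i, P_i) \mid S_i, X_i
  &\;\Longrightarrow\; Y_i \perp\!\!\!\perp P_i \mid S_i, X_i
  && \text{(decomposition; this is Comparability)} \\
Y_i \perp\!\!\!\perp (W_i, P_i) \mid S_i, X_i
  &\;\Longrightarrow\; Y_i \perp\!\!\!\perp W_i \mid S_i, X_i, P_i
  && \text{(weak union)} \\
  &\;\Longrightarrow\; Y_i \perp\!\!\!\perp W_i \mid S_i, X_i, P_i = E
  && \text{(restricting to the experimental sample; this is Surrogacy)}
\end{align*}
```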

Q: What is the proposed strategy for building credibility across studies? A: The authors advocate constructing a “library” of surrogate indices by systematically cataloging, across multiple studies in a given domain, the smallest set of surrogates that reliably matches long-run treatment effects. If six quarters of employment and earnings data are established across multiple job training programs to predict 9-year impacts — as the cross-site GAIN comparisons suggest — then future job training evaluations could credibly report long-run impact estimates after only six quarters. The empirical application is presented as one element of such a library.

Surrogate Index: The conditional expectation of the primary outcome given surrogate outcomes and pre-treatment variables, estimated in the observational sample — µ(s,x,O) = E[Yi|Si=s, Xi=x, Pi=O]. It aggregates multiple short-term proxy variables into a scalar that, under Surrogacy and Comparability, identifies the average treatment effect on the long-run outcome.

Prentice Surrogacy Assumption: The condition Wi ⊥⊥ Yi | Si, Xi, Pi=E — the long-run primary outcome is independent of the treatment conditional on the surrogates and pre-treatment variables. Operationally, this requires that all causal pathways from treatment to primary outcome pass through the measured surrogates, with no direct effect remaining.

Comparability Assumption: Pi ⊥⊥ Yi | Si, Xi — the conditional distribution of the primary outcome given surrogates and pre-treatment variables is identical in the experimental and observational samples. This formalizes the condition under which the observational sample’s surrogate-to-outcome relationship can be transported to the experimental sample.

Surrogate Score: The conditional probability of treatment given surrogates and pre-treatment variables in the experimental sample, ρ(s,x) = Pr(Wi=1|Si=s, Xi=x, Pi=E). It plays a role in the surrogate framework analogous to that of the propensity score under unconfoundedness: if Surrogacy holds conditional on (Si,Xi), it also holds conditional on the surrogate score alone.

Sampling Score: The conditional probability of belonging to the experimental sample given surrogates and pre-treatment variables, φ(s,x) = Pr(Pi=E|Si=s, Xi=x). It appears in the surrogate score estimator and the influence function, where it reweights observations from the observational sample toward the experimental sample distribution.

Double Robustness: The influence function estimator remains consistent if either (a) the conditional outcome models µ(s,x,O) and µ(w,x) are correctly specified regardless of the score models, or (b) the surrogate score ρ(s,x), propensity score ρ(x), and sampling score φ(s,x) are correctly specified regardless of the outcome models.

Surrogacy Bias: The bias arising when Surrogacy fails while Comparability holds, equal to E[(µ(Si,1,Xi,E) − µ(Si,0,Xi,E)) · ρ(Si,Xi)(1−ρ(Si,Xi)) / (ρ(Xi)(1−ρ(Xi))) | Pi=E]. It is driven by the product of the direct treatment effect on the outcome (conditional on surrogates) and a weight measuring how much variation in treatment assignment remains after conditioning on the surrogates; the weight shrinks toward zero as the surrogate score approaches zero or one.

How this summary was made. Bibliographic fields are pulled from Crossref and OpenAlex and are not model-generated. The summary was drafted from the open-access manuscript, checked by a claim-grounding and calibration review pass, and approved before publishing.