C01 | Macro Paper Warehouse

Selection in Surveys: Using Randomized Incentives to Detect and Account for Nonresponse Bias

This paper addresses nonresponse bias in surveys — the distortion that arises when survey participants differ systematically from nonparticipants in ways that correlate with the survey’s outcomes of interest. The authors develop and apply methods to detect and correct for nonresponse bias using randomized financial incentives embedded in the survey design itself.

The empirical application is the “Norge i Koronatid” (NiK) survey, conducted by Statistics Norway in April–May 2020 to study the immediate labor market consequences of Norway’s COVID-19 lockdown. The NiK survey has two features that make it unusually well-suited for studying nonresponse bias: (1) it is linked to full-population administrative data, providing a verifiable ground truth for the entire Norwegian adult population; and (2) survey invitees were randomly assigned to one of five financial incentive levels (0%, 1%, 5%, 7%, or 10% probability of receiving a 1,000 NOK prepaid card), generating exogenous variation in participation rates. The final sample of 10,000 randomly drawn adults achieved a 47.4% participation rate.

The administrative data reveal large, statistically significant nonresponse bias across all six labor market outcomes examined. Participants in the high-incentive arm had on average roughly 930 USD (30%) higher monthly pre-lockdown earnings than the full population, and were 10.8 percentage points (19%) more likely to be employed. Standard corrections for selection on observable characteristics — including propensity-score reweighting on age, gender, immigration status, schooling, and municipality-level variables — fail to eliminate this bias. For the high-incentive arm, reweighting on individual characteristics more than doubles the nonresponse bias for earnings loss and employment loss measures relative to unweighted estimates, meaning that observable-based corrections can make things worse, not better.

A key finding is that higher participation rates do not imply lower nonresponse bias. The high-incentive arm, with the highest response rate, exhibited larger nonresponse bias than the no-incentive arm. Marginal participants — those induced to respond by higher incentives — had much stronger pre-lockdown labor market attachment (average earnings of 6,806 USD/month vs. 3,666 USD/month for inframarginal participants) but suffered substantially greater lockdown impacts: 32.3% became furloughed or unemployed versus only 3.4% of inframarginal participants.

Existing methods designed to handle selection on unobservables also perform poorly. Worst-case (Manski) bounds contain the truth but are very wide: employment before lockdown is bounded between 30% and 83% against a true value of 57%. Monotone response selection assumptions produce bounds that do not contain the population quantities for any of the six outcomes, because the marginal survey response function is empirically non-monotone. A Heckman parametric selection model produces point estimates inconsistent with the ground truth (e.g., estimating 51% pre-lockdown employment against the true 57%).

Investigation of participation timing reveals that reminder emails attract a qualitatively different type of respondent than incentives do. This motivates the paper’s central methodological contribution: a two-dimensional participation model that distinguishes “active” nonparticipants (those who received the invitation and chose not to respond because the incentive was insufficient) from “passive” nonparticipants (those who never received or attended to the invitation but who may respond to reminders). These two groups have labor market outcomes that differ from participants in opposite directions, which is why single-dimensional monotone selection models fail. The two-dimensional model, exploiting both incentive randomization and the timing of responses, produces bounds that contain or are closer to the ground truth than all other methods examined — for example, bounding pre-lockdown employment at [48%, 63%] around the true value of 57%.

The paper is scoped to a high-quality, randomly sampled, administrative-data-linked survey conducted during a period of acute economic disruption. The authors note the patterns observed may differ outside crisis periods, though the methods developed apply generally.

Q: How prevalent is nonresponse bias discussion in economics research, and what methods do researchers currently use? A: A systematic review of survey-based papers in top-five economics journals from January 2015 to August 2020 found that nearly half of studies omit any discussion of nonresponse bias despite often high nonresponse rates. Among studies using researcher-collected survey data, the average nonresponse rate is 50%; rates reach as high as 87%. When researchers do address nonresponse, 47% of own-survey papers compare sample means to a reference population and 16% apply reweighting on observables; virtually none use methods that address selection on unobservables.

Q: How was the NiK survey designed to enable testing for nonresponse bias? A: The 10,000-person random sample was assigned to five incentive groups with probabilities of receiving a 1,000 NOK prepaid card set at 0%, 1%, 5%, 7%, and 10%, yielding expected payoffs in the four positive-incentive groups ranging from 1.1 USD to 11 USD. Because group assignment was random, the groups are probabilistically identical ex ante, so differences in average responses across groups provide a direct test for nonresponse bias, given an exclusion restriction that incentives do not directly affect answers. Participation rates across the aggregated no/low/high incentive groups were 45.7%, approximately 47.6%, and approximately 51.7%, respectively; the joint test of equal participation across groups rejects equality with a p-value below 0.01.
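
The across-group comparison can be sketched as a two-sample test of equal proportions for a yes/no survey answer. This is a minimal illustration, not the authors' procedure: the function name `two_proportion_z` and the counts are hypothetical, and the paper reports joint tests across several groups and outcomes rather than a single pairwise z-test.

```python
import math

def two_proportion_z(yes_a, n_a, yes_b, n_b):
    """Two-sample z-test of H0: both groups share the same proportion of 'yes'."""
    p_a, p_b = yes_a / n_a, yes_b / n_b
    pooled = (yes_a + yes_b) / (n_a + n_b)          # common proportion under H0
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))      # two-sided normal p-value
    return p_a - p_b, z, p_value

# Hypothetical counts: participants answering "yes" in two incentive groups.
diff, z, p_value = two_proportion_z(yes_a=104, n_a=1000, yes_b=75, n_b=1000)
print(f"difference = {diff:.3f}, z = {z:.2f}, p = {p_value:.3f}")
```

Under random assignment and the exclusion restriction, a small p-value here indicates that the people drawn in by the incentive answer differently from those who respond without it, which is the signature of nonresponse bias.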

Q: How large is nonresponse bias in the NiK survey as measured against the administrative ground truth? A: Across all six administrative outcomes and all three incentive arms, joint tests of no nonresponse bias are rejected with p-values < 0.01. High-incentive arm participants had pre-lockdown monthly earnings roughly 930 USD (30%) above the population mean, and were 10.8 percentage points (19%) more likely to be employed. The high-incentive arm’s estimated post-lockdown employment rate of 58% overstates the true rate by 8 percentage points; a researcher comparing this to the true pre-lockdown rate of 57% would erroneously conclude employment was essentially unchanged, when in fact it dropped 7 percentage points.

Q: Does correcting for observable characteristics remove nonresponse bias? A: No. After reweighting by propensity scores constructed from age, gender, immigration status, schooling, and municipality or individual-level characteristics, joint tests of zero remaining nonresponse bias are rejected with p-values < 0.01 for each specification and incentive arm. In some cases, reweighting on individual characteristics more than doubles the nonresponse bias — for example, for earnings loss and employment loss measures in the high-incentive arm — meaning that standard observable-based corrections can amplify rather than reduce bias. Robustness checks using machine learning algorithms, class weights, imputation, and richer covariate sets including lagged outcomes yield the same conclusion.
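
A toy calculation can show how reweighting on an observable may enlarge bias when selection within cells runs on unobservables. It is a sketch under stated assumptions: post-stratification on a single discrete covariate stands in for the paper's propensity-score reweighting, and every number below is made up for illustration.

```python
# Hypothetical cells of one observable (say, schooling). All numbers are invented.
#   pop_share:  share of the population in the cell
#   part_share: share of survey participants in the cell
#   pop_mean:   true employment rate in the cell (needs ground-truth data)
#   part_mean:  employment rate among participants in the cell
cells = [
    {"name": "low",  "pop_share": 0.5, "part_share": 0.7, "pop_mean": 0.4, "part_mean": 0.6},
    {"name": "high", "pop_share": 0.5, "part_share": 0.3, "pop_mean": 0.8, "part_mean": 0.8},
]

truth = sum(c["pop_share"] * c["pop_mean"] for c in cells)
unweighted = sum(c["part_share"] * c["part_mean"] for c in cells)
# Weighting each participant by pop_share / part_share is equivalent to
# averaging the participant cell means with population shares.
reweighted = sum(c["pop_share"] * c["part_mean"] for c in cells)

bias_unweighted = unweighted - truth
bias_reweighted = reweighted - truth
print(f"bias unweighted = {bias_unweighted:.2f}, bias reweighted = {bias_reweighted:.2f}")
```

In this construction the overrepresentation of the low cell happens to offset the positive selection inside it. Reweighting removes the offsetting composition error but leaves the within-cell selection untouched, so the bias grows.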

Q: Does nonresponse bias in survey responses (not just administrative outcomes) differ across incentive arms? A: Yes. For survey-elicited outcomes, average responses differ significantly across incentive arms, with all joint equality tests rejected at p < 0.1. For example, 10.4% of high-incentive participants reported applying for UI benefits versus 7.5% in the no-incentive group. Estimated UI expenditure as a share of Norway’s 2020 social insurance budget varies from 13.2% (no-incentive arm) to 18.4% (high-incentive arm), illustrating the policy stakes.

Q: Do higher response rates reduce nonresponse bias? A: Not in this survey. The no-incentive arm, with the lowest participation rate (45.7%), exhibits smaller nonresponse bias than the high-incentive arm (51.7% participation). This finding contradicts standard guidance from the U.S. Office of Management and Budget and J-PAL research guidelines, which equate higher response rates with lower bias risk. The authors note that J-PAL has subsequently updated its guidance in response to this paper’s findings.

Q: How do marginal participants (induced by higher incentives) differ from inframarginal participants? A: Marginal participants — those who participate only under high incentives but not without them — had average pre-lockdown monthly earnings of 6,806 USD versus 3,666 USD for inframarginal participants (p-value 0.08), indicating much stronger pre-lockdown labor market attachment. Post-lockdown, both groups had similar earnings (approximately 3,600–3,800 USD/month). Consistent with this, 32.3% of marginal participants became furloughed or unemployed after the lockdown versus 3.4% of inframarginal participants. Notably, marginal and inframarginal participants do not differ significantly on observable background characteristics (age, gender, immigrant status, schooling; joint test p-value 0.70), confirming that selection is on unobservables.

Q: Why do existing methods designed to handle selection on unobservables fail? A: Worst-case (Manski) bounds contain the truth but are too wide to be informative — pre-lockdown employment is bounded at [30%, 83%] against a true value of 57%. Adding randomized incentives as instruments tightens bounds only modestly (8.5% width reduction for employment before lockdown). Monotone response selection assumptions fail because the empirically estimated marginal survey response function is non-monotone: for employment, the probability first decreases and then increases as a function of willingness-to-participate. The Heckman parametric selection model gives point estimates inconsistent with the ground truth for most outcomes (e.g., 51% estimated pre-lockdown employment vs. 57% true).
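
The width of the worst-case bounds follows mechanically from the nonparticipation rate, as this short sketch suggests. The function name `worst_case_bounds` is hypothetical, and the participant mean of 0.633 is an illustrative value picked so that the output roughly matches the bounds quoted above. It is not a figure reported in this summary.

```python
def worst_case_bounds(participant_mean, participation_rate, y_min=0.0, y_max=1.0):
    """Worst-case bounds on a population mean when nonparticipants' outcomes
    are only known to lie in [y_min, y_max]."""
    observed_part = participation_rate * participant_mean
    missing_share = 1.0 - participation_rate
    lower = observed_part + missing_share * y_min   # all nonparticipants at y_min
    upper = observed_part + missing_share * y_max   # all nonparticipants at y_max
    return lower, upper

# 0.474 is the overall participation rate; 0.633 is an assumed participant mean.
lower, upper = worst_case_bounds(participant_mean=0.633, participation_rate=0.474)
print(f"bounds = [{lower:.2f}, {upper:.2f}], width = {upper - lower:.3f}")
```

For a binary outcome the width always equals one minus the participation rate, so no amount of data on participants alone can narrow it.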

Q: What motivates the two-dimensional participation model? A: Analysis of participation timing shows that reminder emails attract a qualitatively different type of respondent than incentives alone. Reminders have a larger effect on participation in the no-incentive group than in the high-incentive group, in both absolute and proportional terms. Early respondents (responding to initial contact) had lower pre-lockdown earnings and employment than late respondents (responding to reminders). This implies that the two types of unobservables, resistance to incentive and probability of receiving the invitation, are associated with outcomes that move in opposite directions, producing a non-monotone marginal survey response function that single-dimensional models cannot capture.

Q: How does the two-dimensional model work and what are its results? A: The model distinguishes active nonparticipants (who saw the invitation and declined because the incentive was too low; they are more likely to be employed and to be higher earners) from passive nonparticipants (who did not receive or attend to the invitation; they are more likely to have been adversely affected by the lockdown). By exploiting both the randomized incentive variation and the timing of responses (initial contact vs. reminder), the model partially identifies population mean outcomes under shape restrictions on the joint distribution of the two unobservables. For pre-lockdown employment, the model produces bounds of [48%, 63%] bracketing the true value of 57%, compared to worst-case bounds of [34%, 83%] and monotone selection bounds that do not contain the truth. Improvements are largest for pre-lockdown level outcomes, where the two types of nonparticipants differ most.

Q: What are the practical recommendations for survey researchers? A: Embedding randomized incentives in surveys at little or no additional cost enables an inexpensive test for nonresponse bias that does not require linked administrative data. When such a test detects bias, researchers should apply the two-dimensional model rather than relying on observable-based reweighting or conventional selection models. The question of who participates matters at least as much as how many participate; surveys should be designed to characterize and correct for selection, not merely to maximize response rates.

Nonresponse bias: The difference between the mean response among survey participants and the true population mean, arising when the decision to participate is correlated with the outcome of interest. Distinct from sampling bias; it persists even with a randomly drawn sample.

Selection on unobservables: Nonresponse bias that remains after conditioning on all observed characteristics. In the NiK survey, marginal and inframarginal participants are indistinguishable on observable demographics but differ dramatically in labor market outcomes, providing direct evidence that unobservables drive selection.

Marginal vs. inframarginal participants: Under the Imbens-Angrist monotonicity condition, inframarginal participants would respond at any incentive level; marginal participants respond only at higher incentive levels. Their average responses are separately identified using an IV regression with the incentive as instrument.
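
With two incentive levels, the IV estimate of the marginal participants' mean reduces to a Wald-type ratio, sketched below. The function name `marginal_mean` is hypothetical, and the high-incentive participant mean of 4,030 USD is an assumed value chosen for illustration. The participation rates (45.7% and 51.7%) and the no-incentive mean (3,666 USD) are taken from the summary above.

```python
def marginal_mean(p_low, mean_low, p_high, mean_high):
    """Mean outcome of participants who respond under the high incentive but
    not under the low one, assuming monotone response to incentives.

    p_*    : participation rate in each incentive group
    mean_* : mean outcome among participants in each incentive group
    """
    if p_high <= p_low:
        raise ValueError("the high incentive must raise participation")
    # High-incentive participants are a mixture of inframarginal (share p_low)
    # and marginal (share p_high - p_low) respondents; solve for the latter.
    return (p_high * mean_high - p_low * mean_low) / (p_high - p_low)

inframarginal = 3666.0   # mean among no-incentive participants (USD/month)
marginal = marginal_mean(p_low=0.457, mean_low=inframarginal,
                         p_high=0.517, mean_high=4030.0)  # 4030 is assumed
print(f"marginal participants: about {marginal:.0f} USD/month")
```

Because the marginal group is only about 6 percentage points of the sample, a modest gap between the two participant means translates into a large implied mean for marginal participants.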

Marginal survey response (MSR): The function m(u) = E[Y*_i | U_i = u], giving the average outcome for individuals at the uth quantile of willingness to participate. The MSR is nonparametrically identified for u in [0, p(z_high)]; its empirically non-monotone shape in the NiK data explains why monotone selection assumptions produce bounds that miss the ground truth.

Active vs. passive nonparticipants: Active nonparticipants received the survey invitation and declined because the incentive was insufficient; they tend to have higher labor market attachment. Passive nonparticipants never received or attended to the invitation but may respond to reminders; they tend to have been more adversely affected by the lockdown. This distinction motivates the two-dimensional model.

Two-dimensional participation model: A model of survey participation with two unobservables — resistance to incentive (determining active nonresponse) and probability of receiving the invitation (determining passive nonresponse). By exploiting both incentive randomization and the timing of responses (initial contact vs. reminder), the model produces bounds or point estimates on population means that are narrower and closer to ground truth than single-dimensional alternatives.

Exclusion restriction for incentives: The assumption that randomly assigned incentives affect participation rates but do not directly affect participants’ answers to survey questions. This is required for incentives to serve as valid instruments for testing and correcting nonresponse bias; the authors test and find no evidence that it is violated.

The Surrogate Index: Combining Short-Term Proxies to Estimate Long-Term Treatment Effects More Rapidly and Precisely

This paper addresses a fundamental challenge in program evaluation: primary outcomes of interest — such as lifetime earnings or long-term employment — are often observed only with lengthy delays, forcing researchers to rely on short-term outcomes when making timely policy decisions. The authors develop a formal framework for combining multiple short-term proxy outcomes (surrogates) into a single “surrogate index” that, under stated assumptions, identifies the average treatment effect on the long-run primary outcome.

The methodological contribution rests on three key assumptions. First, Unconfoundedness: treatment assignment in the experimental sample is ignorable conditional on pre-treatment variables. Second, Surrogacy (Prentice 1989): the long-term primary outcome is independent of the treatment conditional on the surrogates — formally, Wi ⊥⊥ Yi | Si, Xi, Pi=E — meaning the entire causal path from treatment to primary outcome runs through the surrogates. Third, Comparability: the conditional distribution of the primary outcome given surrogates and pre-treatment variables is identical across the experimental and observational samples. This last assumption is novel relative to the prior surrogacy literature, which implicitly relied on it without formal statement.

The paper operates with two distinct samples. The experimental sample contains treatment assignment and surrogate outcomes but not the long-term primary outcome. The observational sample contains surrogates and primary outcomes but not treatment assignment. The surrogate index is defined as the conditional expectation of the primary outcome given surrogates and pre-treatment variables estimated in the observational sample, µ(s,x,O) = E[Yi|Si=s, Xi=x, Pi=O]. Under all three assumptions, the average treatment effect on this index equals the average treatment effect on the primary outcome. Under a linear specification, the estimator reduces to multiplying the vector of treatment effects on surrogates (from the experimental sample) by the regression coefficients predicting the primary outcome from surrogates (from the observational sample).
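
The linear case can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not the authors' code: there are no pre-treatment covariates, treatment is randomized so a difference in means estimates the effect on each surrogate, and the tiny data set and the helper names `solve`, `ols`, and `mean` are invented for the example.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def ols(rows, y):
    """OLS coefficients [intercept, b1, ..., bk] via the normal equations."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)

def mean(values):
    return sum(values) / len(values)

# Observational sample: two surrogates and the long-run outcome (toy data,
# generated as y = 2 + 3*s1 + 0.5*s2 with no noise).
obs_s = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 2)]
obs_y = [2.0, 5.0, 2.5, 5.5, 8.5, 6.0]

# Experimental sample: surrogates by treatment arm, long-run outcome unobserved.
treated_s = [(2, 1), (1, 2), (2, 2)]
control_s = [(1, 1), (0, 1), (1, 0)]

beta = ols(obs_s, obs_y)                       # [intercept, coef s1, coef s2]
delta = [mean([s[k] for s in treated_s]) - mean([s[k] for s in control_s])
         for k in range(2)]                    # treatment effect on each surrogate
tau_hat = sum(b * d for b, d in zip(beta[1:], delta))
print(f"estimated long-run effect = {tau_hat:.2f}")
```

The intercept drops out because it is common to both arms. Only the surrogate coefficients and the arm-level differences in surrogates enter the estimate.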

The paper derives semiparametric efficiency bounds, demonstrating that exploiting the surrogacy assumption — by replacing actual outcomes Yi with the predicted surrogate index µ(Si,Xi,O) — yields strictly lower variance than a standard randomized experiment that directly observes the primary outcome. The precision gain equals the variance of the residual Yi − µ(Si,Xi,O).

The authors also characterize bias when Surrogacy or Comparability fail. Crucially, even without these assumptions, the estimators consistently estimate a well-defined causal quantity — the average treatment effect on the surrogate index — providing a principled aggregation of intermediate outcomes. Formal bounds on the extent of bias are derived; without bounded outcomes, these bounds are uninformative, but with binary outcomes or bounded violations, sharp intervals are available.

The empirical application uses the Greater Avenues to Independence (GAIN) job training program, a randomized trial in California. The experimental sample is Riverside (N_{E,T} = 4,405 treated, N_{E,C} = 1,040 control), with 36 quarters of post-assignment outcomes. The observational sample pools three other counties (Alameda, Los Angeles, San Diego; N_O = 13,725). Long-run benchmarks are a 6.4 percentage point (s.e. 1.2 pp) increase in mean quarterly employment rates and a $249 (s.e. $83) increase in mean quarterly earnings, each averaged over 36 quarters. All three surrogate-based estimators (surrogate index, surrogate score, influence function) fall within two standard errors of these benchmarks when surrogates include as few as 5 quarters of employment, earnings, and aid outcomes. By 6 quarters, the surrogate index estimate for employment is 0.061 (s.e. 0.006) versus the 0.064 benchmark. The “naive” estimator, which simply uses the treatment effect on short-run outcomes directly, requires more than 25 quarters before falling within two standard errors of the benchmark. The surrogate index achieves a 35% reduction in standard errors relative to directly waiting to observe the 9-year outcome.

Q: What is the surrogate index, precisely? A: The surrogate index is the conditional expectation of the primary outcome given surrogate outcomes and pre-treatment variables, estimated in the observational sample: µ(s,x,O) = E[Yi | Si=s, Xi=x, Pi=O]. It aggregates multiple short-term proxy variables into a scalar index through their predicted value for the long-run outcome. Under the Prentice Surrogacy assumption, the average treatment effect on this index equals the treatment effect on the primary outcome.

Q: What is the Prentice Surrogacy assumption, and why is it demanding? A: Surrogacy requires Wi ⊥⊥ Yi | Si, Xi, Pi=E — the long-run outcome is independent of the treatment conditional on the surrogates and pre-treatment variables. This means the surrogates must fully capture all causal pathways from treatment to outcome; any direct effect of the treatment on the primary outcome that does not pass through the measured surrogates violates the assumption. The authors note this is not testable in the two-sample setup because Yi and Wi are never jointly observed.

Q: What is the Comparability assumption, and why is it novel? A: Comparability requires Pi ⊥⊥ Yi | Si, Xi — the distribution of primary outcomes given surrogates and pre-treatment variables is identical across the experimental and observational samples. It formalizes the implicit condition under which the observational sample can be used to estimate the surrogate-to-outcome relationship that is then applied to the experimental sample. The authors state this assumption was not previously articulated in the surrogacy literature despite being implicitly relied upon.

Q: How does the paper handle violations of Surrogacy and Comparability? A: Theorem 4 shows that even without Surrogacy or Comparability (but maintaining Unconfoundedness), the estimators converge to a valid causal quantity: E[µ(Si(1),Xi,O) − µ(Si(0),Xi,O) | Pi=E], the average treatment effect on the surrogate index. The surrogacy-bias equals E[(µ(Si,1,Xi,E) − µ(Si,0,Xi,E)) · ρ(Si,Xi)(1−ρ(Si,Xi)) / (ρ(Xi)(1−ρ(Xi))) | Pi=E], which is small when the treatment explains little variation in Yi conditional on surrogates, or when the surrogate score is near zero or one. The comparability-bias depends on the product of the cross-sample discrepancy in the surrogate index and the deviation of the surrogate score from the propensity score.

Q: What are the efficiency gains from using surrogates? A: Theorem 2(ii) shows that in the limit as the observational sample grows large relative to the experimental sample, the efficiency bound using surrogates is strictly smaller than the Hahn (1998) bound for a direct randomized experiment. The gain equals E[(1−Wi)(Yi−µ(Si,Xi,O))²/(1−ρ(Xi))² + Wi(Yi−µ(Si,Xi,O))²/ρ(Xi)² | Pi=E] — the variance of the residual from predicting Yi with the surrogate index. Theorem 3 also characterizes the efficiency gain within a single sample from imposing the Surrogacy assumption itself, which equals E[σ²(Si,Xi,E) · ρ(Si,Xi)(1−ρ(Si,Xi)) / (ρ(Xi)²(1−ρ(Xi))²)].

Q: Why do multiple surrogates improve on a single surrogate? A: Multiple surrogates make the Surrogacy assumption more plausible, analogously to how multiple pre-treatment covariates make Unconfoundedness more plausible. If a treatment affects the primary outcome through several distinct causal channels (e.g., math skills, language skills, social skills), any single surrogate capturing only one channel leaves remaining pathways uncontrolled, producing bias. With multiple noisy measures of underlying mediators, even if no single observable fully satisfies Surrogacy, their combination removes more bias than any individual measure. The authors also illustrate via Figure 1.D that multiple surrogates reduce the “teaching to the test” problem, where improving a single measured surrogate does not translate to improvements in the primary outcome.

Q: What is the double matching estimator? A: For a treated unit i with covariates Xi and surrogates Si, the estimator first finds a control match j in the experimental sample based on Xi alone (so Xj ≈ Xi). It then finds, for each of units i and j, the nearest neighbor in the observational sample using both Xi and Si jointly, yielding observed outcomes Yi’ and Yj’. The estimated individual treatment effect is Yi’−Yj’, and the estimator averages these across the experimental sample. This mirrors standard matching under unconfoundedness but requires two layers of matching — within the experimental sample on pre-treatment variables, and into the observational sample on both pre-treatment variables and surrogates.
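
The two layers of matching can be sketched as follows. This is a minimal illustration with several assumptions: distances are unscaled squared Euclidean, matching is one-to-one nearest neighbor with replacement, only treated units are looped over, and the records and the names `nearest` and `double_matching` are hypothetical.

```python
def nearest(target, candidates, key):
    """Return the candidate whose key vector is closest to target
    (squared Euclidean distance, no scaling)."""
    return min(candidates,
               key=lambda c: sum((a - b) ** 2 for a, b in zip(key(c), target)))

def double_matching(treated, controls, observational):
    """Average of matched long-run outcome differences over treated units."""
    effects = []
    for unit_i in treated:
        # Layer 1: match treated unit i to a control j on covariates x only.
        unit_j = nearest(unit_i["x"], controls, key=lambda c: c["x"])
        # Layer 2: match i and j into the observational sample on (x, s) jointly.
        obs_i = nearest(unit_i["x"] + unit_i["s"], observational,
                        key=lambda o: o["x"] + o["s"])
        obs_j = nearest(unit_j["x"] + unit_j["s"], observational,
                        key=lambda o: o["x"] + o["s"])
        effects.append(obs_i["y"] - obs_j["y"])
    return sum(effects) / len(effects)

# Toy records: x = pre-treatment covariates, s = surrogates, y = long-run outcome.
treated = [{"x": (1.0,), "s": (3.0,)}, {"x": (2.0,), "s": (4.0,)}]
controls = [{"x": (1.1,), "s": (1.0,)}, {"x": (2.1,), "s": (2.0,)}]
observational = [
    {"x": (1.0,), "s": (3.0,), "y": 10.0},
    {"x": (1.0,), "s": (1.0,), "y": 4.0},
    {"x": (2.0,), "s": (4.0,), "y": 14.0},
    {"x": (2.0,), "s": (2.0,), "y": 8.0},
]

effect = double_matching(treated, controls, observational)
print(f"matched estimate = {effect:.1f}")
```

In practice covariates and surrogates would usually be standardized before computing distances, since unscaled distances let the variable with the largest units dominate the match.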

Q: What do the GAIN empirical results show quantitatively? A: The experimental benchmark for Riverside is a 6.4 pp (s.e. 1.2 pp) increase in mean quarterly employment and a $249 (s.e. $83) increase in mean quarterly earnings, each averaged over 36 quarters. The surrogate index estimator using 6 quarters yields estimates of 0.061 (s.e. 0.006) for employment and $238.8 (s.e. $31.5) for earnings — both within one standard error of the benchmark. All three surrogate-based estimators are within two standard errors of the benchmark at 5 quarters. The naive estimator (direct short-run effect) requires more than 25 quarters to come within two standard errors. The surrogate approach achieves a 35% reduction in standard errors relative to waiting for 9-year outcomes.

Q: How do the authors validate the Surrogacy and Comparability assumptions empirically? A: To test Surrogacy, they regress the primary outcome on pre-treatment variables, surrogates up to quarter t, and the treatment indicator in the Riverside experimental sample: a statistically significant treatment coefficient indicates a violation. Point estimates are large and significant for t ≤ 3 quarters; for t ≥ 4 most t-statistics fall below 2, though some remain slightly above 2 with small coefficient magnitudes. To test Comparability, they pool the experimental and observational samples and include an indicator for the experimental sample; significant coefficients on this indicator signal that the surrogate-to-outcome relationship differs across samples. The Comparability violation indicator remains statistically significant even with many surrogate periods, suggesting residual concern.

Q: How does the paper relate Surrogacy to the mediation and instrumental variables literatures? A: In mediation, all three variables (treatment, mediator, outcome) are observed in the same sample, and the goal is to decompose the total effect into direct and indirect components; Surrogacy corresponds to the case where the direct effect is zero by assumption. In the IV framework, the surrogate corresponds to the endogenous treatment, but an unobserved confounder between surrogate and outcome violates Surrogacy. The IV exclusion restriction (no direct effect of the instrument on the outcome) is the analog of Surrogacy’s requirement of no direct treatment effect on the primary outcome. The paper formalizes these analogies through directed acyclic graphs.

Q: What is the missing data interpretation of the key assumptions? A: The joint conditional independence Yi ⊥⊥ (Pi, Wi) | Si, Xi implies both Surrogacy and Comparability simultaneously. This is closely related to the Missing at Random (MAR) assumption: the missingness of Yi in the experimental sample and of Wi in the observational sample is determined entirely by the observed surrogates and pre-treatment variables. This “data fusion” interpretation allows insights from the missing data literature, including semiparametric efficiency results, to apply directly.

Q: What is the proposed strategy for building credibility across studies? A: The authors advocate constructing a “library” of surrogate indices by systematically cataloging, across multiple studies in a given domain, the smallest set of surrogates that reliably matches long-run treatment effects. If six quarters of employment and earnings data are established across multiple job training programs to predict 9-year impacts — as the cross-site GAIN comparisons suggest — then future job training evaluations could credibly report long-run impact estimates after only six quarters. The empirical application is presented as one element of such a library.

Surrogate Index: The conditional expectation of the primary outcome given surrogate outcomes and pre-treatment variables, estimated in the observational sample — µ(s,x,O) = E[Yi|Si=s, Xi=x, Pi=O]. It aggregates multiple short-term proxy variables into a scalar that, under Surrogacy and Comparability, identifies the average treatment effect on the long-run outcome.

Prentice Surrogacy Assumption: The condition Wi ⊥⊥ Yi | Si, Xi, Pi=E — the long-run primary outcome is independent of the treatment conditional on the surrogates and pre-treatment variables. Operationally, this requires that all causal pathways from treatment to primary outcome pass through the measured surrogates, with no direct effect remaining.

Comparability Assumption: Pi ⊥⊥ Yi | Si, Xi — the conditional distribution of the primary outcome given surrogates and pre-treatment variables is identical in the experimental and observational samples. This formalizes the condition under which the observational sample’s surrogate-to-outcome relationship can be transported to the experimental sample.

Surrogate Score: The conditional probability of treatment given surrogates and pre-treatment variables in the experimental sample, ρ(s,x) = Pr(Wi=1|Si=s, Xi=x, Pi=E). It plays a role in the surrogate framework analogous to that of the propensity score under unconfoundedness: if Surrogacy holds conditional on (Si,Xi), it also holds conditional on the surrogate score alone.

Sampling Score: The conditional probability of belonging to the experimental sample given surrogates and pre-treatment variables, φ(s,x) = Pr(Pi=E|Si=s, Xi=x). It appears in the surrogate score estimator and the influence function, where it reweights observations from the observational sample toward the experimental sample distribution.

Double Robustness: The influence function estimator is doubly robust: it remains consistent if either (a) the conditional outcome models µ(s,x,O) and µ(w,x) are correctly specified regardless of the score models, or (b) the surrogate score ρ(s,x), propensity score ρ(x), and sampling score φ(s,x) are correctly specified regardless of the outcome models.

Surrogacy Bias: The bias arising when Surrogacy fails while Comparability holds, equal to E[(µ(Si,1,Xi,E) − µ(Si,0,Xi,E)) · ρ(Si,Xi)(1−ρ(Si,Xi)) / (ρ(Xi)(1−ρ(Xi))) | Pi=E]. It is driven by the product of the direct treatment effect on the outcome (conditional on surrogates) and a measure of how much the surrogates explain treatment assignment.