Forthcoming [Journal of Political Economy] doi:10.1086/742418

Optimal Tests Following Sequential Experiments

Karun Adusumilli


What this paper finds — and why it matters

This paper addresses a practical gap in the inference literature for sequential and adaptive experiments: while the design of such experiments has been studied extensively, there is little theory characterizing which tests are optimal once the experiment concludes. Adusumilli asks what the best hypothesis test looks like after a sequential experiment — a costly sampling design, a group sequential trial, or a bandit experiment — and whether the complexity of the adaptive protocol can be reduced to a manageable set of sufficient statistics for inference purposes.

The methodological core is the derivation of two Asymptotic Representation Theorems (ARTs). The first ART applies to stopping-time experiments, where the sampling rule is fixed in advance but the stopping time is fully adaptive (updated after every observation). The second ART allows the sampling rule itself to be adaptive, but requires that both the sampling and stopping decisions are updated only a finite number of times after observing batches of data. Both ARTs establish that the asymptotic power function of any test in the original sequential experiment can be matched by a test in a limit experiment in which a Gaussian process is observed for each treatment and inference is made on the drifts of those processes.

The key sufficiency result is a dimension reduction: regardless of the number of batches or the complexity of the adaptive protocol, any candidate test’s asymptotic power can be reproduced by a test that depends only on a fixed, finite set of statistics. For stopping-time experiments, the sufficient statistics are the stopped value of the score process (parametric) or the efficient influence function process (non-parametric), together with the stopping time. For batched experiments with adaptive sampling, the sufficient statistics are the final allocation proportions for each treatment (q_1, q_0) and the final values of the influence function processes (x_1, x_0) — a fixed dimension of 2d+2 regardless of the number of batches. This stands in contrast to the earlier ART of Hirano and Porter (2023), whose state variables grow linearly with the number of batches.

The paper then characterizes optimal tests within the limit experiment under several criteria. Under no restriction, the Neyman-Pearson lemma yields the uniformly most powerful (UMP) test for a point alternative. For testing linear combinations of the parameter vector, a further dimension reduction applies and a UMP test exists in the limit experiment, depending only on a scalar projection of the sufficient statistic. Under unbiasedness, any valid test must satisfy an orthogonality condition on the stopped process. Under an alpha-spending constraint — where the overall size alpha is pre-allocated across stages — optimal stage-specific thresholds are derived. Under a weighted average power criterion, the optimal test takes the form of a likelihood ratio statistic integrated against the weight function.

Three application classes are treated with explicit optimal procedures. For horizontal boundary designs (stopping when a test statistic crosses a fixed threshold, including the SPRT and the Neyman-allocation design from Adusumilli 2022), the most powerful asymptotically unbiased test rejects when the stopping time falls below a specific quantile of its null distribution. Monte Carlo simulations show the test achieves nominal 5% size even for small n, while the standard two-sample test has actual size near 9% in the same setting. For group sequential trials (including O’Brien-Fleming designs with T=2 stages), the paper derives stage-specific critical values satisfying the alpha-spending constraint, with numerical simulations confirming the asymptotic approximation is close to nominal for small n, though accuracy degrades for larger values of the null mean. For bandit experiments run with a batched Thompson-sampling algorithm (K=2 treatments, J=10 batches), the paper constructs the power envelope and shows it is asymmetric: distinguishing (a, 0) from (0, 0) is easier than distinguishing (-a, 0) from (0, 0) for a > 0, because Thompson sampling directs more observations to the arm with higher estimated mean, reducing informativeness from the other arm. Simulations confirm the asymptotic approximation is accurate for as few as n=20 observations per batch (200 total).

The framework covers both parametric and non-parametric models. The non-parametric setting replaces the score process with the efficient influence function process, and the asymptotic power bound translates directly. Results also apply to conditional power given the stopping time.

Q: What is the core methodological contribution of the paper? A: The paper derives two Asymptotic Representation Theorems (ARTs) showing that the asymptotic power function of any test following a sequential experiment can be matched by a test in a Gaussian-diffusion limit experiment. The first ART covers stopping-time experiments with fully adaptive stopping rules; the second covers batched experiments with adaptive sampling rules. These ARTs reduce the infinite-dimensional adaptive experiment to a tractable limit object.

Q: What are the sufficient statistics for inference, and why does this matter? A: For stopping-time experiments, the sufficient statistics are the stopped value of the score (parametric) or efficient influence function (non-parametric) process, together with the stopping time. For batched experiments with adaptive sampling over two treatments, the sufficient statistics are the final allocation fractions (q_1, q_0) and the final influence function process values (x_1, x_0), a fixed dimension of 2d+2. This matters because it establishes that all the adaptive complexity of the protocol can be discarded: a test that uses only these statistics is asymptotically as powerful as any test that uses the full sample path.

Q: How does this paper extend or differ from Hirano and Porter (2023)? A: Hirano and Porter (2023) derive an ART for batched sequential experiments whose state variables grow linearly with the number of batches, making the limit experiment increasingly complex. Adusumilli shows that only a fixed number of sufficient statistics (2d+2) are needed to match unconditional asymptotic power, irrespective of the number of batches. The paper also extends to non-parametric models, derives optimal conditional tests given stopping times, and covers fully adaptive stopping-time experiments via a different route (Le Cam 1979) that does not require the batching restriction.

Q: What is the result for testing linear combinations of the parameter? A: When the null hypothesis is H0: a^T h = 0 in the limit experiment, a further dimension reduction applies: the UMP test depends only on a scalar projection x-tilde(tau) = sigma^{-1} a^T I^{-1/2} x(tau) and the stopping time tau. Because under the null this projection is a standard Brownian motion evaluated at the stopping time, the test is pivotal and uniformly most powerful for the composite hypothesis, regardless of the nuisance components of h.
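The pivotality claim can be checked with a short calculation. This is a sketch under two assumptions not stated in the summary: that the limit process is x(t) = I^{1/2} h t + W(t) with W a standard d-dimensional Brownian motion, and that the normalization is sigma^2 = a^T I^{-1} a (inferred from the requirement that the projection have unit variance per unit time).

```latex
\begin{align*}
\tilde{x}(t) &= \sigma^{-1} a^{\top} I^{-1/2} x(t)
  = \sigma^{-1} (a^{\top} h)\, t + \sigma^{-1} a^{\top} I^{-1/2} W(t), \\
\operatorname{Var}\!\left( \sigma^{-1} a^{\top} I^{-1/2} W(t) \right)
  &= \sigma^{-2}\, a^{\top} I^{-1} a \; t = t .
\end{align*}
```

Under H0: a^T h = 0 the drift term vanishes, so the projected process has the law of a standard Brownian motion whatever the remaining components of h are.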

Q: What is the unbiasedness condition in the limit experiment? A: A test phi is unbiased if its power is at least its size under every alternative. In the Gaussian limit experiment, Proposition 2 shows that any unbiased test of H0: h=0 vs H1: h≠0 must satisfy the moment condition E_0[x(tau) phi(tau, x(tau))] = 0, which is obtained by differentiating the power function at h=0 and applying the unbiasedness constraint. This condition restricts which tests can be considered, and the optimal unbiased test is characterized within this class.
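The differentiation step can be sketched as follows. The derivation assumes the limit process x(t) = I^{1/2} h t + W(t), a test of exact size alpha, an invertible information matrix I, and enough regularity to differentiate under the expectation; the paper's own proof may proceed differently.

```latex
\begin{align*}
\beta(h) &= E_h\!\left[ \varphi(\tau, x(\tau)) \right]
  = E_0\!\left[ \varphi(\tau, x(\tau))
      \exp\!\left( h^{\top} I^{1/2} x(\tau) - \tfrac{\tau}{2}\, h^{\top} I h \right) \right], \\
\nabla_h \beta(h) \big|_{h=0} &= I^{1/2}\, E_0\!\left[ x(\tau)\, \varphi(\tau, x(\tau)) \right].
\end{align*}
```

The first line is the Girsanov change of measure. Unbiasedness means beta(h) >= alpha = beta(0) for all h, so h = 0 minimizes the power function and its gradient there must be zero, which gives E_0[x(tau) phi(tau, x(tau))] = 0.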

Q: What is the alpha-spending criterion and what does the paper show about it? A: Alpha-spending (introduced by Lan and DeMets, 1983) pre-allocates the total size alpha across T stages via a spending vector (alpha_1, …, alpha_T) with sum equal to alpha, and requires that the conditional rejection probability at stage t not exceed alpha_t. Theorem 2 shows that for discrete stopping times, the asymptotic conditional power beta_n(h|t) converges to beta(h|t) in the limit experiment on subsequences, enabling the derivation of optimal stage-specific thresholds satisfying the spending constraint.
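To make the spending idea concrete, here is a minimal sketch of the classic two-stage error-spending calculation on a standard Brownian motion under the null. It is not the paper's optimal procedure: the function name `two_stage_thresholds`, the interim time `t1`, the two-sided rejection regions, and the unconditional reading of the constraint (probability of stopping and rejecting at stage t equals alpha_t) are all assumptions made for illustration.

```python
import numpy as np
from statistics import NormalDist

def two_stage_thresholds(alpha1, alpha2, t1=0.5, n_sims=200_000, seed=0):
    """Monte Carlo thresholds that spend alpha1 at time t1 and alpha2 at time 1."""
    rng = np.random.default_rng(seed)
    # Stage 1: reject if |x(t1)| / sqrt(t1) > c1, which has null probability alpha1.
    c1 = NormalDist().inv_cdf(1.0 - alpha1 / 2.0)
    x1 = np.sqrt(t1) * rng.standard_normal(n_sims)
    x2 = x1 + np.sqrt(1.0 - t1) * rng.standard_normal(n_sims)
    continued = np.abs(x1) / np.sqrt(t1) <= c1
    # Stage 2: choose c2 so that P(continue and |x(1)| > c2) = alpha2.
    tail_given_continue = alpha2 / continued.mean()
    c2 = np.quantile(np.abs(x2[continued]), 1.0 - tail_given_continue)
    return float(c1), float(c2)
```

With alpha1 = 0.01 and alpha2 = 0.04 the final threshold lands slightly below the unadjusted two-sided value for a 4% test, because paths that already rejected at the interim look are removed from the second stage.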

Q: What is the key finding for horizontal boundary designs with a fixed sampling rule? A: For experiments that stop when the influence function process first crosses a fixed threshold gamma — including the SPRT and the Neyman-allocation costly-sampling design of Adusumilli (2022) — Lemma 1 establishes that the UMP asymptotically unbiased test of H0: mu_1 = mu_0 is the test that rejects when the stopping time tau-hat falls below the alpha-quantile of its null distribution. Monte Carlo evidence shows this test achieves nominal 5% size even for small n, while a naive two-sample test ignoring the adaptive stopping rule has actual size near 9%.
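The stopping-time test can be illustrated in the limit experiment with a small simulation. This is a sketch under stated assumptions, not the paper's implementation: the boundary is taken to be two-sided (stop when |W(t)| first reaches gamma), the null process is a scalar standard Brownian motion, the path is discretized with step `dt`, and the names `null_stopping_times` and `stopping_time_test` are hypothetical.

```python
import numpy as np

def null_stopping_times(gamma, n_paths=2000, dt=2e-3, max_time=10.0, seed=0):
    """Simulate tau = inf{t : |W(t)| >= gamma} for a standard Brownian motion W."""
    rng = np.random.default_rng(seed)
    steps = int(round(max_time / dt))
    # Paths that never reach the boundary are censored at max_time.
    taus = np.full(n_paths, max_time)
    for i in range(n_paths):
        path = np.cumsum(np.sqrt(dt) * rng.standard_normal(steps))
        hits = np.flatnonzero(np.abs(path) >= gamma)
        if hits.size:
            taus[i] = (hits[0] + 1) * dt
    return taus

def stopping_time_test(tau_observed, null_taus, alpha=0.05):
    """Reject when the observed stopping time is below the alpha-quantile of the null draws."""
    critical = float(np.quantile(null_taus, alpha))
    return bool(tau_observed < critical), critical
```

The intuition is that a nonzero drift pushes the process to the boundary faster than pure noise would, so an unusually early stop is evidence against the null.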

Q: What does the power envelope look like for Thompson-sampling bandit experiments, and why is it asymmetric? A: For Thompson-sampling bandit experiments with K=2 arms and J=10 batches, the power envelope for testing H0: (mu_1, mu_0) = (0, 0) is asymmetric: it is easier to distinguish the alternative (a, 0) from the null than to distinguish (-a, 0) for the same a > 0. The mechanism is that Thompson sampling allocates more observations to the arm with the higher estimated mean, so a positive treatment effect leads to more data for treatment arm 1 and less for arm 0, making the joint test more informative in one direction than the other.
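The allocation mechanism behind the asymmetry can be seen in a toy simulation. The sketch below assumes unit-variance Gaussian outcomes, independent N(0, prior_var) priors on each arm mean, and a binomial split of each batch according to the posterior probability that arm 1 is better; these modelling choices and the function name `batched_thompson` are illustrative assumptions, not details taken from the paper.

```python
import numpy as np
from statistics import NormalDist

def batched_thompson(mu1, mu0, batches=10, batch_size=20, prior_var=1.0, seed=0):
    """Run batched Thompson sampling on two arms and return q_1, arm 1's final share."""
    rng = np.random.default_rng(seed)
    means = np.array([mu0, mu1], dtype=float)
    counts = np.zeros(2)
    sums = np.zeros(2)
    for _ in range(batches):
        # Normal posterior for each arm mean (prior mean 0, unit outcome variance).
        post_var = 1.0 / (1.0 / prior_var + counts)
        post_mean = post_var * sums
        # Posterior probability that arm 1 has the larger mean.
        z = (post_mean[1] - post_mean[0]) / np.sqrt(post_var.sum())
        p1 = NormalDist().cdf(float(z))
        n1 = rng.binomial(batch_size, p1)
        for arm, n_arm in ((1, n1), (0, batch_size - n1)):
            draws = rng.normal(means[arm], 1.0, size=n_arm)
            counts[arm] += n_arm
            sums[arm] += draws.sum()
    return counts[1] / counts.sum()
```

Averaging the returned share over seeds for (mu1, mu0) = (a, 0) and (-a, 0) shows arm 1 receiving most of the sample in the first case and little of it in the second, which is the imbalance the summary attributes the asymmetric envelope to.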

Q: How accurate are the asymptotic approximations in finite samples? A: For horizontal boundary designs, Monte Carlo simulations show size is close to nominal 5% even for small n. For group sequential trials with an O’Brien-Fleming design (T=2 stages), the approximation is close to nominal for small n but degrades for larger values of the null mean mu-bar. For Thompson-sampling bandit experiments with K=2 arms and J=10 batches, the approximation is accurate for as few as n=20 observations per batch (200 total observations).

Q: How does the paper handle non-parametric models? A: In non-parametric settings, the sufficient statistic is the efficient influence function process x_n(t) = (sigma^{-1}/sqrt(n)) sum_{i=1}^{floor(nt)} psi(Y_i), where psi is the efficient influence function for the functional of interest and sigma^2 = E[psi^2]. Proposition 3 establishes that the asymptotic power of any test is bounded above by the power envelope in the Gaussian limit experiment indexed by this process. The non-parametric and linear-combination parametric cases share the same limit structure.
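The partial-sum formula translates directly into code. In this minimal sketch the function name `influence_process` is hypothetical, and sigma is replaced by the plug-in estimate sqrt(mean(psi^2)), which is an assumption; the summary defines sigma^2 as the population moment E[psi^2].

```python
import numpy as np

def influence_process(psi_values, t):
    """Evaluate x_n(t) = (1 / (sigma * sqrt(n))) * sum_{i <= floor(n t)} psi(Y_i)."""
    psi_values = np.asarray(psi_values, dtype=float)
    n = psi_values.size
    # Plug-in estimate of sigma = sqrt(E[psi^2]).
    sigma = np.sqrt(np.mean(psi_values ** 2))
    k = int(np.floor(n * t))
    return psi_values[:k].sum() / (sigma * np.sqrt(n))
```

For example, when the target functional is a mean and the null value is mu_0, one could pass psi(Y_i) = Y_i - mu_0 for each observation and evaluate the process at the fraction t of the sample seen so far.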

Q: What are the open questions identified by the author? A: Two main limitations are noted. First, the ART for adaptive sampling rules is established only for batched experiments; whether it extends to fully adaptive (non-batched) sampling rules without loss of power is conjectured but not formally verified. Second, for fully adaptive experiments, the alpha-spending characterization is not yet available, and the author suggests exploring invariance restrictions or conditional inference as alternative optimality criteria.

Asymptotic Representation Theorem (ART): A result showing that the asymptotic power function of any test in the original sequential experiment can be matched by that of a test in a Gaussian-diffusion limit experiment; used to transfer optimality results from the limit to the original problem.

Limit experiment (Gaussian diffusion): The limiting statistical model in which one observes a Gaussian process x(t) = I^{1/2} h t + W(t) for each treatment, with unknown drift vector h; inference on h in this experiment characterizes optimal tests in the original sequential experiment.
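A path from this limit experiment can be simulated on a time grid. The sketch below is illustrative only: the function name `simulate_limit_path`, the uniform grid, and the use of the symmetric matrix square root for I^{1/2} are assumptions.

```python
import numpy as np

def simulate_limit_path(h, info, horizon=1.0, steps=1000, seed=0):
    """Simulate x(t) = I^{1/2} h t + W(t) on a uniform grid over [0, horizon]."""
    rng = np.random.default_rng(seed)
    h = np.asarray(h, dtype=float)
    info = np.asarray(info, dtype=float)
    # Symmetric square root of the information matrix via its eigendecomposition.
    eigvals, eigvecs = np.linalg.eigh(info)
    info_sqrt = eigvecs @ np.diag(np.sqrt(eigvals)) @ eigvecs.T
    dt = horizon / steps
    drift = info_sqrt @ h
    # Independent Gaussian increments with mean drift*dt and variance dt per coordinate.
    increments = drift * dt + np.sqrt(dt) * rng.standard_normal((steps, h.size))
    path = np.vstack([np.zeros(h.size), np.cumsum(increments, axis=0)])
    times = np.linspace(0.0, horizon, steps + 1)
    return times, path
```

Applying a stopping rule to the returned path and recording the stopping time together with the path value at that time gives simulated draws of the sufficient statistics for a stopping-time experiment.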

Sufficient statistics (for sequential inference): The finite set of statistics that, in the limit experiment, capture all power-relevant information from the adaptive experiment: for stopping-time experiments, the stopped score/influence function process value and the stopping time; for batched adaptive experiments, the final allocation fractions (q_a) and final influence function values (x_a) for each treatment arm.

Alpha-spending constraint: A strengthened size requirement in group sequential trials that pre-allocates the total Type I error alpha across stages via a spending vector (alpha_1, …, alpha_T); requires that conditional rejection probability at each stage t not exceed alpha_t, and sum alpha_t = alpha.

Efficient influence function process: In a non-parametric model, the partial-sum process x_n(t) = (sigma^{-1}/sqrt(n)) sum_{i=1}^{floor(nt)} psi(Y_i), where psi is the efficient influence function for the target functional; this process is the non-parametric analogue of the score process and serves as the sufficient statistic for non-parametric sequential inference.

Stopping-time experiment: A sequential experiment in which the sampling rule (how to allocate observations across treatments) is fixed before the experiment begins but the stopping rule (when to terminate) is fully adaptive and updated after every observation.

Power envelope: The supremum of the asymptotic power function over all tests of a given size; computed in the limit experiment via the Neyman-Pearson lemma and the Girsanov theorem, and serves as an upper bound on the power of any feasible test in the original sequential experiment.
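In the simplest special case the envelope has a closed form. The sketch below assumes a scalar limit process observed over a fixed horizon T with no adaptive stopping, so that x(T) is distributed N(mu T, T) with mu the drift; this is a textbook benchmark rather than one of the paper's sequential envelopes, and the function name `fixed_horizon_envelope` is hypothetical.

```python
from statistics import NormalDist

def fixed_horizon_envelope(mu, horizon=1.0, alpha=0.05):
    """Neyman-Pearson power against drift mu when x(T) ~ N(mu * T, T) is observed."""
    z = NormalDist().inv_cdf(1.0 - alpha)
    # The most powerful test rejects for large values of sign(mu) * x(T) / sqrt(T).
    return 1.0 - NormalDist().cdf(z - abs(mu) * horizon ** 0.5)
```

At mu = 0 the function returns alpha, and it increases toward one as |mu| sqrt(T) grows. With a stopping time, the likelihood ratio depends on both x(tau) and tau, and the envelope generally has to be computed numerically.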

How this summary was made. Bibliographic fields are pulled from Crossref and OpenAlex and are not model-generated. The summary was drafted from the open-access manuscript, checked by a claim-grounding and calibration review pass, and approved before publishing. Corrections are welcome, especially from the authors.