A Model of Multiple Hypothesis Testing
What this paper finds — and why it matters
This paper develops an economic framework for determining when and how much multiple hypothesis testing (MHT) adjustment is warranted in research settings. The research question is: under what conditions do MHT adjustments arise as an optimal solution to incentive misalignment between a researcher and a mechanism designer (social planner)?
The model is a two-stage game. In the first stage, a benevolent social planner commits to a hypothesis testing protocol. In the second stage, a researcher decides whether to conduct a pre-specified experiment based on private costs and benefits. The planner’s utility function combines an ambiguity-averse (maximin) component—limiting harm from mistaken conclusions—with an expected-utility component capturing the generic benefits of research production. The framework focuses on multiplicity arising from testing multiple treatments or estimating effects within multiple subpopulations; multiple outcomes are treated as an economically distinct case covered in a companion paper.
The main theoretical result is that separate t-tests are uniformly globally optimal under linearity of the researcher’s payoff and welfare functions and normality of test statistics. The optimal critical value takes the explicit form: t(J, Σ) = Φ⁻¹(1 − C(J, Σ) / (b · |J|)), where |J| is the number of hypotheses, C(J, Σ) is the experiment cost, and b is the researcher’s per-rejection benefit. This formula nests two limiting cases. When costs are fully fixed (invariant to |J|), the formula delivers a Bonferroni correction. When costs scale proportionally with the number of hypotheses, no MHT adjustment is warranted—because the researcher already faces sufficient deterrent from the incremental cost of each additional test.
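The formula and its two limiting cases can be checked numerically. The following is a minimal sketch, not the paper's code: the function names, the normalization b = 1, and the single-test cost of 0.05 are illustrative assumptions chosen so that one test runs at the 5% level.

```python
from statistics import NormalDist


def optimal_level(cost: float, benefit: float, n_hypotheses: int) -> float:
    """Size of each separate test: alpha = C / (b * |J|)."""
    return cost / (benefit * n_hypotheses)


def optimal_critical_value(cost: float, benefit: float, n_hypotheses: int) -> float:
    """Critical value t = Phi^{-1}(1 - C / (b * |J|)) for a standard normal statistic."""
    return NormalDist().inv_cdf(1.0 - optimal_level(cost, benefit, n_hypotheses))


# Hypothetical inputs (assumptions, not values from the paper).
BENEFIT = 1.0
SINGLE_TEST_COST = 0.05

for n in (1, 2, 5, 10):
    fixed_cost = SINGLE_TEST_COST              # cost invariant to |J|
    proportional_cost = SINGLE_TEST_COST * n   # cost proportional to |J|
    print(
        n,
        round(optimal_level(fixed_cost, BENEFIT, n), 4),                  # Bonferroni-type level
        round(optimal_critical_value(fixed_cost, BENEFIT, n), 3),
        round(optimal_level(proportional_cost, BENEFIT, n), 4),           # stays at benchmark
        round(optimal_critical_value(proportional_cost, BENEFIT, n), 3),
    )
```

Under fixed costs the level falls as 0.05/|J| and the critical value rises; under proportional costs both stay at their single-test values.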
The key economic mechanism is as follows. In the worst states of the world (where all treatments are harmful relative to the status quo), a research study has only downside risk for society. The planner must keep the researcher’s expected payoff from false positives low enough that she chooses not to experiment. If critical values were invariant to |J|, for sufficiently many hypotheses the researcher’s expected payoff from false positives alone would exceed costs, inducing unwanted experimentation. Some upward adjustment to critical values (i.e., tighter thresholds) is therefore generically optimal. The same logic implies that critical values should also depend on sample size: larger samples raise costs, so the deterrence constraint is met at a lower critical value.
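The deterrence argument can be illustrated with a small calculation. This sketch assumes a fully fixed cost and evaluates the researcher's expected payoff at the boundary of the null space, where each separate test rejects with probability equal to its level; the numbers (b = 1, cost 0.12, level 5%) are hypothetical.

```python
def worst_case_payoff(n_hypotheses: int, level: float, benefit: float, cost: float) -> float:
    """Expected payoff from false positives alone: b * |J| * alpha - C."""
    return benefit * n_hypotheses * level - cost


def first_profitable_count(level: float, benefit: float, cost: float, max_hypotheses: int = 1000):
    """Smallest |J| at which experimenting pays even though every rejection is false."""
    for n in range(1, max_hypotheses + 1):
        if worst_case_payoff(n, level, benefit, cost) > 0:
            return n
    return None


# Hypothetical inputs (assumptions, not values from the paper).
BENEFIT = 1.0
FIXED_COST = 0.12
UNADJUSTED_LEVEL = 0.05

# With a level that ignores |J|, false positives eventually cover the cost.
print("unadjusted level becomes profitable at |J| =",
      first_profitable_count(UNADJUSTED_LEVEL, BENEFIT, FIXED_COST))

# With the cost-adjusted level C / (b * |J|), the worst-case payoff stays at zero.
for n in (1, 3, 10):
    adjusted_level = FIXED_COST / (BENEFIT * n)
    print(n, worst_case_payoff(n, adjusted_level, BENEFIT, FIXED_COST))
```

With these inputs the unadjusted payoff turns positive at three hypotheses, while the adjusted level keeps it at zero (up to floating-point error) for every |J|.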
The framework is calibrated to two empirical applications. For FDA clinical trial approval, using Sertkaya et al. (2016) data on approximately 31,000 U.S. pharmaceutical trials (2004–2012), fixed costs constitute approximately 46% of average total trial cost. At a benchmark significance level of 5% and benchmark sample size, the optimal level is approximately 3.2% for two tests, 2.6% for three tests, and asymptotes to approximately 1.4% as |J| → ∞. Sidak’s correction yields 2.5% and 1.7% for two and three tests respectively, and tends to zero as |J| → ∞—more conservative than the model implies. Optimal adjustments must also be less conservative for larger samples to preserve researcher incentives to bear the correspondingly larger costs.
For program evaluation in development economics, the paper uses a unique dataset of funding proposals submitted to J-PAL from 2009 to 2021. The estimated cost elasticity with respect to the number of treatment arms ranges from 0.13 to 0.22 (p < 0.05), indicating costs rise significantly but far less than proportionally. The implied optimal significance levels are slightly less conservative than Bonferroni/Sidak corrections but more conservative than unadjusted testing.
Scope conditions: the framework assumes pre-specified experiments (no p-hacking), linear payoffs, normally distributed statistics, and a researcher whose preferences are common knowledge. The analysis focuses on multiple treatments and subpopulations, not multiple outcomes. Results extend to imperfectly informed researchers and heterogeneous variances.
Q: What is the core mechanism by which MHT adjustments arise as optimal in this framework? A: The planner must deter experimentation in the worst-case states—those where all treatments are harmful. If the testing protocol did not adjust for the number of hypotheses, a researcher testing sufficiently many hypotheses could earn enough expected payoff from false positives alone to justify experimentation, even when all treatments are truly harmful. Tighter critical values (higher thresholds) reduce the probability of false positives and thus cap the researcher’s expected payoff in the null space, deterring unwanted experimentation. This is the maximin optimality condition: the researcher’s expected payoff must be non-positive over the null space.
Q: What are the two limiting cases of the optimal critical value formula, and what do they correspond to? A: The optimal level of the separate t-tests is α(J, Σ) = C(J, Σ) / (b · |J|). When C(J, Σ) = b · ᾱ (costs are fixed, invariant to the number of hypotheses), this reduces to ᾱ/|J|, the Bonferroni correction. When C(J, Σ) = b · ᾱ · |J| (costs scale proportionally with the number of hypotheses), the optimal level equals ᾱ regardless of |J|, so no MHT adjustment is warranted. The intuition for the second case is that proportional costs already deter excess testing: each additional hypothesis adds the same incremental cost, so the researcher has no undue incentive to test many hypotheses.
Q: Why do optimal critical values also depend on sample size, and what is the policy implication? A: Since research costs C(J, Σ) increase with sample size (Σ captures design features including sample size), the optimal test level α(J, Σ) = C(J, Σ)/(b·|J|) rises with sample size. Equivalently, larger studies warrant less conservative significance thresholds. The policy implication is that a single uniform correction (e.g., Bonferroni at the 5% level) applied without regard to sample size is suboptimal: it is too conservative for large studies and would over-deter valuable high-powered research.
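A short sketch of the sample-size comparative static follows. The cost function (a fixed component plus a constant cost per observation) and all numbers are assumptions made for illustration; the source only states that costs increase with sample size.

```python
from statistics import NormalDist


def level_for_sample_size(n_obs: int, n_hypotheses: int, benefit: float,
                          fixed_cost: float, cost_per_obs: float) -> float:
    """alpha = C / (b * |J|) with an assumed cost C = fixed_cost + cost_per_obs * n_obs."""
    cost = fixed_cost + cost_per_obs * n_obs
    return cost / (benefit * n_hypotheses)


# Hypothetical inputs: two hypotheses, b = 1.
sample_sizes = (500, 1000, 2000)
levels = [level_for_sample_size(n, 2, 1.0, 0.03, 0.00002) for n in sample_sizes]
critical_values = [NormalDist().inv_cdf(1.0 - a) for a in levels]

for n, a, t in zip(sample_sizes, levels, critical_values):
    print(n, round(a, 4), round(t, 3))
```

The level rises and the critical value falls as the sample grows, which is the sense in which larger studies warrant less conservative thresholds.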
Q: What are the two optimality properties required of protocols in the paper’s main characterization? A: The paper shows (Proposition 3.1) that a protocol is uniformly globally optimal—optimal for all values of the welfare weight λ and prior π—if and only if it is both maximin optimal and unbiased. Maximin optimality (Proposition 3.2) requires two conditions: the researcher’s expected payoff must be non-positive over the null space (deterring experimentation when all treatments are harmful), and expected welfare must be non-negative when some treatments are beneficial. Unbiasedness requires that the researcher’s maximum power strictly exceeds the test size, ensuring that experimentation is motivated when treatments are genuinely beneficial.
Q: How does the paper rationalize conventional hypothesis testing asymmetry (type I vs. type II error weighting) without extreme restrictions? A: In Tetenov (2012), justifying 5%-level testing with minimax regret in a single-agent model requires the decision-maker to place 102 times more weight on type I than type II regret—an extreme restriction. In this paper, the asymmetry arises naturally from the planner’s desire to prevent harmful treatment implementation: the planner is willing to forgo some power (probability of detecting beneficial treatments) to ensure that harmful treatments are not implemented. The researcher’s private incentives and the planner’s objective diverge in a way that makes tight size control endogenously optimal.
Q: What does the FDA empirical calibration imply quantitatively about optimal versus standard adjustments? A: Using Sertkaya et al. (2016) data showing that fixed costs are 46% of average total trial cost for U.S. pharmaceutical trials, and using Pocock et al. (2002) to set J̄ = 3 (average number of subgroups), the paper calculates that at a benchmark level of ᾱ = 0.05: the optimal level is approximately 3.2% for two tests, 2.6% for three tests, and asymptotes to approximately 1.4% as |J| → ∞. By contrast, Sidak’s correction yields 2.5%, 1.7%, and zero, respectively. Both the unadjusted 5% and the Sidak/Bonferroni levels are therefore suboptimal—the unadjusted level is too permissive while standard FWER corrections are too conservative.
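The reported figures can be reproduced with a few lines of arithmetic. The sketch below is one reading of the calibration, not the paper's code: it assumes a linear cost C(|J|) = F + v·|J| whose fixed share is 46% at J̄ = 3 tests, and it pins down b by requiring that a single test runs at ᾱ = 5%. Function and parameter names are illustrative.

```python
def linear_cost(n_tests: float, fixed_share: float = 0.46, ref_tests: int = 3) -> float:
    """Assumed cost F + v * n, normalised so total cost at ref_tests equals 1."""
    per_test = (1.0 - fixed_share) / ref_tests
    return fixed_share + per_test * n_tests


def optimal_level(n_tests: int, benchmark_level: float = 0.05) -> float:
    """alpha(n) = C(n) / (b * n), with b chosen so that alpha(1) = benchmark_level."""
    benefit = linear_cost(1) / benchmark_level
    return linear_cost(n_tests) / (benefit * n_tests)


def limiting_level(benchmark_level: float = 0.05, fixed_share: float = 0.46,
                   ref_tests: int = 3) -> float:
    """Limit of alpha(n) as n grows: only the per-test cost matters."""
    per_test = (1.0 - fixed_share) / ref_tests
    return benchmark_level * per_test / linear_cost(1, fixed_share, ref_tests)


def sidak_level(n_tests: int, benchmark_level: float = 0.05) -> float:
    """Sidak per-test level: 1 - (1 - alpha)^(1/n)."""
    return 1.0 - (1.0 - benchmark_level) ** (1.0 / n_tests)


for n in (2, 3):
    print(n, f"optimal {optimal_level(n):.1%}", f"Sidak {sidak_level(n):.1%}")
print("limit", f"{limiting_level():.1%}")
```

Under these assumptions the optimal levels come out at roughly 3.2%, 2.6%, and 1.4% in the limit, against Sidak levels of roughly 2.5% and 1.7%, matching the figures quoted above.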
Q: What do the J-PAL data reveal about optimal MHT adjustment in program evaluation? A: Using the universe of J-PAL funding proposals from 2009 to 2021, the paper estimates the cost elasticity with respect to the number of treatment arms to be 0.13–0.22, which is statistically significant (p < 0.05) but far below 1 (the proportional case). This means costs rise with arms but much less than proportionally. As a result, optimal significance levels for program evaluation studies are slightly less conservative than Sidak/Bonferroni corrections (e.g., approximately 3.8–4.5% versus 2.5% for a two-arm study with ᾱ = 5%) but more conservative than unadjusted testing. The testing thresholds also vary moderately with sample size, with larger samples implying less conservative procedures.
Q: When are cross-study MHT adjustments warranted according to the framework? A: Cross-study MHT adjustments are warranted only when there are cost complementarities across those studies. If studies are conducted independently with separate cost structures, each study’s costs do not depend on the number of hypotheses tested in other studies, so no cross-study adjustment is optimal. This provides a principled resolution to the disputed question of whether researchers should correct for tests performed in other papers.
Q: When is FWER control (e.g., Bonferroni or Sidak) the appropriate form of MHT adjustment? A: Appendix B.2 shows that FWER control is appropriate when the researcher’s payoff is nonlinear, specifically when the researcher requires at least one positive finding to receive any benefit (e.g., to publish). In the baseline linear-payoff model, the relevant criterion is average size control, and the Bonferroni level is the correct adjustment only when all costs are fixed. The broader insight is that the form of compound error control, whether average error rate or FWER, is itself determined by economic fundamentals rather than being a statistical choice made in advance.
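The contrast between the two payoff structures can be sketched as follows. This is an illustration under added assumptions, not the derivation in Appendix B.2: test statistics are taken to be independent, costs are fixed, and the threshold researcher receives b whenever at least one hypothesis is rejected. In that case the deterrence constraint b · P(at least one rejection) ≤ C binds at a Sidak-type level.

```python
def linear_payoff_level(cost: float, benefit: float, n_tests: int) -> float:
    """Per-test level when each rejection pays b: b * n * alpha = C."""
    return cost / (benefit * n_tests)


def threshold_payoff_level(cost: float, benefit: float, n_tests: int) -> float:
    """Per-test level when only the first rejection pays b (independent tests):
    b * (1 - (1 - alpha)^n) = C, i.e. a Sidak-type level with FWER = C / b."""
    return 1.0 - (1.0 - cost / benefit) ** (1.0 / n_tests)


def fwer_independent(level: float, n_tests: int) -> float:
    """Probability of at least one false rejection across n independent tests."""
    return 1.0 - (1.0 - level) ** n_tests


# Hypothetical inputs: fixed cost 0.05 and b = 1.
COST, BENEFIT = 0.05, 1.0
for n in (2, 4, 8):
    a_lin = linear_payoff_level(COST, BENEFIT, n)
    a_thr = threshold_payoff_level(COST, BENEFIT, n)
    print(n, round(a_lin, 5), round(a_thr, 5), round(fwer_independent(a_thr, n), 5))
```

The linear-payoff level holds the sum of per-test sizes at C/b, while the threshold-payoff level holds the family-wise error rate at C/b.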
Q: How does the paper extend to cases of heterogeneous variances across hypotheses? A: Proposition 5.2 shows that under heterogeneous variances, the optimal protocol uses separate t-tests based on sample-equalizing allocations—dividing the sample equally across treatment arms—with critical values t*(J, n(J)) = Φ⁻¹(1 − C(J, n(J))/(b·|J|)), where n(J) is the total sample size. This protocol remains maximin optimal and unbiased, preserving the main qualitative results.
Q: What does the paper contribute relative to Tetenov (2016) on single-hypothesis testing? A: Tetenov (2016) showed that in the single-hypothesis case, the t-test is maximin optimal and uniformly most powerful (UMP) unbiased. This paper extends that result to multiple hypotheses, but two major complications arise: first, maximin optimality in the multi-hypothesis case requires verifying that welfare is non-negative even when treatment effects have opposite signs, which requires a non-trivial argument absent in the single-hypothesis case; second, no protocol is UMP unbiased in the multi-hypothesis case, so the paper develops a weaker notion of unbiasedness (power exceeding size) that is sufficient to motivate experimentation.
Q: Why do multiple outcomes require different procedures than multiple treatments or subpopulations? A: Multiple outcomes and multiple treatments are economically distinct types of multiplicity. For multiple outcomes that are noisy proxies for a common underlying quantity, the optimal rule tests an index formed using statistical weights (as in Anderson, 2008). When outcomes capture distinct components of the planner’s utility, economic weights are appropriate. In contrast, multiple treatments or subpopulations lead to separate t-tests with cost-adjusted critical values. Conflating these two forms of multiplicity leads to incorrect inferences about what procedures are appropriate.
Maximin optimality: A hypothesis testing protocol is maximin optimal if it maximizes the planner’s worst-case welfare across all parameter values, which is equivalent to two conditions: deterring researcher experimentation over the null space (where all treatments are harmful), and ensuring non-negative expected welfare when some treatments are beneficial.
Unbiasedness (in the paper’s sense): A protocol is unbiased if the researcher’s maximum achievable power strictly exceeds the test size, ensuring that experimentation is motivated when treatments are genuinely beneficial. This is a weaker condition than UMP unbiasedness, which does not exist in the multi-hypothesis case.
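The power-exceeds-size requirement can be illustrated for a single one-sided test. This sketch is a single-hypothesis simplification rather than the paper's multi-hypothesis definition: it assumes a normally distributed estimator with known standard error, and the effect sizes and level are hypothetical.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def rejection_probability(effect: float, std_error: float, level: float) -> float:
    """P(reject) for a one-sided z-test with critical value Phi^{-1}(1 - level),
    when the estimator is N(effect, std_error^2)."""
    critical_value = _STD_NORMAL.inv_cdf(1.0 - level)
    return 1.0 - _STD_NORMAL.cdf(critical_value - effect / std_error)


LEVEL = 0.025  # hypothetical test size
for effect in (-0.5, 0.0, 0.5, 2.0):
    print(effect, round(rejection_probability(effect, 1.0, LEVEL), 4))
```

The rejection probability equals the size at a zero effect, falls below it for harmful treatments, and exceeds it for beneficial ones, which is what gives the researcher a reason to experiment when treatments work.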
Uniform global optimality: A protocol is uniformly globally optimal if it maximizes the planner’s objective for all values of the welfare weight λ ≥ 0 and all priors π over the parameter space, making it robust to uncertainty about the relative importance of deterrence versus research motivation.
MHT correction factor: Defined as C(J, Σ) / (C̄ · |J|), where C̄ is the cost of a single-hypothesis benchmark study, this factor captures how the cost per test varies as the number of hypotheses grows. It equals 1/|J| (Bonferroni) when all costs are fixed, and equals 1 (no correction) when costs are proportional to the number of tests; the empirically appropriate correction lies strictly between these extremes.
Cost function C(J, Σ): The private cost borne by the researcher for conducting the experiment, which depends on both the set of treatments J and the experimental design Σ (including sample size). The degree of optimal MHT adjustment is a direct function of how this cost varies with the number of hypotheses tested.
Global null space Θ₀(J): The set of parameter vectors θ for which the welfare effect of implementing any combination of treatments is strictly negative—i.e., the status quo of no treatment dominates all interventions. Maximin optimality requires deterring researcher experimentation over this set.
Cost complementarities across studies: Cost structures in which conducting multiple studies together is cheaper than conducting them separately. Cross-study MHT adjustments are warranted if and only if such complementarities exist; absent complementarities, each study’s optimal threshold is set independently of others.