Optimal Decision Rules When Payoffs are Partially Identified
What this paper finds — and why it matters
This paper derives asymptotically optimal statistical decision rules for discrete choice problems when the payoffs associated with some choices are only partially identified. The research question is: how should a decision maker who can bound but not point-identify a payoff-relevant parameter θ use data to make optimal policy choices?
The framework separates two parameter types. The reduced-form parameter µ is point-identified and can be estimated from data. The structural parameter θ — such as the average treatment effect (ATE) in a target population — is set-identified, meaning only that θ ∈ Θ0(µ) can be established, where the identified set is indexed by µ. The decision maker confronts both ambiguity (arising from partial identification of θ given µ) and statistical uncertainty (µ must be estimated).
The authors propose a hybrid optimality criterion that applies minimax reasoning to the partially-identified parameter θ — choosing actions that minimize maximum risk over Θ0(µ) — while applying average (integrated) risk minimization over µ, reflecting the asymmetric nature of the two identification problems. This asymmetric treatment follows the generalized Bayes-minimax principle of Hurwicz (1951).
The optimal decision rule is implemented by computing, for each action, the maximum risk (or regret) over θ ∈ Θ0(µ) conditional on µ, then averaging this maximum risk across either (i) a bootstrap distribution for an efficient estimator µ̂, (ii) a posterior distribution for µ in parametric models, or (iii) a quasi-posterior based on a limited-information criterion in semiparametric models. The optimal action is whichever choice has the smallest average maximum risk.
A central theoretical result (Theorems 1 and 4) establishes formal asymptotic optimality for both parametric and semiparametric settings: Bayes and quasi-Bayes decisions with any prior whose density is positive, bounded, and continuous are asymptotically equivalent and optimal. Critically, the optimality of these rules is asymptotically independent of the choice of prior for µ. The authors also establish a necessity result (Theorems 2 and 5): any decision rule not asymptotically equivalent to the Bayes or bootstrap rule is strictly sub-optimal.
A key finding is that “plug-in” rules, which substitute an efficient point estimate µ̂ directly into the oracle decision rule, can be sub-optimal. This failure occurs generically under partial identification because the maximum risk function R(d, µ) is typically only directionally differentiable (not fully differentiable) in µ, owing to max and min operators in intersection bounds, linear program value functions, or other bound constructions. When full differentiability holds, Corollary 1 confirms that plug-in rules are optimal; when it fails, plug-in rules need not be asymptotically equivalent to the Bayes rule and are then sub-optimal.
The empirical illustration demonstrates the practical consequence. For the decision whether to adopt a job-training program for German male youths, based on 14 RCT studies from Card, Kluve, and Weber (2017), the optimal rule recommends treatment (average quasi-posterior robust welfare contrast b̄n > 0) while the plug-in rule recommends against it (plug-in value b(µ̂) < 0). The two largest values of the study-level lower bound µ̂k − C‖x0 − xk‖ are −0.3190 (the US study) and −0.3298 (the Brazilian study). Their gap of 0.0108 is small relative to the average standard error of 0.034 across studies, so the distribution of the lower bound behaves like the maximum of two near-tied Gaussians and is right-skewed, which pushes b̄n positive even though b(µ̂) is negative.
The paper extends optimality theory to semiparametric models via a least favorable parametric submodel, introduces the concept of σ-optimality for cases where the integrated average excess risk criterion is infinite (which can occur when the dimension K of µ exceeds 1), and provides detailed implementation guides for treatment assignment under intersection bounds, IV-like estimands, and non-separable panel data, as well as for optimal pricing decisions where revealed-preference demand theory bounds counterfactual demand responses via linear programming.
Scope conditions: optimality results apply to discrete action spaces, require efficient estimation of µ, require the identified set Θ0(µ) to be known as a set-valued mapping, and assume no “first-order ties” (the oracle decision is unique at µ0). The asymptotic framework is local, mimicking the finite-sample problem where µ is not known with certainty.
Q: What is the core decision problem this paper addresses?
A: A decision maker must choose from a finite set of actions D = {0, 1, …, D}. Payoffs depend on a structural parameter θ that is only set-identified — the data can establish θ ∈ Θ0(µ) but not pin down θ exactly. The reduced-form parameter µ is point-identified and estimated from data. The decision maker faces both ambiguity (which θ in Θ0(µ) is true?) and sampling uncertainty (what is µ?). The paper asks how to construct decision rules that are optimal in large samples under this dual uncertainty.
Q: What is the proposed optimality criterion, and why is it asymmetric across parameters?
A: The criterion applies minimax reasoning to the partially-identified θ — the maximum risk over Θ0(µ) given µ is the relevant loss — and integrates this maximum risk over µ using Lebesgue measure on local perturbations h = √n(µ − µ0) of a fixed µ0. The asymmetry reflects the fact that θ is not updated by the data (the prior for θ is not identified), while µ can be learned efficiently from the data. Full minimax over both (θ, µ) is rarely tractable even for simple binary treatment problems; the asymmetric approach yields tractable optimal rules for a broad empirically relevant class of settings.
Q: What are the Bayes, bootstrap, and quasi-Bayes implementations of the optimal rule?
A: In all three cases, the decision maker computes R̄n(d), the average maximum risk for action d, and chooses the action that minimizes it. The Bayes rule averages R(d, µ) over the posterior πn(µ|Xn), obtained by Bayes’ theorem from a prior π on M. The bootstrap rule averages R(d, µ̂*) over bootstrap redraws µ̂* of the efficient estimator µ̂. The quasi-Bayes rule (for semiparametric models) averages R(d, µ) over a limited-information quasi-posterior that combines the Gaussian quasi-likelihood N(µ̂, (nÎ)^{−1}) with a prior for µ. All three implementations are asymptotically equivalent and optimal under the regularity conditions of Theorems 1 and 4.
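The recipe above can be sketched in a few lines of Python for the binary treatment case. This is a minimal illustration under stated assumptions, not the authors' implementation: the quasi-posterior is taken to be a product of independent Gaussians centered at µ̂ (flat prior, diagonal covariance), the lower bounds have the form µk − ck used in the empirical illustration, and the upper bounds are assumed to be the mirror image µk + ck. The function names (`max_regret`, `plug_in_rule`, `quasi_bayes_rule`) and the numbers in the usage lines are hypothetical.

```python
import random


def bounds(mu, offsets):
    """Intersection bounds on the ATE (assumed form: mu_k -/+ offset_k)."""
    b_lower = max(m - c for m, c in zip(mu, offsets))
    b_upper = min(m + c for m, c in zip(mu, offsets))
    return b_lower, b_upper


def max_regret(d, mu, offsets):
    """Maximum regret R(d, mu): (b_U)+ for d = 0 and -(b_L)- for d = 1."""
    b_lower, b_upper = bounds(mu, offsets)
    if d == 0:
        return max(b_upper, 0.0)   # (b_U)+
    return max(-b_lower, 0.0)      # equals -(b_L)- with (x)- = min(x, 0)


def choose(risks):
    """Treat (1) when its risk is weakly smaller, matching 'treat if b >= 0'."""
    return 1 if risks[1] <= risks[0] else 0


def plug_in_rule(mu_hat, offsets):
    """Evaluate the oracle rule at the point estimate mu_hat."""
    risks = [max_regret(d, mu_hat, offsets) for d in (0, 1)]
    return choose(risks), risks


def quasi_bayes_rule(mu_hat, std_errs, offsets, n_draws=20000, seed=0):
    """Average the maximum regret over quasi-posterior draws, then minimize."""
    rng = random.Random(seed)
    totals = [0.0, 0.0]
    for _ in range(n_draws):
        # Assumption: independent Gaussian draws around mu_hat (flat prior).
        mu = [rng.gauss(m, s) for m, s in zip(mu_hat, std_errs)]
        for d in (0, 1):
            totals[d] += max_regret(d, mu, offsets)
    avg = [t / n_draws for t in totals]
    return choose(avg), avg


# Hypothetical two-study example with well-separated bounds.
decision_qb, avg_risk = quasi_bayes_rule([0.5, 0.6], [0.05, 0.05], [0.2, 0.3])
decision_pi, plug_risk = plug_in_rule([0.5, 0.6], [0.2, 0.3])
```

In an application one would replace the independent draws with draws from the full quasi-posterior (or with bootstrap redraws of µ̂) and substitute the bound functions of the model at hand; only the "average the maximum risk, then minimize" structure is taken from the text.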
Q: What do Theorems 1 and 2 (and their semiparametric analogues Theorems 4 and 5) establish?
A: Theorem 1 establishes sufficiency: Bayes decisions with any prior in the class Π are asymptotically equivalent to each other and are optimal; any rule asymptotically equivalent to such a Bayes decision is also optimal. Theorem 2 establishes necessity: any rule in the admissible class D that is not asymptotically equivalent to the Bayes rule has strictly higher average excess risk at any µ0 where asymptotic equivalence fails. Together, these theorems fully characterize the class of asymptotically optimal rules and show that the Bayes/bootstrap class is not merely sufficient but also necessary for optimality.
Q: When are plug-in rules sub-optimal, and when are they optimal?
A: Plug-in rules substitute an efficient point estimate µ̂ directly into the oracle decision δo(µ̂). If R(d, µ) is fully differentiable at µ0 for all oracle-optimal actions d, then the directional derivative is linear and plug-in and Bayes rules are asymptotically equivalent; Corollary 1 confirms plug-in rules are then optimal. However, under partial identification, max and min operators in bound constructions — intersection bounds, linear program value functions, revealed-preference bounds — generically induce only directional (non-linear) differentiability of R(d, µ). In these cases asymptotic equivalence can fail, and Theorem 2 implies plug-in rules are sub-optimal. Manski (2021, 2023) documents poor finite-sample performance of plug-in rules numerically; the authors’ necessity result provides a general theoretical explanation under the asymptotic average risk criterion.
Q: How does the treatment assignment empirical illustration demonstrate the difference between optimal and plug-in rules?
A: Using data from Ishihara and Kitagawa (2021) with K = 14 RCT studies from Card, Kluve, and Weber (2017) and Lipschitz constant C = 0.25, the decision is whether to adopt a job-training program for German male youths or female youths in 2010 (GDP growth 3.48%, unemployment 9.45%). For male youths, the largest lower bound value µ̂k − C‖x0 − xk‖ is −0.3190 (US study) and the second-largest is −0.3298 (Brazilian study), separated by only 0.0108 against an average standard error of 0.034 across studies, so the lower bound distribution is right-skewed (maximum of two near-tied Gaussians). This right-skew pushes the quasi-posterior mean b̄n positive, yielding a treatment recommendation, while the plug-in value b(µ̂) is negative, yielding a non-treatment recommendation — a concrete reversal of the policy decision. For female youths, the minima and maxima are better separated, the distribution is near-Gaussian, and b̄n ≈ b(µ̂), so both rules agree on treatment.
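The size of the right-skew effect can be gauged with the closed-form mean of the maximum of two independent Gaussians. The sketch below is only a back-of-the-envelope check under assumptions that are not in the source: it treats the two leading lower-bound values as independent, assigns both the average standard error of 0.034, and ignores the other twelve studies and the upper bound. It therefore does not reproduce b̄n; it only shows that averaging moves the lower bound upward relative to the plug-in value.

```python
import math


def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def mean_of_max(m1, m2, s1, s2):
    """E[max(X1, X2)] for independent X1 ~ N(m1, s1^2), X2 ~ N(m2, s2^2)."""
    theta = math.sqrt(s1 * s1 + s2 * s2)
    a = (m1 - m2) / theta
    return m1 * norm_cdf(a) + m2 * norm_cdf(-a) + theta * norm_pdf(a)


# Values reported in the text; the common standard error is an assumption.
us_bound, brazil_bound, std_err = -0.3190, -0.3298, 0.034

plug_in_lower = max(us_bound, brazil_bound)            # plug-in lower bound
avg_lower = mean_of_max(us_bound, brazil_bound, std_err, std_err)
shift = avg_lower - plug_in_lower                      # upward shift from averaging
```

Under these assumptions the averaged lower bound is strictly larger than the plug-in lower bound, which is the direction of the effect described above. When the two leading values are far apart relative to their standard errors, the shift shrinks toward zero, consistent with the agreement of the two rules for female youths.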
Q: What are intersection bounds and why do they generate directional differentiability?
A: Intersection bounds arise when the ATE is bounded in K separate observational studies by lower bounds bL,k(µk) and upper bounds bU,k(µk). The combined identified set uses bL(µ) = max_{1≤k≤K} bL,k(µk) and bU(µ) = min_{1≤k≤K} bU,k(µk). Even if each component bound is smooth in µk, the max and min operators make bL and bU only directionally differentiable (not fully differentiable) in µ. The directional derivative is positively homogeneous of degree one but non-linear, which is the property that drives the wedge between Bayes and plug-in rules.
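A minimal worked case makes the non-linearity concrete. Assume, purely for illustration, K = 2 studies with identity component bounds bL,k(µk) = µk and a tie at µ0 = (m, m); none of these choices comes from the paper.

```latex
% Illustrative assumption: b_{L,k}(\mu_k) = \mu_k, K = 2, tie at \mu_0 = (m, m).
\begin{align*}
b_L(\mu) &= \max\{\mu_1, \mu_2\}, \\
\dot b_{L,\mu_0}[h]
  &= \lim_{t \downarrow 0} \frac{\max\{m + t h_1,\; m + t h_2\} - m}{t}
   = \max\{h_1, h_2\}, \\
\dot b_{L,\mu_0}[c\,h] &= c\,\dot b_{L,\mu_0}[h] \quad \text{for all } c \ge 0, \\
\dot b_{L,\mu_0}[h] + \dot b_{L,\mu_0}[-h]
  &= \max\{h_1, h_2\} - \min\{h_1, h_2\} = |h_1 - h_2|.
\end{align*}
```

The last line would be zero for every h if the derivative were linear, so it fails to be linear whenever h1 ≠ h2. Away from a tie (say µ1 > µ2 at µ0) the same limit gives h1, which is linear in h; this is why well-separated bounds cause no difficulty for plug-in rules.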
Q: How does the paper extend to semiparametric models, and what technical tool does it use?
A: In semiparametric models, the data distribution depends on both µ ∈ R^K and an infinite-dimensional nuisance parameter η. Integrating over local perturbations of η as well as µ raises measure-theoretic problems in infinite-dimensional spaces. The authors instead restrict attention to local perturbations of µ0 within a least favorable parametric submodel, which is the direction that makes the problem hardest. The averaging distribution is then a quasi-posterior that combines the Gaussian quasi-likelihood N(µ̂, (nÎ)^{−1}) with a prior for µ. Theorem 4 establishes optimality and Theorem 5 establishes necessity under these semiparametric conditions, mirroring the parametric Theorems 1 and 2.
Q: What is σ-optimality and why is it needed?
A: When the dimension K of µ exceeds 1, the integrated average excess risk criterion R({δn}; µ0) — which integrates over Lebesgue measure on R^K — may be infinite for all decision sequences in D, making the criterion uninformative. σ-optimality approximates the improper Lebesgue prior on h by a sequence of proper priors indexed by σ, and requires that the decision rule minimize the resulting criterion for all σ. Theorem 3 shows that the limiting behavior of σ-optimal rules coincides with that of the Bayes rule δ*n(·; π), preserving the practical implementation.
Q: How is the optimal pricing application structured and what role do revealed-preference bounds play?
A: A monopolist observes repeated cross-sections of individual demands across B budget sets and must choose a price vector from D = O ∪ C, where O contains observed prices and C contains counterfactual prices. For observed prices, average demand is identified; for counterfactual prices, only bounds are available. Following Kitamura and Stoye (2019), the space of goods is partitioned into GARP-compatible regions, and sharp bounds on counterfactual demand are computed by solving linear programs over the mass allocated to each region subject to GARP consistency constraints. The reduced-form parameter µ collects empirical choice probabilities across observed budget-region cells, estimated consistently by sample frequencies. The optimal pricing decision averages the linear-program bound solutions across quasi-posterior draws of µ.
Q: How does this approach relate to minimax and conditional Γ-minimax approaches?
A: Full minimax over (θ, µ) requires strong distributional assumptions and tractable finite-sample distributions; the authors note that no minimax treatment rule is known even for binary treatment with binary outcomes and estimated bounds. Conditional Γ-minimax (DasGupta and Studden, 1989; Giacomini, Kitagawa, and Read, 2021) fixes a prior for µ and takes minimax over the set of priors for θ conditional on µ; this is closely related to the authors’ approach but can be conservative when the marginal prior for µ varies. The authors’ framework fixes the marginal prior for µ and takes minimax over θ ∈ Θ0(µ) conditional on µ, which is shown to arise as the equilibrium of a two-player zero-sum game where adversarial nature chooses a prior for θ ∈ Θ0(µ) conditional on µ and the available data for µ.
Q: What is the technical contribution regarding directionally differentiable functions?
A: Hirano and Porter (2009) derived asymptotic optimality for treatment rules under fully differentiable welfare contrasts. This paper extends that theory to settings with directional (but not full) differentiability — a generic feature whenever bounds involve max/min operators or linear program values. The key technical building block is the asymptotic distribution of the quasi-posterior mean of directionally differentiable functions (Propositions 2 and 3 in Appendix C). While Kitagawa, Montiel Olea, Payne, and Velez (2020) characterized the asymptotic behavior of the posterior distribution of such functions, this paper instead characterizes the frequentist distribution of the posterior mean — a distinct and novel contribution to the literature on asymptotics for non-smooth functions (Dümbgen, 1993; Fang and Santos, 2019).
Q: What are the key scope conditions and limitations of the optimality results?
A: The action space D must be finite and discrete (continuous pricing must be approximated by a grid of whole-currency units, as noted in the introduction). The identified set mapping Θ0(·) must be known. Efficient estimation of µ is required, along with a consistent estimator of its asymptotic variance for quasi-Bayes implementation. The optimality criterion assumes “no first-order ties”: the oracle decision must be unique at µ0. The framework is asymptotic (local perturbations around a fixed µ0), and the theory is designed for settings where deriving exact finite-sample optimal rules is intractable. The results do not cover cases where θ itself affects the data distribution: partial identification enters only through the payoffs, and the identification of µ is unaffected.
Partially-identified parameter (θ): A structural parameter — such as the ATE in a target population — about which the data can establish only set membership θ ∈ Θ0(µ), not a point value. The identified set Θ0(µ) is indexed by the point-identified reduced-form parameter µ.
Oracle decision (δo(µ)): The infeasible first-best decision that minimizes maximum risk over the identified set Θ0(µ) for a known value of µ. It serves as the benchmark against which practical rules are evaluated; any data-dependent rule can only do weakly worse.
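Maximum risk (R(d, µ)): The supremum of risk r(d, θ, µ) = Eθ[l(d, Y, θ, µ)] over all θ ∈ Θ0(µ) conditional on µ. Under the regret criterion for binary treatment, R(0, µ) = (bU(µ))+ and R(1, µ) = −(bL(µ))−, where (x)+ = max{x, 0} and (x)− = min{x, 0}, the sign convention under which both maximum regrets are non-negative.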
Robust welfare contrast (b(µ)): In the treatment assignment application, b(µ) = (bU(µ))+ + (bL(µ))−, whose sign determines the oracle decision: treat if b(µ) ≥ 0. The optimal rule replaces b(µ) with its quasi-posterior mean b̄n.
Directional differentiability: A function f : M → R^k is directionally differentiable at µ0 if limits of (f(µ0 + tn hn) − f(µ0))/tn exist for all sequences tn ↓ 0 and hn → h, yielding a directional derivative ḟµ0[·] that is positively homogeneous but not necessarily linear. Max/min operators and linear program value functions are generically only directionally differentiable, not fully differentiable. This property is what causes plug-in rules to fail.
Quasi-posterior: In semiparametric models, a posterior-like distribution for µ formed by combining a limited-information Gaussian quasi-likelihood N(µ̂, (nÎ)^{−1}) with a prior π, yielding πn(µ|Xn) ∝ exp(−½ (µ − µ̂)^T (nÎ) (µ − µ̂)) π(µ). Used in place of a full Bayesian posterior when the exact likelihood of the data-generating process is unavailable.
σ-optimality: An optimality concept that replaces the improper Lebesgue prior on local perturbations h ∈ R^K with a sequence of proper priors indexed by σ, used when the average excess risk criterion is infinite for K > 1. Theorem 3 establishes that the σ-optimal decision rule converges to the Bayes rule as σ → ∞.
Plug-in rule (δplug_n): A decision rule formed by substituting an efficient point estimate µ̂ directly into the oracle decision: δplug_n = δo(µ̂). Optimal when R(d, µ) is fully differentiable (Corollary 1), but generically sub-optimal under partial identification because directional differentiability of R(d, µ) breaks the asymptotic equivalence between the plug-in and Bayes rules.
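The regret expressions in the maximum risk and robust welfare contrast entries can be checked with a short derivation. It is a sketch under two assumptions that the entries leave implicit: the identified set for the ATE is the interval [bL(µ), bU(µ)], and regret is measured in units of the ATE θ, with (x)+ = max{x, 0} and (x)− = min{x, 0}.

```latex
% Assumes \Theta_0(\mu) = [b_L(\mu), b_U(\mu)] and regret measured in ATE units.
\begin{align*}
\text{regret of } d = 0 \text{ at } \theta &= (\theta)_+ ,
  \qquad R(0,\mu) = \sup_{\theta \in \Theta_0(\mu)} (\theta)_+ = (b_U(\mu))_+ , \\
\text{regret of } d = 1 \text{ at } \theta &= (-\theta)_+ = -(\theta)_- ,
  \qquad R(1,\mu) = \sup_{\theta \in \Theta_0(\mu)} -(\theta)_- = -(b_L(\mu))_- , \\
R(1,\mu) \le R(0,\mu)
  &\iff b(\mu) = (b_U(\mu))_+ + (b_L(\mu))_- \ge 0 .
\end{align*}
```

The last line recovers the stated oracle rule: treat exactly when the robust welfare contrast b(µ) is non-negative.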