D81 | Macro Paper Warehouse

Catastrophes, Delays, and Learning

Mon, 01 Jan 0001 00:00:00 +0000

This paper develops a general model of experimentation under catastrophe risk in which the catastrophe is triggered when a stock variable exceeds an unknown threshold, but occurs only after a stochastic delay. The central contribution is the concept of the “legacy of the past”: at any planning date, past experiments may have already triggered a catastrophe that has not yet materialized, and the planner cannot observe whether triggering has occurred. The legacy is formally defined as the probability, conditional on survival, that a catastrophe was triggered in the past.

The model unifies two canonical but previously incompatible approaches in the literature. In the hazard-rate approach, the catastrophe is bound to happen and the planner manages its timing and severity. In the unknown-threshold approach, learning is instantaneous and the catastrophe is certainly avoided if the stock has not yet exceeded the threshold. Neither approach captures the intermediate case where the planner remains uncertain about whether the catastrophe is already underway. By introducing a delay governed by an exponential distribution with parameter α, the authors show that both approaches are limiting special cases: as α → ∞ (no delay), the legacy vanishes and the unknown-threshold approach is recovered; when the legacy is set permanently to one (catastrophe triggered with certainty), the hazard-rate approach is recovered.

Three benchmark stock levels anchor the analysis. QN is the long-run target absent any catastrophe risk. QD (“Damages”) is the optimal stabilization target when the planner knows a catastrophe was triggered in the past — it lies weakly below QN because the planner trades off current gains against the discounted marginal damage from raising the stock at the moment of eventual catastrophe occurrence. QE (“Experimentation”) is the stock level below which stabilization is suboptimal when the planner is certain no triggering has occurred — it also lies weakly below QN.

The paper’s two main theorems are distinguished by the ranking of QD and QE, which reflects whether mitigation strategies are effective.

Theorem 1 (QE < QD): When damage is not highly sensitive to the stock level at catastrophe time — so mitigation is relatively ineffective — optimal paths are monotonically increasing and converge to a long-run stock level Q∞ ∈ [QE, QD]. The stopping condition equates the marginal benefit of experimentation to a weighted average of the expected cost under the unknown-threshold approach (weight 1 − π) and the marginal damage under the hazard-rate approach (weight π), where π is the legacy at stopping time. A higher legacy at the stopping time is associated with a higher long-run stock level. A higher initial legacy induces fatalism: since the catastrophe is more likely already triggered, the planner shifts priority toward current consumption rather than caution, leading to more total experimentation.

Theorem 2 (QD < QE): When damage is highly sensitive to the stock level — so mitigation is valuable — the long-run target is uniquely QE regardless of the initial legacy. However, the short-run path is non-monotonic: for a sufficiently high initial legacy, the planner first reduces the stock sharply (lockdown, emissions cut) to mitigate pending catastrophe damages, then, as the legacy declines because no catastrophe occurs, gradually allows the stock to rise back toward QE. The direction of caution reverses relative to Theorem 1: a higher legacy now induces more caution, not less.

Applications include pandemic management (stock = infected population, catastrophe = health system collapse) and climate change (stock = cumulative CO2 emissions or atmospheric pollution stock). In the disease control application, whether a planner prioritizes economic production or mortality reduction determines which theorem governs, with the key ratio being production losses relative to mortality increases. For pandemic policy, Theorem 2 produces a formal learning-based rationale for non-monotonic “hammer-and-dance” policies (strict early lockdown followed by relaxation) that differs from prior explanations in the literature. In the carbon budget application, Proposition 5 proves that a higher initial legacy raises the optimal carbon budget under Theorem 1 conditions, and that above a critical legacy threshold π* the planner consumes at the maximal rate forever, so the catastrophe is triggered with certainty. Under Theorem 2 conditions (Proposition 6), the optimal policy can involve first reducing then expanding the stock before stabilizing, with both transition dates increasing in the initial legacy.

Q: What is the “legacy of the past” and how is it computed? A: The legacy πt is the probability, conditional on survival to date t, that a catastrophe was already triggered by past experiments. Formally, πt = 1 − [1 − F(Qt)] / pt, where Qt is the highest stock level reached up to date t, F is the prior distribution over the threshold, and pt is the survival probability. A past experiment at time t’ contributes to the current legacy with weight exp[−α(t − t’)], so recent experiments matter more than distant ones. As time passes without a catastrophe, the contribution of any fixed past experiment decays exponentially at rate α.
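The decay of the legacy can be illustrated with a minimal Python sketch. It assumes the record stock is held fixed from date 0 onward (no new experimentation), so that the law of motion ṗt = α[1 − F(Qt) − pt] stated in this summary has a closed-form solution. The function name and all parameter values are hypothetical and chosen only for illustration.

```python
import math

def legacy_path(pi0, F_Q, alpha, t):
    """Survival probability and legacy at time t when the record stock stays fixed.

    Solves dp/dt = alpha * (1 - F_Q - p) in closed form, then applies
    pi_t = 1 - (1 - F_Q) / p_t. Requires pi0 < 1.
    """
    no_trigger = 1.0 - F_Q             # prior probability the threshold was not crossed
    p0 = no_trigger / (1.0 - pi0)      # survival probability implied by the initial legacy
    p_t = no_trigger + (p0 - no_trigger) * math.exp(-alpha * t)
    return p_t, 1.0 - no_trigger / p_t

if __name__ == "__main__":
    # Hypothetical values: initial legacy 0.5, F(Q) = 0.6, delay parameter 0.5.
    for t in (0.0, 1.0, 5.0, 20.0):
        p_t, pi_t = legacy_path(pi0=0.5, F_Q=0.6, alpha=0.5, t=t)
        print(f"t={t:5.1f}  survival={p_t:.4f}  legacy={pi_t:.4f}")
```

With the stock frozen, the legacy falls toward zero as survival continues, which is the learning effect that drives the relaxation phase described under Theorem 2.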

Q: How do the three benchmark stock levels QN, QD, and QE relate to each other? A: QN is the optimal long-run stock absent any catastrophe risk. QD is defined by the condition that the marginal net benefit of increasing the stock, ν(Q) − [α/(α+δ)]D’(Q), equals zero, and satisfies QD ≤ QN. QE is defined by ν(Q) − [α/(α+δ)]ρ(Q)D(Q) = 0, and also satisfies QE ≤ QN. The ranking between QD and QE depends on whether damage is more sensitive to the marginal increase in stock at catastrophe time (which pushes QD below QE) or to the level of the stock at triggering (which pulls QD above QE).

Q: What is the key optimality condition in Theorem 1 and how does it unify prior approaches? A: The stopping condition (equation 15) states: ν(QT) = [α/(α+δ)] × [(1 − πT)ρ(QT)D(QT) + πT D’(QT)]. When πT = 0 (no legacy, unknown-threshold limit), this reduces to the experimentation stopping condition of Tsur and Zemel, governed by the hazard rate ρ(QT) times expected loss D(QT). When πT = 1 (full legacy, hazard-rate limit), it reduces to the damage-mitigation condition governed by marginal damage D’(QT). The legacy at stopping time thus serves as the mixing weight between the two canonical approaches, embedding both as special cases.
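As a rough numerical illustration of how the stopping condition interpolates between the two benchmarks, the sketch below solves ν(Q) = [α/(α+δ)] × [(1 − π)ρ(Q)D(Q) + π D’(Q)] by bisection. The functional forms (linear ν, linear D, constant hazard ρ) and every parameter value are assumptions made for this example, not taken from the paper. The legacy at stopping time is also treated as given, whereas in the model it is determined jointly with the stopping date.

```python
def stopping_stock(pi_T, alpha=1.0, delta=1.0, r=2.0, d0=1.0, d1=0.2):
    """Stock level solving the Theorem 1 stopping condition for a given legacy pi_T.

    Hypothetical primitives: nu(Q) = 1 - Q, D(Q) = d0 + d1*Q, rho(Q) = r.
    """
    k = alpha / (alpha + delta)

    def gap(Q):
        # Marginal benefit of experimentation minus the legacy-weighted marginal cost.
        cost = (1.0 - pi_T) * r * (d0 + d1 * Q) + pi_T * d1
        return (1.0 - Q) - k * cost

    lo, hi = 0.0, 1.0          # gap is decreasing in Q, with gap(0) >= 0 > gap(1) here
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if gap(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    for pi_T in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(f"legacy={pi_T:.2f}  stopping stock={stopping_stock(pi_T):.4f}")
```

Under these assumed primitives the solution equals QE at π = 0 and QD at π = 1, with QE < QD, and rises with the legacy in between, in line with the fatalism result.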

Q: How does the initial legacy affect total experimentation under Theorem 1 versus Theorem 2? A: Under Theorem 1 (QE < QD), a higher initial legacy π0 leads to more total experimentation (higher Q∞), because the planner becomes fatalistic — since the catastrophe is more likely already triggered and mitigation is relatively ineffective, current consumption is prioritized. Proposition 5 formally proves this for the carbon budget application: the optimal stopping date T and optimal budget QT are nondecreasing in π0. Under Theorem 2 (QD < QE), a higher legacy triggers more caution in the short run (larger reduction in the stock during the mitigation phase), but the long-run target QE remains the same regardless of π0.

Q: What generates non-monotonic policies in Theorem 2, and what does this look like in the pandemic application? A: Non-monotonicity arises because the optimal response to a high legacy is first to reduce the stock sharply to limit catastrophe damages (since damage is sensitive to the stock level), and then, as time passes without catastrophe and the legacy declines, to allow the stock to recover. In the disease control application with high mortality weight, a complete lockdown is optimal in the first phase whenever the legacy is strictly positive. As the legacy declines, the lockdown is gradually relaxed, and eventually the infection level returns to its pre-lockdown level. Figures 3 and 4 show that a higher initial legacy (π0 = 0.1, 0.5, or 0.9) leads to a longer lockdown and slower recovery, though all paths converge to the same long-run infection level.

Q: How does the model’s disease control application determine which theorem governs? A: Lemma 2 states that if 1 / [1 + (Y(r+d) − Y*) / (wµdI^D)] < ρ(I^D), then I^E < I^D and Theorem 1 applies; otherwise I^E > I^D and Theorem 2 applies. The key ratio is (Y(r+d) − Y) / (wµ*d), the production loss relative to mortality increase. A planner who weights economic activity heavily (large production loss ratio) falls under Theorem 1 and tolerates rising infections; a planner who weights mortality heavily falls under Theorem 2 and imposes an initial lockdown.

Q: What is the carbon budget result under Theorem 1 (Proposition 5)? A: Under the condition u1 > [α/(α+δ)]v0 (marginal consumption value exceeds discounted marginal damage), Theorem 1 applies and there exists a critical legacy threshold π* such that: below π*, the planner consumes maximally (qt = q-bar) until a finite date T and then stops, with QE < QT < QD; above π*, the planner consumes maximally forever, triggering the catastrophe with certainty. The stopping date T and the optimal budget QT are nondecreasing functions of initial legacy π0, formally proving that higher past emissions (captured through legacy) justify higher future carbon budgets in this model.

Q: What is the carbon budget result under Theorem 2 (Proposition 6)? A: Under condition u1 < [α/(α+δ)]v0, QD < QE and Theorem 2 applies. Starting from Q0 above QE, if π0 is small enough (specifically u1 > π0[α/(α+δ)]v0), the optimal policy is to stabilize the stock forever at Q0. Otherwise, there exist two finite dates t1 < t2, both increasing in π0, such that the planner first reduces the stock at maximum rate (qt = q-bar-negative) for t < t1, then expands at maximum rate for t1 < t < t2, then stabilizes at Q0 forever. The optimal carbon budget is Q0 in all cases, showing that the long-run target is independent of legacy under Theorem 2.

Q: How does the model relate to the hazard-rate literature formally? A: Papers such as Nordhaus and others that use an exogenous hazard rate h(Qt) for catastrophe — yielding survival probability pt = p0 exp(−∫h(Qτ)dτ) — are shown to be equivalent to the special case where the catastrophe was triggered in the past (legacy = 1 permanently). Their formulation corresponds to assuming α is constant and the legacy is identically one, which reduces the law of motion for pt to pt = p0 exp(−αt). The key difference is that in the hazard-rate approach the planner can reduce the arrival rate by lowering the stock (h is increasing in Q), whereas in the authors’ model the delay parameter α is constant and policy affects only damages.

Q: What is the role of the exponential delay distribution assumption? A: The assumption that the delay τ follows an exponential distribution with parameter α is made for tractability. Under this assumption, the entire past trajectory of the stock (Qt)t≤0 can be summarized by just two state variables — the highest stock on record Q0-bar and the initial legacy π0 — because the exponential “memoryless” property means that the additional expected waiting time until catastrophe occurrence does not depend on how long the triggering has already been in effect. Without this assumption, the full chronicle of past experiments would be required as a state variable, making the problem intractable.

Q: What happens when the delay parameter α approaches zero or infinity? A: When α → ∞ (instantaneous catastrophe upon triggering), pt = 1 − F(Qt) and the legacy is identically zero, recovering the Tsur-Zemel unknown-threshold approach (Proposition 3). The optimal path converges to QE0 from below or stabilizes if already above QE0. When α → 0 (infinite delay, effectively no catastrophe), QE = QD = QN and the problem reduces to the simple stock-flow problem (Proposition 1), with the optimal path converging monotonically to QN.

Q: Does the model allow for damage mitigation after triggering but before occurrence? A: Yes, this is a key feature. The continuation payoff after catastrophe occurrence is V(QT) where QT is the stock level at the time of occurrence T, not at triggering time T(S). This means the planner can reduce the stock after triggering to lower damages — analogous to a skater turning back toward shore after the ice first cracks. The assumption that V depends on the stock at occurrence rather than at triggering or at the maximum historical level is what allows this mitigation channel and is explicitly noted as a modeling choice.

Legacy of the past (πt): The probability, conditional on survival to date t, that past experiments have already triggered a catastrophe. Formally πt = 1 − [1 − F(Qt)] / pt. Recent experiments contribute more to the legacy than distant ones, with contribution decaying at rate α. The legacy is zero when α → ∞ and is the central state variable bridging the paper’s two canonical extremes.

QE (“Experimentation” threshold): The stock level at which the net marginal gain from further experimentation, defined as ν(Q) − [α/(α+δ)]ρ(Q)D(Q), equals zero, under the assumption that no catastrophe has been triggered. Below QE, stabilization is suboptimal; above QE, the planner does not experiment further when the legacy is zero.

QD (“Damages” threshold): The stock level at which the net marginal benefit from holding the stock, defined as ν(Q) − [α/(α+δ)]D’(Q), equals zero, under the assumption that the catastrophe is known to have been triggered. QD ≤ QN and represents the optimal long-run target when the hazard-rate approach applies.

Marginal payoff ν(Q): Defined as uq(0, Q) + (1/δ)uQ(0, Q), it measures the net gain from marginally increasing the flow when the stock is stabilized at Q. It is strictly decreasing in Q under Assumption 1 and equals zero at QN.

Damage function D(Q): Defined as (1/δ)u(0, Q) − V(Q), it measures the welfare loss from catastrophe occurrence when the stock is Q at occurrence time, relative to permanent stabilization at Q. Assumed weakly positive and weakly increasing in Q.

Survival probability (pt): The probability, computed from prior beliefs F at the beginning of time, that the catastrophe has not yet occurred by date t. Its law of motion is ṗt = α[1 − F(Qt) − pt], driven solely by the delay parameter α and the current maximum stock Qt.

Fatalism (under Theorem 1): The policy implication that a higher legacy — meaning a higher probability the catastrophe is already triggered — leads the planner to increase the stock further and accept more experimentation, because mitigation is relatively ineffective (QE < QD) and current consumption must be enjoyed before the catastrophe arrives.

De Gustibus and Disputes about Reference Dependence

This paper examines whether heterogeneity in individual gain-loss attitudes — the degree to which people weigh losses more or less severely than equivalent gains — contaminates prior tests of expectations-based reference dependence (EBRD). The central question is: do prior experiments that appear to yield mixed or null evidence against EBRD actually reflect a failure of the expectations-based reference point, or instead reflect a methodological flaw — the implicit assumption that all individuals are uniformly loss averse?

All prior tests of EBRD models (e.g., Kőszegi and Rabin 2006, 2007) have proceeded under what the authors call “universal loss aversion,” the assumption that every individual weighs losses more heavily than commensurate gains (λ > 1). The authors argue that this assumption — a form of the classic De Gustibus conjecture — is empirically incorrect and theoretically distorting: within EBRD designs, loss-averse and gain-seeking subjects are predicted to respond in opposite directions to expectations manipulations, so aggregating across them suppresses or reverses treatment effects.

The authors run two pre-registered laboratory experiments totaling 1,524 subjects. The labor supply experiment (N = 500, UC San Diego) uses a two-stage design. Stage 1 elicits each subject’s gain-loss attitude parameter λ_i from their effort responses to fixed versus uncertain piece rates in a real-effort transcription task, exploiting the prediction that loss-averse workers reduce effort under wage uncertainty while gain-seeking workers increase it. Stage 2 manipulates expectations by varying the probability of a high outside payment (p = 0.05 in Condition Low vs. p = 0.45 in Condition High), holding the piece-rate probability constant at 50%; under EBRD, this shifts the reference point and should change effort in a direction governed by λ_i.

The exchange experiment (N = 1,024, University of Bonn, with a pre-registered 2018 replication of N = 417) uses Stage 1 preference statements over randomly endowed objects to estimate λ_i, and Stage 2 manipulates expectations via a 0% vs. 50% probability of forced exchange. Under EBRD, loss-averse subjects should become more willing to exchange in the High condition; gain-seeking subjects should become less willing.

Both experiments document substantial heterogeneity in gain-loss attitudes. In the labor supply study, 70.6% of subjects exhibit loss aversion (λ̂ > 1) and 29.4% exhibit gain-seeking (λ̂ < 1), with a mean structural estimate of λ̂ = 1.65 and a median of 1.66. In the exchange study, 76% are loss averse and 24% are gain-seeking, with mean λ̂ = 1.49 and median 1.34. Lottery-based elicitation in the labor supply experiment yields 28% gain-seeking, close to the weighted average of roughly 22% gain-seeking reported by Chapman et al. (2018) for prior lottery-choice studies.

Crucially, Stage 1 gain-loss attitudes are strongly predictive of Stage 2 treatment effects in both experiments. In the labor supply study, the aggregate treatment effect of approximately 26% greater effort in Condition High — reproducing Abeler et al. (2011) — masks strongly heterogeneous responses: higher λ̂ predicts larger positive treatment effects (raw correlation ρ = 0.18, p < 0.01), and controlling for heterogeneous gain-loss attitudes raises R² by more than a factor of 10. In the exchange study, the aggregate treatment effect is precisely zero (coefficient = 0.00, clustered s.e. = 0.03), a result that prior literature would interpret as contradicting EBRD; but once gain-loss heterogeneity is accounted for, treatment effects are strongly positive for loss-averse subjects and negative for gain-seeking subjects, again raising R² by more than a factor of 10.

Gain-seeking subjects exhibit negative treatment effects in the exchange study, consistent with EBRD predictions, but in the labor supply study the average treatment effect for gain-seeking subjects remains slightly positive, representing a partial deviation from the model’s quantitative predictions. The authors interpret this as evidence that expectations-based reference points are an important but likely incomplete determinant of behavior, with attention-based, status-quo-based, or anchoring-based reference points potentially playing supplementary roles.

Q: What is the central methodological problem with prior tests of expectations-based reference dependence?

A: All prior tests assumed universal loss aversion — that every individual has λ > 1, i.e., weighs losses more severely than equivalent gains. The authors show this is both empirically wrong (roughly 24–29% of subjects are gain-seeking across both studies) and theoretically distorting: within EBRD designs, gain-seeking individuals are predicted to respond in the opposite direction from loss-averse individuals, so averaging across heterogeneous types can suppress, zero out, or even reverse the true treatment effect. This makes standard aggregate tests of EBRD unreliable.

Q: How do the authors measure gain-loss attitudes in the labor supply experiment?

A: In Stage 1, subjects make 30 effort decisions across fixed piece rates and uncertain piece rates with the same mean. Under the Kőszegi-Rabin CPE model, a loss-averse individual reduces effort when the wage is uncertain (because outcomes can fall below the reference point), while a gain-seeking individual increases effort under uncertainty. The authors estimate individual-level parameters by regressing log(e_i + 10) on log(w) and Δw/w in a random-coefficients framework; the coefficient l̂_i on Δw/w is the reduced-form measure of gain-loss attitudes, with λ̂_i = 1 + 4·(l̂_i/ĝ_i) as the structural estimate. The correlation between the two measures is ρ = 0.85 (p < 0.01).

Q: How do the authors measure gain-loss attitudes in the exchange experiment?

A: In Stage 1, subjects are randomly endowed with one of two objects and provide three unincentivized preference statements (relative liking, relative wanting, and hypothetical choice) before any possibility of exchange is introduced. Under CPE, an individual endowed with object X will prefer X to the extent that (1 + λ_i) − 2(Y/X) > 0, so subjects with higher λ_i should more strongly favor their endowment. A principal components analysis reduces the three statements to one factor (capturing ~70% of variation), and residuals from regressing that factor on object assignment constitute the reduced-form measure l̂_i. The structural estimate λ̂_i is obtained via a mixed logit using a log-normal distribution for λ_i; the reduced form and structural measures are correlated at r = 0.95 (p < 0.01).

Q: What does the distribution of gain-loss attitudes look like across the two experiments?

A: In the labor supply experiment (N = 453 estimable subjects), 70.6% are loss averse and 29.4% are gain-seeking, with mean λ̂ = 1.65 and median λ̂ = 1.66. In the exchange experiment (N = 1,024), 76% are loss averse and 24% are gain-seeking, with mean λ̂ = 1.49 and median λ̂ = 1.34. A separate lottery-based elicitation in the labor supply study finds 28% gain-seeking subjects. These proportions are consistent with the weighted average of 22% gain-seeking found by Chapman et al. (2018) across seven prior lottery-choice studies.

Q: What is the aggregate treatment effect in the labor supply experiment, and what does it look like once heterogeneity is accounted for?

A: Without accounting for gain-loss heterogeneity, Condition High is associated with roughly a 26% increase in effort relative to Condition Low (individual-clustered s.e. = 0.03, p < 0.01), reproducing the Abeler et al. (2011) result and consistent with EBRD under universal loss aversion. However, R² = 0.03. Once interactions of Condition High with l̂_i and λ̂_i are included, R² rises to 0.40 and 0.39 respectively — more than a tenfold increase. Higher λ̂_i predicts larger positive treatment effects (raw correlation ρ = 0.18, p < 0.01), and the interaction of Condition High with λ̂_i is highly significant (F(1,452) = 49.14, p < 0.01).

Q: What is the aggregate treatment effect in the exchange experiment, and what does it look like once heterogeneity is accounted for?

A: Without heterogeneity, the treatment effect of Condition High on the probability of exchanging is precisely 0.00 (clustered s.e. = 0.03), which prior literature would read as a failure of EBRD. Once heterogeneity is introduced via interactions with l̂_i and λ̂_i, the pattern changes markedly: loss-averse subjects show positive treatment effects (greater willingness to exchange in High), while gain-seeking subjects show negative treatment effects (less willingness to exchange in High), consistent with Predictions 4–6. R² again rises by more than a factor of 10. In Condition Low, 38% of subjects exchange, reflecting a significant endowment effect (F(1,1022) = 25.66, p < 0.01).

Q: Why does the aggregate treatment effect in the exchange experiment equal zero?

A: The authors show in Appendix B.4 that the relationship between λ_i and exchange probability treatment effects can be concave — negative effects for gain-seeking subjects can be of greater absolute magnitude than positive effects for loss-averse subjects. With roughly 24% gain-seeking and 76% loss-averse subjects, aggregation can yield a near-zero average even when heterogeneous effects are substantial and directionally consistent with EBRD. This aggregation problem, not a failure of the expectations-based reference point mechanism, explains the null aggregate result.
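The arithmetic behind this cancellation can be sketched in a few lines of Python. The population shares are the ones reported for the exchange experiment. The two type-specific treatment effects are hypothetical numbers, picked only so that the smaller group has the larger effect in absolute value. They are not estimates from the paper.

```python
def average_treatment_effect(shares, effects):
    """Population-weighted average of type-specific treatment effects."""
    return sum(shares[k] * effects[k] for k in shares)

# Shares reported in the exchange experiment.
shares = {"loss_averse": 0.76, "gain_seeking": 0.24}

# HYPOTHETICAL effects on the probability of exchanging (not from the paper):
# a modest positive effect for the majority, a larger negative effect for the minority.
effects = {"loss_averse": 0.06, "gain_seeking": -0.19}

if __name__ == "__main__":
    ate = average_treatment_effect(shares, effects)
    print(f"aggregate effect: {ate:+.4f}")
    for k in shares:
        print(f"  {k}: share={shares[k]:.2f}, effect={effects[k]:+.2f}")
```

An aggregate effect near zero is therefore compatible with sizeable, oppositely signed effects within each type, which is why a pooled regression alone cannot distinguish "no EBRD" from "EBRD with heterogeneous λ".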

Q: Do gain-loss attitudes measured in one domain predict behavior in another domain?

A: The lottery-based measure of gain-loss attitudes (from Multiple Price Lists administered after the real-effort task in the labor supply experiment) has mean λ̂ = 1.48 and median 1.42, with 28% gain-seeking subjects — proportions similar to the labor supply estimates. However, the correlation between the lottery-based and labor-supply-based structural estimates of λ̂ is only Pearson’s r = 0.091 (p = 0.03) and Spearman’s ρ = 0.084 (p = 0.075). Furthermore, the lottery measure has no predictive power for Stage 2 treatment effects. This suggests that while the prevalence of gain-seeking is similar across domains, gain-loss attitudes at the individual level are more domain-specific than prior work has appreciated.

Q: How do the authors address the “generated regressor problem” when using estimated λ̂_i as a regressor?

A: Since λ̂_i is itself estimated from Stage 1 data, using it directly as a regressor in Stage 2 regressions treats imprecise preference estimates as ideal data, which can distort inference (the Murphy-Topel problem). The authors address this by bootstrapping the entire pipeline — re-estimating gain-loss attitudes from Stage 1 in each of 500 bootstrap iterations and re-running the Stage 2 regressions — then reporting the average bootstrap coefficient and its standard deviation. The bootstrapped conclusions are qualitatively identical to the original regression results in both experiments.
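A minimal sketch of a full-pipeline bootstrap of this kind is given below, using synthetic data and the Python standard library only. Everything in it is an assumption for illustration: the data-generating process, the use of a simple per-subject mean as the Stage 1 estimator (a stand-in for the paper's random-coefficients and mixed-logit estimators), and the one-regressor Stage 2 specification. It is meant only to show the mechanics of re-estimating the generated regressor inside every bootstrap draw.

```python
import random
import statistics

def ols_slope(x, y):
    """Slope from a one-regressor OLS with intercept."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx

def estimate_attitude(stage1_obs):
    """Stand-in Stage 1 estimator: mean of noisy attitude measurements."""
    return statistics.fmean(stage1_obs)

def stage2_coefficient(subjects):
    """Regress the outcome on treatment interacted with (lambda_hat - 1)."""
    x = [s["treated"] * (estimate_attitude(s["stage1"]) - 1.0) for s in subjects]
    y = [s["outcome"] for s in subjects]
    return ols_slope(x, y)

def bootstrap_pipeline(subjects, n_boot=500, seed=0):
    """Resample subjects and their Stage 1 data, re-estimate, re-run Stage 2."""
    rng = random.Random(seed)
    coefs = []
    for _ in range(n_boot):
        sample = []
        for s in rng.choices(subjects, k=len(subjects)):
            redrawn = rng.choices(s["stage1"], k=len(s["stage1"]))
            sample.append({**s, "stage1": redrawn})
        coefs.append(stage2_coefficient(sample))
    return statistics.fmean(coefs), statistics.stdev(coefs)

def make_synthetic_subjects(n=200, seed=1):
    """Hypothetical data: true interaction coefficient is 0.5."""
    rng = random.Random(seed)
    subjects = []
    for _ in range(n):
        lam = rng.lognormvariate(0.3, 0.4)
        treated = rng.randint(0, 1)
        stage1 = [lam + rng.gauss(0, 0.3) for _ in range(10)]
        outcome = 0.5 * treated * (lam - 1.0) + rng.gauss(0, 0.2)
        subjects.append({"stage1": stage1, "treated": treated, "outcome": outcome})
    return subjects

if __name__ == "__main__":
    mean_coef, sd_coef = bootstrap_pipeline(make_synthetic_subjects())
    print(f"bootstrap mean = {mean_coef:.3f}, bootstrap sd = {sd_coef:.3f}")
```

The bootstrap standard deviation reflects both Stage 2 sampling noise and the imprecision of the Stage 1 estimates, which is the point of bootstrapping the whole pipeline rather than only the final regression.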

Q: What limitations do the authors acknowledge in the EBRD model’s fit?

A: Even after accounting for heterogeneity, the EBRD model does not provide a complete quantitative account of behavior. In the labor supply experiment, gain-seeking subjects exhibit slightly positive average treatment effects (not negative as predicted), and loss-averse subjects’ empirical treatment effects fall short of theoretical predictions, despite a significant correlation between predicted and empirical treatment effects (ρ = 0.25, p < 0.01). The authors attribute these deviations to potential measurement error (which would attenuate estimated relationships), and to the possibility that reference points have multiple determinants — including status quo-based, attention-based, and anchoring-based factors — beyond expectations alone.

Q: What are the broader implications for other applications of gain-loss attitudes?

A: The paper’s findings have implications for any application that relies on universal loss aversion as a maintained assumption, including Rabin’s (2000) calibration argument for risk aversion at small and large stakes, insurance demand for small losses (Slovic et al., 1977), and preferences for bunched resolution of uncertainty (Kőszegi and Rabin, 2009). Admitting heterogeneity in gain-loss attitudes will require more nuanced predictions in each of these settings. The paper provides a methodology — measuring individual-level gain-loss attitudes within the experimental context of interest — for investigating and controlling for such heterogeneity.

Q: What design features prevent confounds between Stage 1 measurement and Stage 2 treatment in the exchange experiment?

A: Stage 1 uses a different pair of objects (USB stick and pens) than Stage 2 (picnic mat and thermos), or vice versa — each subject encounters each pair exactly once, with counterbalancing at the session level. Stage 1 preference statements are unincentivized and made before any possibility of exchange is introduced, so they do not contaminate the Stage 2 expectations manipulation. The random reassignment of objects at the end of Stage 1 generates exogenous variation in endowments, preventing mechanical confounds. The authors also verify that interpreting Stage 1 variation as reflecting heterogeneity in object valuations (rather than gain-loss attitudes) would predict zero heterogeneous treatment effects in Stage 2 — a prediction rejected by the data.

Expectations-Based Reference Dependence (EBRD): The formulation, due to Kőszegi and Rabin (2006, 2007), in which an individual’s reference point is the entire distribution of outcomes they rationally expected, rather than a fixed status quo. Behavior is governed by a Choice-Acclimating Personal Equilibrium (CPE) in which the chosen action is optimal given that the expectation of that action serves as the reference.

Gain-Loss Attitudes (λ_i): The individual-specific parameter governing how outcomes above versus below the reference point affect utility. Under piecewise-linear gain-loss utility, an outcome that falls short of the reference by z reduces utility by η·λ_i·z, while an outcome above it raises utility by η·z. Loss aversion is λ_i > 1; gain-seeking is λ_i < 1; loss neutrality is λ_i = 1. In this paper, λ_i is treated as heterogeneous across individuals rather than assumed uniform.

Universal Loss Aversion: The implicit homogeneity assumption maintained in all prior tests of EBRD — that every individual has λ > 1. The authors characterize this as a form of the De Gustibus Non Est Disputandum conjecture applied to gain-loss attitudes, and document that it fails empirically in both experimental settings.

Choice-Acclimating Personal Equilibrium (CPE): The rational expectations equilibrium concept from Kőszegi and Rabin (2006, 2007) used throughout the paper to derive comparative statics. A choice is a CPE if its expected utility given its own expectation as the reference exceeds the expected utility of any alternative given that alternative’s expectation as the reference.

Reduced-Form Gain-Loss Measure (l̂_i): In the labor supply context, the individual-level OLS coefficient on Δw/w in a log-effort regression — capturing how strongly a subject reduces (or increases) effort under wage uncertainty relative to a fixed wage of equal mean. A positive l̂_i identifies loss aversion; negative identifies gain-seeking. In the exchange context, the analogous measure is the residual from regressing the first principal component of Stage 1 preference statements on object assignment.

Aggregation Problem: The paper’s central methodological contribution. When gain-loss attitudes are heterogeneous and the EBRD treatment effect is non-linear in λ_i, the average treatment effect across a heterogeneous population need not equal the treatment effect at the average λ. In the exchange experiment, the aggregate treatment effect is precisely zero even though loss-averse and gain-seeking subjects each respond in the theoretically predicted (opposite) direction, because the concave relationship between λ_i and the exchange probability treatment effect makes the negative effects of the gain-seeking minority large enough to offset the positive effects of the loss-averse majority.

Dynamic Concern for Misspecification

Layer 1 — Overview

Research Question

This paper asks how an agent who fears that none of their probabilistic models is the correct description of the data-generating process (DGP) should update that fear as evidence accumulates, and what long-run behavior such an agent exhibits. The central contribution is making the concern for misspecification endogenous: the better the agent’s structured models explain past observations, the less concerned the agent becomes.

Decision Criterion

The agent posits a finite-dimensional parametric set of structured models Θ, holds a prior µ over Θ, and evaluates each action according to an average robust control criterion. This criterion takes a weighted average (over models) of robust control assessments, where each assessment penalizes expected utility for probability distributions that deviate from the structured model in terms of relative entropy, scaled by a misspecification concern parameter λ > 0. A standard subjective expected utility maximizer is the limiting case as λ → 0 (no concern), and a maxmin agent is approached as λ → ∞.
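A small Python sketch can make the two limits concrete. It assumes the robust control assessment takes the standard entropy-penalized (multiplier-preferences) form, whose closed form for a model q and concern λ is −(1/λ) log E_q[exp(−λ u)]. That functional form, the function names, and the numbers are assumptions of this example. The summary above does not spell out the exact expression.

```python
import math

def robust_value(utilities, probs, lam):
    """Entropy-penalized worst-case value: -(1/lam) * log E_q[exp(-lam * u)]."""
    # Log-sum-exp keeps the computation stable for large lam.
    terms = [-lam * u + math.log(p) for u, p in zip(utilities, probs) if p > 0]
    m = max(terms)
    return -(m + math.log(sum(math.exp(t - m) for t in terms))) / lam

def average_robust_control(utilities, models, prior, lam):
    """Prior-weighted average of the robust assessments across structured models."""
    return sum(w * robust_value(utilities, q, lam) for w, q in zip(prior, models))

if __name__ == "__main__":
    utilities = [0.0, 1.0, 2.0]                       # hypothetical payoffs of one action
    models = [[0.2, 0.5, 0.3], [0.6, 0.3, 0.1]]       # two hypothetical structured models
    prior = [0.5, 0.5]
    for lam in (1e-6, 1.0, 10.0, 1000.0):
        value = average_robust_control(utilities, models, prior, lam)
        print(f"lambda={lam:>8}: value={value:.4f}")
```

For very small λ the value is close to the prior-weighted expected utility, and for very large λ it approaches the worst payoff in the support, matching the SEU and maxmin limits described above.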

Endogenous Misspecification Concern

The concern parameter λ is updated each period as a function of the likelihood ratio test (LRT) statistic of the structured models against unstructured alternatives, scaled by a time-normalizing sequence βₜ: λ(hₜ) = LRT(hₜ, Θ) / (2βₜ). The sequence βₜ determines how demanding the agent is in evaluating model fit.

Taxonomy of Agent Types

Three types emerge based on the speed of βₜ:

Statistician type (βₜ = ct, linear): applies a time scaling that keeps the LRT asymptotically informative about the degree of misspecification. This is the unique type satisfying both safety (long-run average payoff at least ε-close to the maxmin guarantee, almost surely) and consistency under almost correct specification (no ε-regret when misspecification is small).
Lenient type (t = o(βₜ)): attributes unexplained evidence to sampling variability; corresponds to the Law of Large Numbers intuition.
Demanding type (βₜ = o(t)): overly penalizes small discrepancies, analogous to the Law of Small Numbers fallacy (Tversky and Kahneman, 1971).
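
The taxonomy can be illustrated with a minimal Python sketch. It assumes, purely for illustration, i.i.d. outcomes on a finite set under a single action and a finite list of structured models; under those assumptions the LRT statistic equals 2t times the smallest relative entropy between the empirical frequency and a structured model. The function names, the two example models, and the scalings βₜ = t², βₜ = t, and βₜ = √t are hypothetical choices, not taken from the paper.

```python
import math

def kl(p, q):
    """Relative entropy R(p || q) on a finite outcome set (natural logs)."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            total += pi * math.log(pi / qi)
    return total

def lrt(counts, models):
    """LRT statistic of the structured models against the unrestricted alternative.

    The unrestricted maximum-likelihood estimate is the empirical frequency,
    so the statistic equals 2 * t * min_theta R(empirical || q_theta).
    """
    t = sum(counts)
    empirical = [n / t for n in counts]
    return 2 * t * min(kl(empirical, q) for q in models)

def concern(counts, models, beta_t):
    """Endogenous concern lambda(h_t) = LRT(h_t, Theta) / (2 * beta_t)."""
    return lrt(counts, models) / (2 * beta_t)

# Hypothetical structured models; the data keep a 60/40 frequency that
# neither model matches, so the agent is misspecified.
models = [[0.5, 0.5], [0.7, 0.3]]
results = {}
for t in (100, 10_000, 1_000_000):
    counts = [6 * t // 10, 4 * t // 10]
    results[t] = {
        "lenient": concern(counts, models, beta_t=t ** 2),           # t = o(beta_t)
        "statistician": concern(counts, models, beta_t=t),           # beta_t = c*t, c = 1
        "demanding": concern(counts, models, beta_t=math.sqrt(t)),   # beta_t = o(t)
    }
    print(t, results[t])
```

In this sketch the statistician's concern stays at the minimum relative entropy divided by c, the lenient type's concern shrinks toward zero, and the demanding type's concern grows without bound as t increases.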

Standard SEU maximization fails safety; robust control with an invariant λ (Hansen and Sargent, 2001; 2022) fails consistency under almost correct specification.

Long-Run Convergence Results (Theorem 1)

For a misspecified agent (no θ ∈ Θ with qθ_{a*} = p*_{a*}), the nature of the limit action a* depends on the agent type:

Lenient type: a* is a Berk-Nash equilibrium — an SEU best reply to beliefs supported on the models with minimum relative entropy from the true DGP.
Demanding type: a* is a maxmin equilibrium — a worst-case best reply to all models absolutely continuous with respect to the true DGP.
Statistician type: if behavior converges, a* is a c-robust equilibrium, that is, a robust control best reply to beliefs on the relative entropy minimizers, with the concern for misspecification endogenously set at minθ R(p*_{a*} || qθ_{a*}) / c.

For a correctly specified agent (Proposition 2), every limit action is a self-confirming equilibrium, regardless of the agent type.

Cycles and Limit Frequency (Section 4, Theorem 2)

The statistician type’s behavior need not converge. In natural settings, the agent cycles between actions: playing a “safe” action whose consequences are well-explained by Θ reduces concern for misspecification, eventually leading to a riskier action whose poorly-explained consequences raise concern again, inducing a return to the safe action. The paper proves that every limit frequency (empirical distribution over actions) is a mixed c-robust equilibrium — a generalization that allows mixing while tying the concern for misspecification to the frequency-weighted average relative entropy of each action.

Empirical Applications

Monetary policy cycles (Sargent 1999, 2008): In a central bank model where the true DGP includes increased inflation variability under aggressive policy (a feature absent from the bank’s structured models), no pure c-robust equilibrium exists for small c. The model predicts persistent cycles between conservative and aggressive policy. The frequency of the conservative policy is decreasing in the strength of the exploitable inflation-unemployment trade-off (θ₁π + θ₁a), so a stronger trade-off means more time spent on the aggressive policy.
Labor supply under complex tax schedules (Rees-Jones and Taubinsky, 2020): Agents with a “schmeduling” heuristic (linearizing the tax schedule) are misspecified. Berk-Nash equilibrium predicts these agents exert excess effort, with the bias increasing in the complexity (convexity) of the tax code. The c-robust equilibrium attenuates this bias: conditional on the equilibrium, minθ R(p*_a || qθ_a) > 0, so agents maintain positive concern for misspecification and pull back from the biased recommendation. The paper rationalizes the empirical finding that approximately 40% of agents hold the schmeduling belief but about 20% fewer act on it, which is consistent with endogenous concern reducing the behavioral impact of the biased model.

Axiomatization (Section 5)

The paper axiomatizes the static average robust control criterion (Theorem 3) using: a Variational Axiom (from Maccheroni, Marinacci, and Rustichini, 2006a), a Structured Savage axiom (Sure-Thing Principle for bets on the model identity), an Intramodel Sure-Thing Principle (STP for bets conditional on the model), and Uniform Misspecification Concern (the agent is equally concerned about misspecification regardless of which model is identified as best-fitting). Three additional dynamic axioms characterize preference evolution: Constant Preference Invariance (utility index stable over time), Dynamic Consistency over Models (Bayesian updating over structured models), and Q-Likelihood (misspecification concern increases in the LRT). A novel Asymptotic Frequentism axiom characterizes the statistician type: preferences must become arbitrarily similar (in a precise quantitative sense) after sufficiently long histories with the same outcome frequency.

Layer 2 — Q&A

Q1: What is the average robust control criterion and how does it generalize prior decision criteria?

A: An agent evaluates action a by averaging over structured models θ a robust control assessment: for each θ, minimize expected utility over probability distributions within relative entropy distance (penalized by 1/λ) of qθ_a, then integrate over θ with prior µ. This nests SEU (λ → 0, perfect trust in models), standard robust control of Hansen and Sargent (2001) (µ is Dirac, single benchmark model), and maxmin expected utility of Gilboa and Schmeidler (λ → ∞). The key extension is allowing µ to be nondegenerate, so the agent is simultaneously uncertain about the best-fitting model and about whether any model is exact.

Q2: What is the role of the likelihood ratio test statistic in driving misspecification concern?

A: The LRT statistic compares the maximum likelihood of the structured models against the best unstructured alternative. It diverges almost surely when the agent is misspecified, regardless of how close the structured models are to the true DGP. The concern parameter λ(hₜ) = LRT(hₜ, Θ) / (2βₜ) uses a time-scaling sequence βₜ to keep this statistic interpretable. Without scaling, a misspecified agent’s concern would always explode to infinity.

Q3: Why does linear time scaling (βₜ = ct) uniquely characterize the statistician type as rational?

A: Proposition 1 establishes two properties: (1) ε-safety: every optimal policy under βₜ = ct achieves a long-run average payoff no more than ε below the maxmin guarantee, almost surely; (2) ε-consistency under almost correct specification: for DGPs sufficiently close to Θ, the agent avoids long-run regret. Part 2 of Proposition 1 shows that no βₜ with βₜ = o(t) or t = o(βₜ) satisfies both properties simultaneously. SEU fails safety; invariant-λ robust control fails consistency.

Q4: What is a c-robust equilibrium and how does it differ from a Berk-Nash equilibrium?

A: A Berk-Nash equilibrium (Esponda and Pouzo, 2016) requires the action to be an SEU best reply to beliefs supported on the relative entropy minimizers of the true DGP. A c-robust equilibrium requires the same support condition but with the best reply taken under the average robust control criterion, where the concern for misspecification λ equals minθ R(p*_{a*} || qθ_{a*}) / c, that is, the minimum relative entropy scaled by 1/c. The endogenous λ is positive whenever the agent is misspecified, so the agent does not fully trust even the best-fitting model.

Q5: How does the paper explain that misspecified lenient types converge to Berk-Nash while demanding types converge to maxmin?

A: For the lenient type (t = o(βₜ)), the concern for misspecification converges to 0 because the LRT grows at most linearly in t while βₜ grows faster, so the agent effectively behaves as an SEU maximizer with beliefs on the KL-minimizing models, which is the Berk-Nash condition. For the demanding type (βₜ = o(t)), the LRT diverges relative to βₜ, so λ → ∞ and the agent’s preferences converge to worst-case evaluation over all models absolutely continuous with respect to the true DGP, which is the maxmin condition. These are Theorem 1, parts 1 and 2.

Q6: Why does the statistician type exhibit cycles rather than convergence?

A: Section 4 and Corollary 1 show in the monetary policy application that no pure c-robust equilibrium exists for small c. Intuitively, the conservative policy (a=0) is a best reply to a high misspecification concern, but it produces outcomes well-explained by Θ, which drives concern down. The aggressive policy (a=1) is a best reply to a low concern, but it generates increased inflation variability not captured in Θ, which drives concern up sharply. There is no fixed point that is self-sustaining, so the agent cycles. Theorem 2 shows that the empirical frequency of actions still converges to a mixed c-robust equilibrium.

Q7: What are the quantitative comparative statics for the monetary policy cycles?

A: Corollary 1 establishes that there exists a threshold c̄ > 0 such that for all c ≤ c̄: (1) no pure c-robust equilibrium exists; (2) a mixed c-robust equilibrium exists; and (3) in the maximal and minimal equilibria, the frequency of the conservative policy α*(0) is decreasing in θ₁π + θ₁a: a larger exploitable trade-off between inflation and unemployment implies more time spent on the aggressive policy.

Q8: How does the model rationalize the Rees-Jones and Taubinsky (2020) labor supply finding?

A: Rees-Jones and Taubinsky (2020) find that approximately 40% of agents have incentive-compatible beliefs consistent with the schmeduling heuristic (linearizing a convex tax schedule), but approximately 20% fewer agents act according to that heuristic. In a Berk-Nash equilibrium, the schmeduling agent exerts excess effort relative to the optimum; the more convex the tax code, the larger the excess. In a c-robust equilibrium, the agent retains a positive misspecification concern proportional to the deviation between the convex tax schedule and the linear approximation. Higher effort levels are more exposed to uncertainty in the marginal rate (the misspecified term θ+ε multiplies a higher average income z), so the concern for misspecification provides a natural force that reduces effort below the Berk-Nash prediction. The paper notes this finding is also consistent with an alternative interpretation in Rees-Jones and Taubinsky where all agents hold schmeduling beliefs but under-respond behaviorally.

Q9: What is the mixed c-robust equilibrium and why does it always exist?

A: A mixed c-robust equilibrium is a mixed action α* ∈ Δ(A) such that beliefs ν are supported on the relative entropy minimizers Θ(α*) — computed as the parameter minimizing the α*-weighted average relative entropy across actions — and every action in the support of α* is a best reply under the average robust control criterion with λ = minθ Σ_a α*(a) R(p*_a || qθ_a) / c. Proposition 3 proves existence by mapping this fixed-point condition to a Nash equilibrium in an auxiliary game between the agent and two adversarial Nature players, then invoking Reny (1999) on that game. A pure c-robust equilibrium need not exist, but mixing over actions allows the concern for misspecification to be calibrated to the frequency of poorly-explained actions.
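
A minimal Python sketch of the mixing logic follows. It assumes a stylized two-action environment that is not from the paper: a single structured model, a safe action whose outcomes the model explains perfectly (relative entropy zero), and a risky action with relative entropy `kl_risky`. Under those assumptions the concern is λ = α(risky) · kl_risky / c, and the mixed equilibrium frequency is the one that makes the agent indifferent. All names and numbers are hypothetical.

```python
import math

def robust_value(utils, q, lam):
    """min_p E_p[u] + (1/lam) R(p || q) via the closed form -(1/lam) log E_q[exp(-lam u)].

    lam == 0 returns plain expected utility (the SEU limit).
    """
    if lam == 0:
        return sum(w * u for u, w in zip(utils, q))
    shift = max(-lam * u for u, w in zip(utils, q) if w > 0)  # log-sum-exp shift
    s = sum(w * math.exp(-lam * u - shift) for u, w in zip(utils, q) if w > 0)
    return -(shift + math.log(s)) / lam

def mixed_risky_frequency(utils, q_safe, q_risky, kl_risky, c, iters=200):
    """Bisect on lam until safe and risky are equally good, then back out alpha(risky).

    Returns (alpha_risky, lam), or None when one action is preferred at every
    feasible lam, in which case no mixing is needed.
    """
    def gap(lam):
        return robust_value(utils, q_risky, lam) - robust_value(utils, q_safe, lam)

    lo, hi = 0.0, kl_risky / c  # lam ranges from alpha(risky) = 0 to alpha(risky) = 1
    if gap(lo) <= 0 or gap(hi) >= 0:
        return None
    for _ in range(iters):
        mid = (lo + hi) / 2
        if gap(mid) > 0:
            lo = mid
        else:
            hi = mid
    lam = (lo + hi) / 2
    return lam * c / kl_risky, lam

utils = [0.0, 1.0, 1.5]            # utility of each outcome
q_safe = [0.0, 1.0, 0.0]           # model under the safe action (also the truth)
q_risky = [0.2, 0.0, 0.8]          # model under the risky action
p_risky = [0.5, 0.0, 0.5]          # hypothetical true distribution under the risky action
kl_risky = sum(p * math.log(p / q) for p, q in zip(p_risky, q_risky) if p > 0)

alpha_risky, lam_star = mixed_risky_frequency(utils, q_safe, q_risky, kl_risky, c=0.1)
print(alpha_risky, lam_star)
```

In this example the risky action is played a fraction of the time strictly between zero and one: often enough to keep the concern at the level where the agent is indifferent, but not so often that the safe action becomes strictly better.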

Q10: How does Theorem 2 formally connect cycles to mixed c-robust equilibria?

A: Theorem 2 states that if βₜ = ct for all t and α* is a βₜ-limit frequency (i.e., the empirical action distribution converges to α* with positive probability under some optimal policy), then α* is a mixed c-robust equilibrium. The intuition is that when α* places weight on both a well-explained action and a poorly-explained action, the time-averaged relative entropy stabilizes at a fixed level, producing a stable endogenous concern for misspecification that makes the agent asymptotically indifferent between the actions in the support — sharply reducing the incentive to break the cycle.

Q11: What does the axiomatization contribute beyond the learning results?

A: The axiomatization (Section 5, Theorem 3) provides behavioral foundations observable from choices, without assuming the internal LRT mechanism. Two primary axioms pin down the average robust control criterion within the variational class: Structured Savage (Sure-Thing Principle for bets over model identity) and Uniform Misspecification Concern (equal concern for misspecification regardless of which model is revealed as best-fitting). Dynamic Consistency over Models pins down Bayesian updating. Q-Likelihood axiomatizes that the concern for misspecification is ordinally increasing in the LRT. The novel Asymptotic Frequentism axiom (Axiom 9) pins down the quantitative speed of adjustment: long histories with the same empirical frequency must induce asymptotically similar preferences, and Proposition 5 shows this implies that λ_{hₜ} / (LRT(hₜ, Q) / (2t)) converges to a finite limit, which is exactly the statistician type’s linear scaling.

Q12: What is the correlation between behavioral biases that the model predicts?

A: The paper derives three novel empirical predictions about the cross-sectional and time-series correlation of uncertainty attitudes: (1) long-run uncertainty aversion positively correlates with initial misspecification and with belief in the Law of Small Numbers; (2) these correlations are causal — repeated model failures and overly demanding evaluation induce a shift toward cautious behavior; (3) even holding misspecification and probability reasoning fixed, limit uncertainty attitudes are stochastic, depending on whether the limit action’s outcomes are well-explained by the structured models.

Q13: How does Example 2 (Correlation Neglect) show that endogenous concern can amplify rather than attenuate biases?

A: In a double auction, a buyer who mistakenly treats their own valuation and the ask price as independent (Correlation Neglect, Esponda, 2008) bids below the optimum in Berk-Nash equilibrium. In a c-robust equilibrium, the positive correlation between valuations and prices produces a strictly positive minθ R(p*_{a*} || qθ_{a*}), so the agent maintains misspecification concern. Since lower bids are accepted with lower probability (and thus are less sensitive to model misspecification), the endogenous concern drives the agent to bid even lower, amplifying the bias rather than attenuating it. This example illustrates that the direction of the correction depends on the geometry of how the misspecification interacts with the payoff structure.

Key Concepts

Average Robust Control Criterion: The decision criterion proposed in the paper. An agent evaluates action a by taking the expectation over structured models θ (with prior µ) of min_{p_a ∈ Δ(Y)} [E_{p_a}[u(a,y)] + (1/λ) R(p_a || qθ_a)]. This is a weighted average of robust control assessments, each penalizing distributions that deviate from a structured model in relative entropy. The parameter λ > 0 governs the intensity of misspecification concern, with SEU as the limit at λ → 0 and maxmin at λ → ∞.
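
The inner minimization in the criterion has a standard closed form: min over p of E_p[u] + (1/λ) R(p || q) equals −(1/λ) log E_q[exp(−λu)]. The Python sketch below uses that identity on a finite outcome set. The function names, the two example models, and the prior are illustrative assumptions, not objects from the paper.

```python
import math

def robust_value(utils, q, lam):
    """Robust control assessment of one structured model q.

    Uses the closed form -(1/lam) * log E_q[exp(-lam * u)] for
    min_p E_p[u] + (1/lam) * R(p || q). lam == 0 is treated as the SEU limit.
    """
    if lam == 0:
        return sum(w * u for u, w in zip(utils, q))
    shift = max(-lam * u for u, w in zip(utils, q) if w > 0)  # log-sum-exp shift
    s = sum(w * math.exp(-lam * u - shift) for u, w in zip(utils, q) if w > 0)
    return -(shift + math.log(s)) / lam

def average_robust_control(utils, models, prior, lam):
    """Prior-weighted average of the per-model robust control assessments."""
    return sum(mu * robust_value(utils, q, lam) for q, mu in zip(models, prior))

utils = [0.0, 1.0]                      # utility of each outcome for a fixed action
models = [[0.5, 0.5], [0.2, 0.8]]       # hypothetical structured models
prior = [0.5, 0.5]                      # hypothetical prior over the models

values = {lam: average_robust_control(utils, models, prior, lam)
          for lam in (1e-6, 1.0, 50.0)}
print(values)
```

With these numbers the value is close to the prior-weighted expected utility 0.65 when λ is near zero and falls toward the worst outcome's utility as λ grows, matching the SEU and maxmin limits described above.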

Endogenous Misspecification Concern: Unlike prior robust control models where λ is fixed or set externally, here λ(hₜ) = LRT(hₜ, Θ) / (2βₜ) is a function of how well the structured models explain the observed history hₜ via the likelihood ratio test statistic. The better the models explain past data, the smaller λ becomes and the less the agent hedges.

Statistician Type: An agent who scales the likelihood ratio test statistic with a linear time sequence βₜ = ct for some c > 0. This is the unique agent type satisfying both ε-safety (guaranteed long-run average payoff above the maxmin guarantee minus ε) and ε-consistency under almost correct specification (no long-run regret when misspecification is small). The statistician type’s linear scaling is the only one for which the LRT statistic retains asymptotic informativeness about the degree of misspecification.

c-Robust Equilibrium: A fixed-point concept for the long-run behavior of the statistician type. Action a* is a c-robust equilibrium if it is an average robust control best reply to beliefs supported on Θ(a*) = argmin_θ R(p*_{a*} || qθ_{a*}), with misspecification concern λ = minθ R(p*_{a*} || qθ_{a*}) / c. This generalizes Berk-Nash equilibrium by incorporating an endogenous hedging motive proportional to the minimum relative entropy between the true DGP and the best structured model.
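
The fixed-point condition can be checked by brute force in a small finite environment. The Python sketch below is a minimal illustration under stated assumptions: finite actions, outcomes, and models, and beliefs taken to be uniform over the relative entropy minimizers (the definition only requires some belief supported on them; the two coincide when the minimizer is unique). The two-action example is hypothetical and only mimics the safe-versus-risky structure discussed in the text.

```python
import math

def kl(p, q):
    """Relative entropy R(p || q) on a finite outcome set."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            total += pi * math.log(pi / qi)
    return total

def robust_value(utils, q, lam):
    """-(1/lam) log E_q[exp(-lam u)], with lam == 0 giving expected utility."""
    if lam == 0:
        return sum(w * u for u, w in zip(utils, q))
    shift = max(-lam * u for u, w in zip(utils, q) if w > 0)
    s = sum(w * math.exp(-lam * u - shift) for u, w in zip(utils, q) if w > 0)
    return -(shift + math.log(s)) / lam

def pure_c_robust_equilibria(p_star, models, utils, c, tol=1e-12):
    """Return the actions that are pure c-robust equilibria.

    p_star[a]    : true outcome distribution under action a
    models[k][a] : outcome distribution under structured model k and action a
    utils[a][y]  : utility of outcome y under action a
    """
    actions = range(len(p_star))
    equilibria = []
    for a in actions:
        divergences = [kl(p_star[a], m[a]) for m in models]
        min_div = min(divergences)
        best = [k for k, d in enumerate(divergences) if d <= min_div + tol]
        lam = min_div / c  # endogenous concern at the candidate action

        def value(b):
            # Uniform belief over the minimizers (an assumption of this sketch).
            return sum(robust_value(utils[b], models[k][b], lam) for k in best) / len(best)

        if all(value(a) >= value(b) - tol for b in actions):
            equilibria.append(a)
    return equilibria

# Action 0 is safe and perfectly explained; action 1 is risky and misspecified.
p_star = [[0.0, 1.0, 0.0], [0.5, 0.0, 0.5]]
models = [[[0.0, 1.0, 0.0], [0.2, 0.0, 0.8]]]   # a single structured model
utils = [[0.0, 1.0, 1.5], [0.0, 1.0, 1.5]]

eq_small_c = pure_c_robust_equilibria(p_star, models, utils, c=0.1)
eq_large_c = pure_c_robust_equilibria(p_star, models, utils, c=1.0)
print(eq_small_c, eq_large_c)
```

In this example the risky action is a pure equilibrium when c is large (the concern it generates is mild), while for small c neither action is self-sustaining, which is the situation in which the text turns to mixed c-robust equilibria.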

Mixed c-Robust Equilibrium: A generalization of c-robust equilibrium to mixed actions α* ∈ Δ(A) for environments where no pure equilibrium exists. The beliefs are supported on the models minimizing the α*-weighted average relative entropy, and the misspecification concern is tied to that average entropy. Every βₜ-limit frequency is a mixed c-robust equilibrium (Theorem 2). This concept characterizes the long-run time-average behavior when the statistician type cycles.

Law of Small Numbers (LSN) Type / Demanding Type: An agent for whom βₜ = o(t), meaning the time scaling grows sub-linearly. This agent is excessively sensitive to early model failures (analogously to the Law of Small Numbers fallacy of Tversky and Kahneman, 1971, where short-run frequencies are treated as the long-run norm). The long-run behavior of such a type converges to maxmin behavior rather than robust control.

Asymptotic Frequentism (Axiom 9): A novel axiom requiring that conditional preferences after sufficiently long histories with the same empirical outcome frequency must be arbitrarily similar (in a quantitative sense defined by measuring rods x, y, E) to a limiting preference. This axiom pins down the statistician type’s linear time scaling: it implies that the ratio λ_{hₜ} / (LRT(hₜ, Q) / (2t)) converges to a finite limit c, exactly characterizing βₜ = ct.

Berk-Nash Equilibrium: The equilibrium concept (Esponda and Pouzo, 2016) that describes the long-run behavior of lenient (SEU) agents learning under misspecification. An action a* is a Berk-Nash equilibrium if it is an SEU best reply to beliefs supported on Θ(a*) — the KL-minimizing models — without any additional hedging against misspecification. The current paper shows that lenient types converge to Berk-Nash equilibria, while statistician types converge to c-robust equilibria that differ by incorporating a positive misspecification concern.

Efficiency Criteria, Income Taxation, and Heterogeneous Elasticities

Overview

Research Question. Can income tax schedules be justified as utilitarian-optimal without adopting extreme normative assumptions about how household welfare should be measured? The paper proposes a welfare criterion strictly stronger than Pareto efficiency—called rationalizability with bounded curvature—and asks whether observed US income taxes satisfy it.

Starting Point. Any Pareto-efficient nonlinear income tax schedule can, in principle, be rationalized as utilitarian-optimal under some cardinalization of household utilities (i.e., some choice of how to measure the cardinal scale of each household’s well-being). However, the paper shows that rationalizing Pareto-efficient taxes in this way often requires cardinalizations under which there is no population upper bound on the curvature of utility with respect to consumption. Equivalently, a utilitarian planner’s marginal willingness to transfer resources to households must fall arbitrarily quickly with the size of those transfers—an extreme form of status quo bias violated by virtually all quantitative optimal-tax exercises.

The Proposed Criterion. The authors restrict attention to cardinalizations with locally bounded curvature: there exists a finite (though potentially arbitrarily large) upper bound on the coefficient of relative risk aversion across the population. This admits two interpretations: (i) ex post, it requires that the social value of transfers not change arbitrarily quickly with transfer size; (ii) ex ante, it corresponds to a decision-maker behind a veil of ignorance with bounded risk aversion.

Main Theoretical Result. Within a standard Mirrlees model of nonlinear income taxation with arbitrary preference heterogeneity and intensive-margin labor supply, the paper proves that a tax schedule can be rationalized with bounded curvature if and only if government revenues are both decreasing and concave (not merely decreasing) with respect to a class of narrowly targeted “two-bracket” reforms: reforms that raise retention by one dollar local to some income level $z$ and by zero elsewhere. This contrasts with Pareto efficiency, which requires only that revenues be decreasing in these reforms (Bierbrauer, Boyer, and Hansen 2023). The additional requirement of revenue concavity is what distinguishes the bounded-curvature criterion from pure Pareto efficiency.

Sufficient Statistics. The paper derives explicit sufficient-statistics expressions for the first- and second-order derivatives of tax revenue with respect to these targeted reforms. The second derivative depends on higher moments of the elasticity distribution, specifically the income-conditional variance of compensated elasticities of taxable income (ETIs). Revenue convexity—which causes the second-order condition to fail—arises when income-conditional ETI variance is sufficiently high, even holding the mean ETI fixed. The economic mechanism is a “sort-and-extort” dynamic: a small tax reform sorts higher-elasticity households into income brackets where marginal taxes fall and lower-elasticity households into brackets where marginal taxes rise; repeating the reform then exploits this sorting by differentially taxing households by elasticity, as if applying group-specific tax schedules within a uniform income tax.

Empirical Findings. Using the NBER panel of US tax returns from 1979 to 1990, the paper estimates income-conditional mean ETIs of approximately 0.2–0.3 at most income levels. Crucially, it estimates a lower bound on income-conditional ETI variance by comparing elasticities of light versus heavy itemizers (defined by whether a household claims above or below the mean value of deductions in its income bracket). The low-elasticity group has an ETI of approximately zero and the high-elasticity group has an ETI of approximately one, implying a lower bound on ETI variance of roughly 0.2 at most incomes and approximately 0.25 at the top of the distribution. This lower bound is close to—and under plausible assumptions above—the threshold required for the second-order condition to fail. The authors conclude that the US income tax schedule in 1990 was likely Pareto efficient but likely not rationalizable with bounded curvature.

Quantitative Welfare Gains. In a calibrated model with a 50% top marginal tax rate, Pareto-tail shape of 2.5, mean ETI of 0.3, and ETI standard deviation of 0.75 (50% above the estimated lower bound), the planner gains significant welfare from either raising or lowering top marginal taxes. The welfare-maximizing top rate below the baseline is 13.3%, generating social value equivalent to a transfer of $1,966 per top earner. The welfare-maximizing top rate above the baseline is 71.2%, generating social value equivalent to a transfer of $972 per top earner. The revenue-maximizing rate is 80.9% under the baseline calibration, ranging from 74.6% to 86.8% as ETI standard deviation varies by ±25% of the lower bound.

Scope Conditions. The theoretical analysis is restricted to intensive-margin labor supply (abstracting from extensive-margin decisions); the empirical application focuses on top incomes where extensive-margin effects are likely small. The empirical period is 1979–1990, covering major federal and state tax reforms. Results concern local efficiency of the tax schedule, not global optimization.

Q&A

Q1: What exactly is “rationalizability with bounded curvature” and how does it differ from Pareto efficiency? A: Pareto efficiency requires that no small reform makes someone better off without making anyone worse off. Rationalizability (with any cardinalization) is equivalent to Pareto efficiency in this setting. Rationalizability with bounded curvature additionally restricts the cardinalization: there must exist a finite upper bound on the coefficient of relative risk aversion (or equivalently, on the curvature of utility with respect to consumption) across the population. This is a strictly stronger criterion than Pareto efficiency. A schedule can be Pareto efficient but not rationalizable with bounded curvature if the only cardinalizations that rationalize it require unbounded consumption utility curvature.

Q2: Why do “extreme” cardinalizations with unbounded curvature arise when rationalizing Pareto-efficient taxes? A: When a Pareto-efficient schedule is rationalized as utilitarian, the cardinalization must make the set of feasible, recardinalized utilities convex so it can be separated from the set of Pareto-improving allocations. The paper constructs such a cardinalization explicitly: it takes the form of a function whose second derivative approaches negative infinity as utility approaches its baseline value. This implies the planner’s marginal value of transfers to a household falls precipitously as the household is made even slightly better off—an extreme status quo bias. Theorem 2.b establishes that all cardinalizations rationalizing a schedule with convex revenues must share this pathology.

Q3: What is the “sort-and-extort” mechanism and how does it generate revenue convexity? A: When elasticities of taxable income (ETIs) are heterogeneous within an income level and the income density is declining steeply, a reform that lowers marginal taxes around income $z$ brings more households into the local bracket (because there are more households just below $z$ than above). Crucially, it disproportionately attracts households with higher ETIs, since they respond more strongly to the marginal tax cut and relocate from further away, where the density differs more. Repeating the reform therefore faces a higher-elasticity composition at $z$, generating larger positive behavioral effects—making revenues convex in the size of the reform. The second step (“extort”) involves raising taxes on the now-concentrated low-elasticity households at adjacent brackets, achieving as-if group-specific taxation within a single income tax schedule.

Q4: What is the precise relationship between revenue convexity and ETI variance? A: The paper shows (Theorem 4) that the second-order revenue derivative with respect to a narrow two-bracket reform around income $z$ equals a positive function of the income density times the expression $-[1-R'_0(z)]\varepsilon(z) + [1-R'_0(z)]\alpha(z)\left[\varepsilon^2(z) + \text{var}_h[\varepsilon^h \mid z^h_0=z]\right]$. The first term is always negative (pushing toward revenue concavity). The second term, which includes the income-conditional variance of ETIs, can dominate and create revenue convexity when ETI variance is sufficiently large. In the benchmark case with a single household type at each income (no within-income heterogeneity), the variance term vanishes and revenues are always concave whenever decreasing.

Q5: What is the sufficient statistics test for rationalizability at the top of the income distribution? A: At top incomes (assuming no income effects, no super-elasticities, and CES preferences), taxes are Pareto efficient if and only if $\tau_\text{top} < \frac{1}{1+\alpha_\text{top}\varepsilon_\text{top}}$, and they are rationalizable with bounded curvature if and only if additionally $\tau_\text{top} < \frac{2}{1+\alpha_\text{top}(\varepsilon_\text{top} + \sigma^2_\text{top}/\varepsilon_\text{top})}$, where $\tau_\text{top}$ is the top marginal tax rate, $\alpha_\text{top}$ is the Pareto tail shape, $\varepsilon_\text{top}$ is the mean ETI at the top, and $\sigma^2_\text{top}$ is the income-conditional ETI variance at the top.

Q6: How does the paper estimate a lower bound on income-conditional ETI variance? A: The authors divide households at each income level into “heavy” and “light” itemizers based on whether their total deductions exceed the local income-bracket mean. They then estimate group-specific ETIs using local polynomial regressions of log income changes on log marginal retention changes, interacting tax changes with heavy-itemizer indicators. The within-year difference in elasticities between groups provides a lower bound on within-income ETI variance, since the two-group decomposition captures only a fraction of true variance. The interaction coefficient is allowed to vary by year to isolate within-year, within-income variation in elasticities rather than between-year compositional changes.

Q7: What are the estimated magnitudes of mean and variance of ETIs? A: Income-conditional average ETIs are estimated at between 0.2 and 0.3 at most income levels, consistent with but somewhat below prior literature estimates. The low-elasticity group (light itemizers) has an ETI of approximately zero, while the high-elasticity group (heavy itemizers) has an ETI of approximately one. Given roughly equal group sizes, this implies a lower bound on ETI variance of approximately 0.2 at most incomes and approximately 0.25 at the ninety-fifth percentile. Subdividing the high-elasticity group into two, three, and four subgroups yields a lower bound of approximately 0.25 for variance at the top.
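
The step from group-specific ETIs to a variance lower bound follows from the law of total variance. In the sketch below, $g$ (the itemizer group), $\pi$ (the share of heavy itemizers), and $\varepsilon_L$, $\varepsilon_H$ (the group mean ETIs) are notation introduced here for illustration, not the paper's.

```latex
\operatorname{var}_h[\varepsilon^h \mid z]
  = \mathbb{E}\big[\operatorname{var}(\varepsilon^h \mid g, z)\big]
  + \operatorname{var}\big(\mathbb{E}[\varepsilon^h \mid g, z]\big)
  \;\ge\; \pi(1-\pi)\,(\varepsilon_H - \varepsilon_L)^2 .
% With equal group sizes and group ETIs of roughly zero and one:
% \pi = 1/2,\ \varepsilon_L \approx 0,\ \varepsilon_H \approx 1
% \;\Rightarrow\; \operatorname{var}_h[\varepsilon^h \mid z] \gtrsim 0.25 .
```

The inequality drops the within-group variance term, which is why the two-group comparison yields only a lower bound.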

Q8: How does the back-of-the-envelope calculation work to assess whether the second-order test fails? A: With $\tau_\text{top} \approx 0.5$, $\alpha_\text{top} \approx 2.5$, and $\varepsilon_\text{top} \approx 0.3$ (from prior literature), the second-order condition fails if and only if ETI variance exceeds approximately 0.27. The authors’ lower bound estimate of ETI variance is already approximately 0.25 (standard deviation approximately 0.5), just below this threshold. The authors note that if the true standard deviation exceeds the lower bound by more than 4%, the second-order condition fails, making it empirically likely that the 1990 US tax schedule was not rationalizable with bounded curvature.
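
The back-of-the-envelope calculation can be reproduced with a short Python sketch based on the two top-income inequalities stated in Q5. The function names are hypothetical, and the variance threshold is obtained simply by solving the second inequality for the variance.

```python
def pareto_bound(alpha, eps):
    """Upper bound on the top rate for Pareto efficiency: 1 / (1 + alpha * eps)."""
    return 1.0 / (1.0 + alpha * eps)

def bounded_curvature_bound(alpha, eps, var):
    """Upper bound on the top rate for rationalizability with bounded curvature."""
    return 2.0 / (1.0 + alpha * (eps + var / eps))

def variance_threshold(tau, alpha, eps):
    """ETI variance at which tau exactly equals bounded_curvature_bound."""
    return eps * ((2.0 / tau - 1.0) / alpha - eps)

tau, alpha, eps = 0.5, 2.5, 0.3        # values quoted in the text
sd_lower_bound = 0.5                   # estimated lower bound on the ETI standard deviation

var_star = variance_threshold(tau, alpha, eps)
margin = var_star ** 0.5 / sd_lower_bound - 1.0

print("Pareto bound on top rate:", pareto_bound(alpha, eps))
print("Variance threshold:", var_star)
print("Margin over sd lower bound:", margin)
print("Second-order test at sd = 0.5:", tau < bounded_curvature_bound(alpha, eps, 0.5 ** 2))
print("Second-order test at sd = 0.75:", tau < bounded_curvature_bound(alpha, eps, 0.75 ** 2))
```

With these inputs the threshold variance is 0.27, the first-order (Pareto) test passes, and the second-order test passes at the lower-bound standard deviation of 0.5 but fails once the standard deviation is about 4% higher, as it does at the calibrated value of 0.75.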

Q9: Why does the paper focus on the top of the income distribution for the empirical test? A: The second-order condition is most likely to fail at high incomes for three reasons simultaneously: (i) the marginal tax rate is highest, (ii) ETI means are somewhat higher there, and (iii) the Pareto parameter $\alpha(z)$ is largest (income density falls steeply), which amplifies the sort-and-extort mechanism. The authors also note that extensive-margin labor supply responses—which are abstracted away in the theory—are likely small at high incomes.

Q10: What does the calibrated quantitative application reveal about optimal top tax policy? A: Calibrated with a 50% initial top marginal tax rate, Pareto tail shape of 2.5, mean ETI of 0.3, and ETI standard deviation of 0.75 (50% above the estimated lower bound), the model finds welfare gains in both directions of reform. The welfare-maximizing rate below the baseline is 13.3%, yielding equivalent welfare gains of $1,966 per top earner. The welfare-maximizing rate above the baseline is 71.2%, yielding equivalent gains of $972 per top earner. The revenue-maximizing rate is 80.9%, ranging from 74.6% to 86.8% when ETI standard deviation varies by ±25% of the lower bound. This sensitivity highlights that the optimal direction and magnitude of reform depend substantially on the uncertain degree of ETI heterogeneity.

Q11: How does the paper relate to the “inverse optimum” literature? A: The inverse optimum approach (Bourguignon and Spadaro 2012; Hendren 2020) infers the first-order welfare trade-offs implicit in an observed tax schedule. This paper goes further by inferring from second-order empirical moments—specifically the income-conditional ETI variance—whether taxes are consistent with minimal requirements on how sensitive the planner’s trade-offs are to household welfare levels. Rather than assuming a welfare function, it tests whether any welfare function with bounded curvature can rationalize the observed schedule.

Q12: Is revenue convexity possible without within-income heterogeneity in preferences? A: Yes, but only under more specific conditions. The paper provides two supplemental examples. In the first, all households have constant-elasticity labor disutility but differ in both productivity and elasticity across income levels; when lower-income households have higher elasticities, a reform reducing marginal taxes at $z$ attracts higher-elasticity households and raises the average elasticity, leading to convex revenues. In the second, all households have the same initial elasticity but individual elasticities change in response to reforms. However, with the standard additively separable CES preferences and no within-income heterogeneity, revenues are always concave when decreasing—consistent with Werning’s (2007) observation that the Pareto planner’s problem is convex in this case.

Q13: What is the role of random tax reforms in the paper’s logic? A: Random tax reforms serve as an expository bridge. The paper shows that if the second-order revenue effect of a two-bracket reform is positive at some income $z$, then a “randomized” reform that applies the reform with equal probability in positive and negative directions generates an expected Pareto improvement—because the convexity of revenues implies expected revenues rise, while for any household with bounded risk aversion the reform’s second-order utility effect is also positive when the reform is sufficiently narrow. This establishes that revenue convexity implies random Pareto inefficiency under bounded risk aversion, and then the paper shows the analogous deterministic result for rationalizability.

Q14: What scope conditions attach to the sufficient conditions for rationalizability (Theorem 3)? A: Theorem 3 requires Assumptions 1 and 3 plus two boundary conditions: the ratio $\delta\text{Rev}(z)/(zg(z))$ must remain bounded away from zero as income approaches 0 or infinity, and at all incomes there must exist households with low enough compensated elasticities. Assumption 1 requires that average and marginal taxes have upper bounds below one, that marginal taxes have a lower bound, and that $zg(z)$ converges to zero at the boundaries. Assumption 3 is a regularity condition on how conditional moments of the elasticity distribution vary with income. These conditions ensure that the narrow, self-financing reforms considered in the necessity proof cannot generate welfare improvements once revenues are both decreasing and concave.

Key Concepts

Rationalizability with Bounded Curvature. The property that a tax schedule is utilitarian-optimal under some cardinalization of household utilities in which there exists a finite (though potentially arbitrarily large) upper bound on the curvature of utility with respect to consumption across the population. Formally, there exists a continuous function $\bar{\rho}$ such that, for all households, the absolute value of $[w_h \circ u_h]_{cc} / [w_h \circ u_h]_c$ is bounded by $\bar{\rho}$ evaluated at the household’s income. This criterion is strictly stronger than Pareto efficiency and strictly weaker than utilitarian optimality under a fixed cardinalization.

Two-Bracket Reform. A targeted tax reform that increases retention (post-tax income) by \$1 at incomes local to some level $z$ over a small bracket of width $\ell$, and by zero elsewhere (smoothed at the edges). As $\ell \to 0$, this becomes an infinitesimally narrow reform. The first- and second-order revenue effects of these reforms, denoted $\delta\text{Rev}(z)$ and $\delta^2\text{Rev}(z)$, are the paper’s key objects: Pareto efficiency requires $\delta\text{Rev}(z) < 0$ for all $z$, and rationalizability with bounded curvature additionally requires $\delta^2\text{Rev}(z) \leq 0$ for all $z$.

Income-Conditional ETI Variance. The variance of compensated elasticities of taxable income (ETIs) among households with the same income level, $\text{var}_h[\varepsilon^h | z^h_0 = z]$. This is the paper’s primary empirical object of interest and the key determinant of whether revenues are convex or concave in the size of targeted reforms. Unlike the literature’s focus on mean ETIs by income bracket, this within-income variance captures heterogeneity among households sharing the same pre-reform income.

Sort-and-Extort Mechanism. The two-step economic mechanism underlying revenue convexity from ETI heterogeneity. In the first step (“sort”), a marginal tax cut around income $z$ disproportionately attracts higher-ETI households from lower incomes (because they respond more strongly and relocate from further away), shifting the elasticity composition at $z$ upward. In the second step (“extort”), repeating the reform finds higher-elasticity households concentrated where marginal taxes fall and lower-elasticity households where taxes rise, effectively applying differential tax treatment by elasticity within a single income tax schedule.

Local Pareto Parameter $\alpha(z)$. Defined as $-d\log(zg(z))/d\log z$, where $g(z)$ is the income density. This captures the rate at which the income density is falling in income locally at $z$, and governs the strength of the sort-and-extort mechanism. High $\alpha(z)$ at top incomes (reflecting a steeply declining Pareto-type density) amplifies revenue convexity from ETI heterogeneity.
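
As a sanity check on the definition, the sketch below evaluates $\alpha(z)$ numerically for a textbook Pareto density, where the answer should equal the Pareto shape parameter. The function names, the step size `h`, and the choice of an exact Pareto density are illustrative assumptions, not taken from the paper.

```python
import math

def local_pareto_parameter(density, z, h=1e-4):
    """Numerically evaluate alpha(z) = -d log(z * g(z)) / d log z.

    Uses a central difference in log z; `density` is any positive function g.
    """
    log_z = math.log(z)
    z_up, z_dn = math.exp(log_z + h), math.exp(log_z - h)
    f_up = math.log(z_up * density(z_up))
    f_dn = math.log(z_dn * density(z_dn))
    return -(f_up - f_dn) / (2.0 * h)

def pareto_density(shape, z_min=1.0):
    """Pareto density g(z) = shape * z_min**shape / z**(shape + 1) for z >= z_min."""
    return lambda z: shape * z_min**shape / z ** (shape + 1)

# With an exact Pareto tail, z * g(z) is proportional to z**(-shape),
# so alpha(z) is constant and equal to the shape parameter.
alpha_top = local_pareto_parameter(pareto_density(2.5), z=10.0)
print(round(alpha_top, 6))
```

For a density that is not exactly Pareto, the same function returns a value that varies with $z$, which is the sense in which the parameter is "local".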

Super-Elasticity. A concept that captures how a household’s compensated ETI would change if its income were different, holding preferences fixed. Formally, it is the derivative of the household’s elasticity with respect to its log income, decomposing into effects from changes in preference curvature and changes in the local curvature of the tax schedule. Super-elasticities are zero in the benchmark case of additively separable CES preferences and locally CES retention schedules but contribute additional terms to the second-order revenue expression in the general case.

Cardinalizing Function. A strictly increasing function $w_h$ that maps household $h$’s indirect utility $V_h$ to a cardinalized utility level $w_h(V_h)$. The social planner maximizes the expectation of cardinalized utilities. Different choices of $\{w_h\}_h$ correspond to different stances on interpersonal comparisons, including unbounded curvature (rationalizing any Pareto-efficient schedule) or bounded curvature (the paper’s proposed restriction). Rawlsian social welfare is a limit of utilitarian welfare with increasingly concave cardinalizing functions.

Eliciting Multiple Prior Beliefs

Multiple prior decision models—in which beliefs are represented by a set of probability measures rather than a single measure, generating a probability interval for each event—have become increasingly important in economics, but choice-based incentive-compatible elicitation of probability intervals remains an open problem: existing scoring rules and matching-probability methods cannot recover probability intervals without assuming probabilistic sophistication that is precisely least warranted in settings where multiple priors are most relevant. This paper develops a preference-based identification of a subject’s probability interval for an event, and a method for eliciting it under weak decision-theoretic assumptions with no need for probabilistic sophistication. Three incentivized experiments on artificial and natural sources of uncertainty demonstrate that the elicited intervals are sensitive to the direction and amount of information, are typically consistent with objective probabilities where available, and exhibit a predominance of non-degenerate probability intervals that are wider when there is less information or predictability. On aggregate, the choice-based intervals are similar to stated probability intervals, providing behavioral foundations for the use of stated interval techniques in the field.

Summary of a forthcoming paper, AI-assisted and human-reviewed. See the linked original for the authoritative claims and full conditions.

In depth

Q1. What is the key identification challenge for multiple prior elicitation?

The key challenge is that existing incentive-compatible elicitation methods—scoring rules and matching-probability approaches—confound a subject’s probability interval with their ambiguity attitude, so they cannot separately identify the probability interval without assuming probabilistic sophistication. Under the popular α-maxmin EU model, the matching probability of an event depends on both the subject’s probability interval and their ambiguity attitude parameter α; even eliciting both the event and its complement’s matching probabilities yields two equations in three unknowns. Probabilistic sophistication is least warranted precisely in settings with deep uncertainty where multiple priors are most relevant, making precision-laden methods unsuitable.
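
The "two equations in three unknowns" point can be illustrated with a small numerical sketch. It assumes the common convention in which α is the weight placed on the worst-case prior, so a bet on an event with probability interval [p_low, p_high] has matching probability α·p_low + (1 − α)·p_high. The function names and the two hypothetical subjects are illustrative, not taken from the paper.

```python
def matching_probability(p_low, p_high, alpha):
    """Matching probability of a bet on an event under alpha-maxmin EU.

    Assumed convention: alpha is the weight on the worst-case prior, so the
    bet is valued at alpha * p_low + (1 - alpha) * p_high.
    """
    return alpha * p_low + (1.0 - alpha) * p_high

def observed_pair(p_low, p_high, alpha):
    """Matching probabilities for the event and for its complement."""
    m_event = matching_probability(p_low, p_high, alpha)
    # The complement's probability interval is [1 - p_high, 1 - p_low].
    m_complement = matching_probability(1.0 - p_high, 1.0 - p_low, alpha)
    return m_event, m_complement

# Two hypothetical subjects with different intervals and ambiguity attitudes.
wide_interval = observed_pair(p_low=0.3, p_high=0.7, alpha=0.75)
narrow_interval = observed_pair(p_low=0.4, p_high=0.6, alpha=1.0)

print(wide_interval, narrow_interval)
```

Both subjects produce the same pair of matching probabilities (about 0.4 for the event and 0.4 for its complement) even though their intervals differ, so the two observations cannot pin down p_low, p_high, and α separately.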

Q2. What is the paper’s elicitation solution?

The paper develops a preference-based method that identifies a subject’s probability interval under weak decision-theoretic assumptions—with no need for probabilistic sophistication—using a series of incentivized choices, and demonstrates its feasibility in three laboratory experiments. The approach comprises two components: (i) a preference-based identification theorem establishing the conditions under which the probability interval can be recovered from observable choices; and (ii) a concrete elicitation procedure that is incentive compatible and does not impose the precision-laden assumption of probabilistic sophistication.

Q3. What do the experiments show?

Three incentivized experiments on artificial and natural sources of uncertainty demonstrate that probability intervals elicited by the method are sensitive to the direction and amount of information, are typically consistent with objective probabilities where available, and predominantly non-degenerate—with intervals wider when there is less information or predictability. The sensitivity to information and consistency with objective probabilities provide external validation that the elicited intervals capture real beliefs rather than noise or confusion. The predominance of non-degenerate intervals (rather than point probabilities) indicates that subjects genuinely hold imprecise beliefs in the relevant settings.

Q4. What is the relationship between choice-based and stated probability intervals?

On aggregate, probability intervals elicited with the choice-based method are similar to those stated by subjects, suggesting that the new method can provide behavioral foundations for the use of stated probability-interval techniques that are widely used in field surveys but previously lacked incentive-compatible grounding. This convergence is informative because stated intervals are cognitively simpler and can be collected at large scale in surveys, while the choice-based intervals are theoretically grounded; the consistency between them justifies the use of simpler stated methods in field applications.

Key concepts

multiple priors: a model of beliefs in which a decision maker’s uncertainty is represented by a set of probability measures rather than a single measure; associated with the Gilboa-Schmeidler (1989) maxmin expected utility model and its generalizations; generates a probability interval for each event.

probability interval: the interval [p̲(E), p̄(E)] of probability values a subject’s set of priors assigns to event E; non-degenerate (with width > 0) when the subject’s beliefs are genuinely imprecise.

incentive-compatible elicitation: an elicitation procedure in which subjects’ optimal strategy is to report their true beliefs; for Bayesian single-prior beliefs, achieved by scoring rules and matching-probability methods, but these fail for multiple priors.

probabilistic sophistication: the assumption that a multiple-prior agent’s set of priors is generated by precise probabilistic beliefs; existing methods require this assumption to disentangle the probability interval from ambiguity attitude, but the paper’s method does not.

Linking Social and Personal Preferences: Theory and Experiment

This paper asks whether an individual’s attitude toward risk in the personal domain (choices affecting only oneself) can be linked to that same individual’s attitude toward risk in the social domain (choices affecting both oneself and others). The authors provide a theoretical answer in the form of necessary and sufficient conditions, and then test those conditions experimentally.

The formal model posits a decision maker (DM) with a preference relation over lotteries on a set of social states, where a distinguished subset of states are personal (consequences for the DM alone). The authors assume preferences satisfy Completeness, Transitivity, Continuity, and State Monotonicity — the last being equivalent to respect for First-Order Stochastic Dominance (FOSD), a condition weaker than the Expected Utility Independence Axiom and satisfied by virtually all extant decision theories including Weighted Expected Utility, Rank-Dependent Utility, and Prospect Theory. The key theoretical result (Theorem 1) establishes that the full preference relation over all social lotteries can be uniquely deduced from the partial observations of (i) riskless social choices and (ii) risky personal choices if and only if the DM finds every social state indifferent to some personal state. When this condition fails, there exist social lotteries whose ranking cannot be recovered from the partial data.

For two empirically relevant preference types, this condition generates directly testable predictions: for selfish subjects (who allocate nothing to others in deterministic social choices), risky personal preferences must coincide with risky social preferences; for impartial subjects (who treat self and other symmetrically in deterministic social choices), riskless social preferences must coincide with risky social preferences.

The experiment was conducted at the University of Bergen and NHH Norwegian School of Economics with 276 undergraduate subjects. Each subject faced 50 budget-line choice problems in each of three domains: Personal Risk (equiprobable binary lotteries over own payoffs only), Social Choice (deterministic splits between self and an anonymous other), and Social Risk (equiprobable binary lotteries over symmetric payout pairs for self and other). The graphical interface of Choi et al. (2007b) was used throughout. One randomly selected decision per domain was paid out; each token was worth 1.2 NOK (approximately 0.2 USD), with average earnings of approximately 270 NOK.

Within-domain consistency, measured by the Critical Cost Efficiency Index (CCEI), is high: mean CCEIs are 0.959, 0.952, and 0.902 in the Personal Risk, Social Choice, and Social Risk domains respectively. At the CCEI > 0.90 threshold, 89.9%, 85.9%, and 69.9% of subjects pass in the three domains. Using a 0.95 share-to-self threshold, 103 subjects (37.3%) are classified as selfish; using revealed-preference criteria at the 5% significance level, 33 subjects (12.0%) are classified as impartial.

Testing is done via an individual-level nonparametric permutation test that draws 10,000 random data sets per subject and compares simulated CCEI distributions to actual cross-domain CCEIs, with Bonferroni correction. At the 1% significance level, the null that Personal Risk and Social Risk preferences coincide is rejected for only 5.9%–9.3% of selfish subjects (varying by classification threshold), compared with 14.7%–16.3% rejection rates for non-selfish subjects. For impartial subjects at the 1% level, the null that Social Choice and Social Risk preferences coincide is rejected for 0.0%–11.1%, compared with 19.8%–26.8% for non-impartial subjects. The theory’s predictions are thus supported for a large majority of both selfish and impartial subjects.

A theoretical extension (Theorem 2) shows that if one additionally observes comparisons between social states and personal lotteries, unique deduction of the full preference relation requires that preferences in both personal and social domains satisfy Expected Utility (Independence Axiom) and that every social state is indifferent to some personal lottery — a strictly stronger set of conditions.

Q: What is the central theoretical question and why does it matter? A: The paper asks whether preferences over risky social choices (lotteries over outcomes for self and others) can be deduced from observing only riskless social choices and risky personal choices. This matters because people frequently observe or predict the risky social choices of leaders and representatives, but may have access only to those leaders’ personal risk-taking behavior and their expressed social preferences under certainty.

Q: What is the main theoretical result (Theorem 1)? A: Under Completeness, Transitivity, Continuity, and State Monotonicity, a unique extension of the partial preference relation (over social states and personal lotteries) to the full domain of social lotteries exists if and only if every social state is indifferent to some personal state. When this condition is not met, multiple distinct preference relations can extend the partial observations, making deduction impossible.

Q: What is State Monotonicity and how does it relate to standard axioms? A: State Monotonicity requires that if each social state in one lottery dominates the corresponding state in another lottery, then the first lottery is weakly preferred. The paper shows this is equivalent to respect for First-Order Stochastic Dominance (FOSD) given the other axioms, and is strictly weaker than the von Neumann–Morgenstern Independence Axiom. It is satisfied by Weighted Expected Utility, Rank-Dependent Utility, and Prospect Theory, making it a broadly applicable assumption.

Q: What are the testable predictions for selfish subjects? A: Proposition 2 establishes that if a subject’s Social Choice preferences are selfish, meaning any bundle (x, y) is indifferent to (0, y) so that the amount x going to the other person does not affect the subject’s ranking, then preferences in the Personal Risk domain must coincide with preferences in the Social Risk domain. In the experiment, selfish subjects are those allocating more than 95% of tokens to themselves in the Social Choice domain (103 of 276 subjects, or 37.3%).

Q: What are the testable predictions for impartial subjects? A: Proposition 3 establishes that if a subject’s Social Choice preferences are symmetric — meaning (x, y) is indifferent to (y, x) for all pairs — then preferences in the Social Choice domain must coincide with preferences in the Social Risk domain, implying risk neutrality toward social lotteries. The intuition is that such a subject treats self and other identically, so risky splits are evaluated by expected value alone. In the experiment, 33 subjects (12.0%) are classified as impartial by the revealed-preference criterion at the 5% significance level.

Q: How does the experiment measure within-domain rationality? A: Choices within each domain are evaluated using the Critical Cost Efficiency Index (CCEI, following Afriat 1967), which measures how much a budget constraint must be relaxed to remove all GARP violations. Mean CCEIs are 0.959 (Personal Risk), 0.952 (Social Choice), and 0.902 (Social Risk). At the CCEI > 0.90 threshold, 248 subjects (89.9%), 237 (85.9%), and 193 (69.9%) pass in the three domains respectively, compared to a simulated mean CCEI of only 0.585 for subjects randomizing uniformly.
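
To make the index concrete, here is a minimal sketch of how a CCEI can be computed for budget-line data: check GARP at a given efficiency level using a transitive closure of the revealed-preference relation, then bisect on the efficiency level. This is a generic textbook construction, not the authors' code; the function names, the bisection tolerance, and the two-observation toy data are illustrative assumptions.

```python
def expenditure(price, bundle):
    """Cost of a bundle at the given prices (inner product)."""
    return sum(p * x for p, x in zip(price, bundle))

def satisfies_garp(prices, bundles, e=1.0):
    """Check GARP when every budget is scaled by the efficiency level e."""
    n = len(prices)
    cost = [[expenditure(prices[t], bundles[s]) for s in range(n)] for t in range(n)]
    # Direct relation: bundle t is revealed preferred to s if s was affordable
    # at the scaled-down budget of observation t.
    revealed = [[e * cost[t][t] >= cost[t][s] for s in range(n)] for t in range(n)]
    # Transitive closure (Warshall's algorithm).
    for k in range(n):
        for i in range(n):
            if revealed[i][k]:
                for j in range(n):
                    if revealed[k][j]:
                        revealed[i][j] = True
    # Violation: t is revealed preferred to s, yet s is strictly directly
    # revealed preferred to t at efficiency level e.
    for t in range(n):
        for s in range(n):
            if revealed[t][s] and e * cost[s][s] > cost[s][t]:
                return False
    return True

def ccei(prices, bundles, tol=1e-6):
    """Largest efficiency level (up to tol) at which the data satisfy GARP."""
    if satisfies_garp(prices, bundles, 1.0):
        return 1.0
    lo, hi = 0.0, 1.0  # GARP holds trivially at 0 and fails at 1
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if satisfies_garp(prices, bundles, mid):
            lo = mid
        else:
            hi = mid
    return lo

# Toy data (hypothetical): each chosen bundle costs 20, while the other
# observation's bundle would have cost only 10 at the same prices.
toy_prices = [(2.0, 1.0), (1.0, 2.0)]
toy_bundles = [(10.0, 0.0), (0.0, 10.0)]
ccei_toy = ccei(toy_prices, toy_bundles)
print(round(ccei_toy, 4))
```

Bisection is valid here because lowering the efficiency level can only remove revealed-preference relations, so once GARP holds at some level it holds at every lower level. In the toy data each budget has to be cut in half before the violation disappears, so the index is 0.5.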

Q: How does the cross-domain test work and why is it nonparametric? A: The test uses individual-level permutation inference: under the null that preferences in domains I and J are identical, any 50-element subset drawn from the pooled 100 choices should satisfy GARP as well as the actual domain-specific choices. For each subject, 10,000 such random draws are generated, their CCEI scores are computed, and the distribution is compared to the actual cross-domain CCEI with Bonferroni correction. The test makes no functional form assumptions about utility and accommodates the observed within-domain errors without parametric error modeling.

Q: What are the rejection rates for the selfish-subject prediction? A: At the 1% significance level, the null that Personal Risk and Social Risk preferences coincide is rejected for only 5.9%–9.3% of selfish subjects (range across four classification thresholds from 0.99 to 0.90 share-to-self), compared to 14.7%–16.3% for non-selfish subjects. At the 5% level, rejection rates rise to 20.4%–25.6% for selfish and 22.4%–31.8% for non-selfish subjects.

Q: What are the rejection rates for the impartial-subject prediction? A: At the 1% significance level, the null that Social Choice and Social Risk preferences coincide is rejected for 0.0%–11.1% of impartial subjects (range depending on threshold and classification method), compared to 19.8%–26.8% for non-impartial subjects. At the 5% and 10% levels, rejection rates for impartial subjects range from 0.0% to 22.2%.

Q: Does the theory predict how risk aversion should map across domains for non-selfish, non-impartial subjects? A: The theory does not directly produce testable cross-domain predictions for subjects who are neither selfish nor impartial without additional parametric assumptions, because the specific personal-state equivalent of each social state depends on the form of preferences. The paper restricts its nonparametric tests to the two polar cases where the equivalence mapping is determinate from social choice behavior alone.

Q: What is the extended result (Theorem 2) and what stronger conditions does it require? A: When one additionally observes comparisons between social states and personal lotteries (not just within each domain separately), unique deduction of the full preference relation is possible if and only if preferences in both the personal and social domains are consistent with an Expected Utility representation and every social state is indifferent to some personal lottery. This requires the Independence Axiom — a strictly stronger condition than State Monotonicity — highlighting that the main Theorem 1 result exploits the weaker observational structure.

Q: What is the distribution of social preferences in the sample? A: Of 276 subjects, 103 (37.3%) are classified as selfish at the 0.95 share-to-self threshold. Only 6 subjects (2.2%) kept less than a 0.45 share of tokens on average, making purely altruistic subjects rare. In the Personal Risk domain, 41 subjects (14.9%) allocated more than 95% to the cheaper account (consistent with risk neutrality), while 9 (3.3%) allocated less than 55% (consistent with infinite risk aversion). In the Social Risk domain, 30 subjects (10.9%) are consistent with utilitarianism in money and 9 (3.3%) with Rawlsianism in money.

Q: How does the Social Risk domain compare to the Personal Risk and Social Choice domains in terms of rationality scores? A: The Social Risk domain shows lower consistency than the other two: mean CCEI is 0.902 versus 0.959 and 0.952, and only 69.9% of subjects exceed the 0.90 threshold versus 89.9% and 85.9%. The CCEI distribution is shifted left for Social Risk, suggesting the novel combined dimension of social and risky choice introduces more decision complexity or error.

Q: What is the relationship to the prior experimental literature on social and risk preferences? A: The Personal Risk domain replicates the symmetric risk experiment of Choi et al. (2007a), and the Social Choice domain replicates the linear two-person dictator experiment of Fisman et al. (2007). The Social Risk domain is new to this paper. The theoretical framework connects to Saito (2013) on social preferences under risk, and to the preference extension literature of Grant et al. (1992) and Nishimura et al. (2017).

State Monotonicity: The axiom requiring that if each social state in one lottery weakly dominates the corresponding social state in another lottery, the first lottery is weakly preferred. The paper proves this is equivalent to respect for First-Order Stochastic Dominance given Completeness, Transitivity, and Continuity, and distinguishes it from the stronger Independence Axiom by noting that Independence compares lotteries over lotteries while State Monotonicity only compares lotteries over states.

Selfish preferences (in the paper’s sense): Preferences in the Social Choice domain such that (x, y) is indifferent to (0, y) for all bundles, so the amount x going to the other person does not affect the subject’s ranking. Operationally measured as allocating more than a threshold share (e.g., 95%) of tokens to self across Social Choice decisions.

Impartial preferences (in the paper’s sense): Preferences in the Social Choice domain such that (x, y) is indifferent to (y, x) for all bundles — the subject treats self and other symmetrically. Operationally identified by the revealed preference criterion that choices in the Social Choice domain satisfy GARP and are consistent with symmetric treatment.

Unique extension (deducibility): The property that there exists exactly one complete preference relation over all social lotteries that is consistent with the axioms and agrees with the observed partial relation over social states and personal lotteries. Theorem 1 identifies the necessary and sufficient condition for unique extension under State Monotonicity.

Personal state indifference condition: The condition that for every social state $\omega \in \Omega \setminus P$, there exists some personal state in $P$ to which the DM is indifferent. This is the necessary and sufficient condition in Theorem 1 for deducibility of the full preference relation. Interpreted as: for every proposed social allocation, there exists a “bribe” (a personal allocation with nothing for others) that the DM finds equally desirable.

Critical Cost Efficiency Index (CCEI): A measure of how much budget constraints must be scaled down to eliminate all GARP violations in a dataset of choices from budget lines (following Afriat 1967). A CCEI of 1 indicates perfect rationality; the paper uses 0.90 as a practical threshold. Mean values are 0.959, 0.952, and 0.902 in the Personal Risk, Social Choice, and Social Risk domains respectively.

Nonparametric permutation test: The individual-level test used to assess consistency across choice domains. Under the null that preferences are identical in domains I and J, any random 50-element draw from the pooled 100 choices should achieve CCEI scores no worse than the actual domain scores. The test draws 10,000 permuted datasets per subject and uses the Bonferroni correction for multiple comparisons, making no assumptions about the functional form of utility.

Mis(sed) Diagnosis: Physician Decision Making and ADHD

This paper develops and estimates a structural model of ADHD diagnosis to decompose the mechanisms driving the observed 2.3:1 male-to-female diagnostic difference in the United States. The research question is: to what extent does the large gender gap in ADHD diagnosis reflect true differences in symptom prevalence, versus patient-side utilization costs, versus physician decision-making under uncertainty? The setting is particularly well-suited to this question because DSM-V diagnostic guidelines for ADHD are explicitly gender-neutral, making any gender difference in physician thresholds a detectable deviation from uniform clinical rules.

The data come from de-identified electronic health records from a large Arizona healthcare system covering January 2014 through September 2017. The sample encompasses 36,193 unique encounters for approximately 11,070 pediatric patients. The raw male-to-female diagnostic ratio in the data is 2.32:1 (7.2% of males vs. 3.1% of females receive a clinical ADHD diagnosis). This gap persists after controlling for demographics, general healthcare utilization, and mental health utilization in reduced-form regressions, motivating the structural approach.

Because two key variables — whether a patient received a behavioral assessment (Qi) and the ADHD match signal observed by the physician (xi) — are not directly recorded in the EHR, the author constructs them from clinical doctor note text. A random forest machine learning classifier trained on labeled appointments predicts behavioral assessment take-up for unlabeled encounters; approximately 20.8% of children are predicted to have received a behavioral assessment (23.2% of males vs. 18.3% of females). The ADHD match signal is constructed via an adjusted Bag-of-Words cosine similarity measure comparing each patient’s aggregated note text to the DSM-V symptom list, rescaled to [0,1]. The average signal is 0.319 overall, with males averaging 0.326 and females 0.311.

The structural model has three stages. First, patients/caregivers decide whether to schedule a behavioral assessment, a function of underlying latent ADHD risk (vi) and mental healthcare utilization costs (ci). Second, conditional on assessment, the physician receives a noisy signal of vi and updates beliefs via Bayesian learning; signal quality ρ governs diagnostic uncertainty. Third, the physician diagnoses ADHD if posterior risk exceeds a gender-specific diagnostic threshold τ. Population mean ADHD risk (μ) is identified using regression-adjusted initial primary care provider referral rates as a quasi-exogenous cost-shifter — patients of high-referral-rate providers select into assessment less selectively, so their observed signals approach population mean risk. This extrapolation approach follows Arnold et al. (2022).

The structural parameter estimates reveal that male and female children have similar mean ADHD risk, slightly higher for males (μm = 0.290 vs. μf = 0.262), and similar mean utilization costs (cm = 0.116 vs. cf = 0.109). The most striking differences are in physician parameters: signal quality is lower for male patients (ρm = 0.479 vs. ρf = 0.552), indicating higher diagnostic uncertainty for boys; and diagnostic thresholds are substantially lower for male patients (τm = 0.257 vs. τf = 0.312), meaning physicians are willing to diagnose ADHD in boys with lower posterior risk.
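
To see how signal quality and the threshold interact in a rule of this kind, the sketch below uses a simple linear shrinkage form: posterior risk is a weighted average of the signal and the prior mean, with weight ρ on the signal. This functional form is an assumption made for illustration (it is the standard normal-normal updating formula) and is not confirmed to be the paper's exact specification. The function names and the signal value of 0.30 are hypothetical; only the parameter values are the estimates quoted above.

```python
def posterior_risk(signal, prior_mean, signal_quality):
    """Assumed shrinkage rule: weight signal_quality on the signal,
    and the remainder on the prior (population) mean."""
    return signal_quality * signal + (1.0 - signal_quality) * prior_mean

def diagnoses(signal, prior_mean, signal_quality, threshold):
    """Diagnose when posterior risk exceeds the threshold."""
    return posterior_risk(signal, prior_mean, signal_quality) > threshold

# Parameter estimates as quoted in the summary above.
ESTIMATES = {
    "male": {"mu": 0.290, "rho": 0.479, "tau": 0.257},
    "female": {"mu": 0.262, "rho": 0.552, "tau": 0.312},
}

# Hypothetical signal value, chosen only to exercise the rule.
hypothetical_signal = 0.30

decisions = {
    group: diagnoses(hypothetical_signal, p["mu"], p["rho"], p["tau"])
    for group, p in ESTIMATES.items()
}
print(decisions)
```

Under this assumed rule, the same hypothetical signal clears the male threshold but not the female one. This illustrates the mechanism only; it is not a result reported in the paper.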

Counterfactual decomposition simulations attribute approximately 20–25% of the 2.32:1 diagnostic gap to underlying differences in ADHD risk, approximately 20% to differences in selection into behavioral assessments, and the remaining majority — approximately 55–60% — to physician decision-making. Within physician decision-making, differences in diagnostic thresholds alone account for roughly two-thirds of the overall diagnostic gap.

The paper offers economic rationales for why gender-specific thresholds may be consistent with physician rationality despite uniform guidelines: higher diagnostic uncertainty for boys justifies lower thresholds under Bayesian updating; hyperactive/impulsive symptoms predominant in boys impose larger classroom externalities (Aizer, 2008); and female patients show higher rates of internalizing co-morbidities (anxiety, depression) that may reduce the marginal benefit of an additional ADHD diagnosis. A type-specific threshold extension finds that for male patients the threshold for hyperactive/impulsive symptoms is significantly lower than for inattentive symptoms, consistent with salience of externally disruptive behaviors. These rationalizations do not vindicate the gap as fully guideline-consistent, but suggest physicians may be responding to real heterogeneity in external costs and co-morbidity patterns.

Q: What is the main research question and why is ADHD a useful setting? A: The paper asks what mechanisms produce the 2.3:1 male-to-female ADHD diagnostic difference: true symptom prevalence, patient utilization costs, or physician decision-making. ADHD is well-suited because (1) clinical guidelines (DSM-V) are explicitly gender-neutral and require the same symptom count threshold regardless of sex; (2) diagnosis is based on subjective behavioral assessment rather than objective testing, creating substantial physician discretion; and (3) both missed and excess diagnosis carry meaningful costs — missed diagnosis limits educational accommodations; excess diagnosis exposes children to Schedule II controlled substances.

Q: What data does the paper use and what are the key descriptive facts? A: The data are de-identified electronic health records from a large Arizona healthcare system, 2014–2017, covering 36,193 encounters for 11,070 pediatric patients aged 5 and above. Overall ADHD diagnosis rate is 5.2%, with males at 7.2% and females at 3.1%, a 2.32:1 ratio that matches national levels. Approximately 49.5% of the sample is Hispanic, which the author notes contributes to a below-national-average overall diagnosis rate. The gender diagnostic gap persists even after controlling for demographics, general healthcare utilization, and mental health utilization in reduced-form regressions.

Q: How does the paper construct the behavioral assessment indicator (Qi) and the ADHD match signal (xi)? A: Qi is constructed using a random forest classifier trained on doctor notes from appointments where assessment status is known with near-certainty (ADHD diagnosis or DSM-V comorbid diagnosis = positive; non-mental-health diagnosis code for patients with no mental health history = negative). The classifier uses 41 features including note length and top-20 word frequencies for each label class. xi is constructed via an adjusted Bag-of-Words cosine similarity between each patient’s combined behavioral assessment notes and the DSM-V symptom list, separately for inattentive and hyperactive/impulsive sub-types, taking xi = max{xi1, xi2}. The average xi is 0.319 (males 0.326, females 0.311) in the behavioral assessment subsample.

Q: What is the identification strategy for recovering population mean ADHD risk (μ)? A: Because xi is observed only for endogenously selected patients, the observed sample mean overestimates population mean risk. The author uses regression-adjusted referral rates of each patient’s initial primary care provider (IPCP) as a quasi-exogenous cost-shifter satisfying (a) relevance — IPCP referral intensity lowers patient scheduling costs — and (b) independence from patient ADHD risk vi, since IPCPs are typically chosen before behavioral symptoms develop and only 28% of IPCPs in the sample ever diagnose ADHD themselves. Population mean risk is then recovered by extrapolating the relationship between IPCP referral propensity and average observed xi to propensity = 1, following Arnold et al. (2022). The maximum observed IPCP referral propensity is only about 0.75, so the estimate requires extrapolation beyond the observed support.
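The extrapolation step can be sketched in a few lines. This is a minimal illustration only: it assumes a linear relationship between IPCP referral propensity and the average observed signal, and the data points and function names (`fit_line`, `extrapolate_to_full_referral`) are hypothetical rather than taken from the paper.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def extrapolate_to_full_referral(propensity, mean_signal):
    """Fitted average signal at propensity = 1 (every patient assessed)."""
    intercept, slope = fit_line(propensity, mean_signal)
    return intercept + slope * 1.0


# Hypothetical IPCP-level data: observed propensities stop at 0.75,
# so the value at 1.0 lies outside the observed support.
propensity = [0.15, 0.30, 0.45, 0.60, 0.75]
mean_signal = [0.40, 0.37, 0.34, 0.31, 0.28]

mu_hat = extrapolate_to_full_referral(propensity, mean_signal)
print(round(mu_hat, 3))
```

In this made-up example the average signal falls as referral propensity rises, so the extrapolated population mean lies below every observed group mean, which is the direction of selection bias described above.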

Q: What are the estimated structural parameters and what do they imply? A: Mean ADHD risk is μm = 0.290 vs. μf = 0.262 — males have modestly higher underlying risk. Mean utilization costs are cm = 0.116 vs. cf = 0.109 — nearly identical across genders. Signal quality (diagnostic certainty) is lower for males: ρm = 0.479 vs. ρf = 0.552, indicating physicians face more diagnostic uncertainty when assessing boys. Most importantly, diagnostic thresholds are lower for males: τm = 0.257 vs. τf = 0.312, meaning physicians diagnose ADHD in boys at a lower required posterior risk level, consistent with viewing missed diagnosis as relatively more costly for male patients.

Q: How much of the 2.32:1 diagnostic gap can be attributed to each mechanism? A: Counterfactual simulations decompose the gap as follows: differences in underlying ADHD risk distribution account for approximately 20–25% of the diagnostic difference; differences in selection into behavioral assessments (utilization costs operating through assessment rates) account for approximately 20%; and physician decision-making differences account for the remaining majority, approximately 55–60%. Within physician factors, differences in diagnostic thresholds (τm < τf) are the single largest contributor, explaining roughly two-thirds of the overall male/female diagnostic gap.

Q: What do the type-specific threshold estimates reveal? A: When the baseline model is extended to allow separate diagnostic thresholds for inattentive vs. hyperactive/impulsive symptom sub-types, male patients show significantly lower thresholds for hyperactive/impulsive symptoms relative to inattentive symptoms (τ^HI_m < τ^Inatt_m). This is consistent with the hypothesis that more externally salient and disruptive symptoms carry larger classroom externalities, which physicians may implicitly factor into diagnosis decisions (following Aizer, 2008). For female patients, the threshold differences across symptom types are smaller and less statistically significant.

Q: What economic rationales does the paper offer for gender-specific diagnostic thresholds despite uniform guidelines? A: Three mechanisms are identified. First, higher diagnostic uncertainty for males (lower ρm) implies that under symmetric costs, Bayesian-rational physicians should set lower thresholds when the signal is noisier — this alone partially rationalizes the threshold gap. Second, hyperactive/impulsive symptoms predominant in boys impose greater externalities on classroom peers (Aizer, 2008), increasing the social benefit of diagnosis for boys on the margin. Third, females show substantially higher rates of co-morbid internalizing conditions (anxiety, depression) whose treatment may mitigate ADHD-related behaviors or whose interaction with stimulant medication makes the marginal ADHD diagnosis less beneficial for girls (Currie et al., 2014). These factors together suggest physicians may be responding to genuine heterogeneity in net diagnosis benefits, even if their behavior deviates from gender-neutral clinical guidelines.

Q: What share of the 2.3:1 national diagnostic gap is consistent with genuine symptom prevalence differences? A: Simulations indicate that only about 20–25% of the 2.32:1 male/female diagnostic difference can be explained by the underlying difference in ADHD risk distributions. The majority — roughly 75–80% — reflects factors beyond true prevalence: selection into care and, most substantially, physician decision-making differences including both signal quality and diagnostic thresholds.

Q: What are the policy implications? A: The findings suggest that targeted interventions in physician awareness and clinical training are likely more effective than generic awareness campaigns, since the dominant driver of the diagnostic gap is physician threshold-setting rather than symptom prevalence. Structured decision support tools or updated training that make physicians aware of gender-specific diagnostic patterns could reduce medically unwarranted diagnostic differences. Policies targeting patient-side access barriers (the ~20% explained by selection) remain relevant but secondary. The roughly 20–25% of the gap attributable to genuine symptom prevalence differences is, by construction, guideline-consistent and should not be targeted for elimination.

Q: What are the methodological contributions? A: The paper makes three methodological contributions. First, it develops a structural model of mental health diagnosis that explicitly incorporates endogenous patient selection, a feature absent from standard physician decision-making models and one that is shown to be empirically important. Second, it applies machine learning and NLP to clinical doctor note text to construct key unobserved clinical variables (behavioral assessment indicator and ADHD match signal) that are unavailable as structured data in EHRs. Third, the identification of population mean health risk uses a quasi-exogenous variation approach (IPCP referral rates) analogous to Arnold et al. (2022)’s method for measuring racial discrimination in bail decisions, adapted here to a continuous health risk setting with endogenous selection.

Diagnostic threshold (τ_θ): The gender-specific posterior ADHD risk level above which a physician chooses to diagnose ADHD. Set ex-ante, it reflects the physician’s perceived tradeoff between the costs of over-diagnosis (misdiagnosis) and under-diagnosis (missed diagnosis). A lower threshold implies the physician views missed diagnosis as relatively more costly for that patient group. By construction, uniform clinical guidelines imply a single threshold independent of patient gender.

ADHD match signal (x_i): A physician-observed, noisy signal of a patient’s true latent ADHD risk (v_i), observed only conditional on the patient receiving a behavioral assessment. In estimation, it is proxied via a cosine similarity measure between the patient’s aggregated clinical doctor note text and the DSM-V symptom list, constructed separately for inattentive and hyperactive/impulsive sub-types.

Signal quality / diagnostic uncertainty (ρ_θ): The correlation between the physician’s observed ADHD match signal and the patient’s true ADHD risk. Higher ρ means the physician’s signal is more informative and diagnostic uncertainty is lower. In the Bayesian updating framework, higher ρ implies the physician places more weight on the observed signal relative to the prior.
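The weighting described here can be made concrete with a small sketch. It assumes, purely for illustration, that latent risk and the signal are bivariate normal with a common mean and a common variance, in which case the posterior mean is (1 − ρ)·prior + ρ·signal. The paper's exact posterior may differ, and the signal value below is hypothetical; only the μ and ρ values come from the estimates reported above.

```python
def posterior_mean(signal, prior_mean, rho):
    """E[v | x] when (v, x) are bivariate normal with a common mean,
    a common variance, and correlation rho (an assumed setup)."""
    return (1.0 - rho) * prior_mean + rho * signal


x = 0.40  # hypothetical match signal, identical for both patients

# Reported estimates: mu_m = 0.290, rho_m = 0.479; mu_f = 0.262, rho_f = 0.552.
male = posterior_mean(x, prior_mean=0.290, rho=0.479)
female = posterior_mean(x, prior_mean=0.262, rho=0.552)
print(round(male, 4), round(female, 4))
```

With ρ = 0 the posterior mean collapses to the prior mean, and with ρ = 1 it equals the signal, which is the sense in which higher ρ shifts weight from the prior to the signal.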

Mental healthcare utilization cost (c_i): The composite of all patient/caregiver factors that affect the decision to schedule a behavioral assessment net of child symptom level. Includes non-monetary barriers such as time constraints, distance, stigma, and information from primary care providers during wellness visits; does not include monetary out-of-pocket costs since insurance typically covers behavioral assessments.

Initial Primary Care Provider (IPCP) referral rate: The regression-adjusted share of a given PCP’s patients who ultimately receive a behavioral assessment at some point in the sample. Used as a quasi-exogenous cost-shifter that influences patient scheduling costs without being correlated with patient ADHD risk, enabling identification of population mean ADHD risk via extrapolation.

Latent ADHD risk (v_i): An unobserved continuous measure of a child’s underlying ADHD-related behavioral symptoms, drawn from a gender-specific normal distribution N(μ_θ, σ²_θ). A child’s true ADHD status is S_i = 1(v_i > v̄), where v̄ is the DSM-V minimum symptom threshold, defined identically for boys and girls.

Adjusted Bag-of-Words (BOW) cosine similarity: The NLP method used to construct the ADHD match signal proxy. Patient notes are tokenized into uni-grams and bi-grams after preprocessing (spell check, abbreviation replacement, part-of-speech tagging, synonym replacement), and tf-idf weighted. The cosine similarity between the resulting document vector and the DSM-V symptom text vector is computed separately for each ADHD sub-type and rescaled to [0,1].
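A stripped-down version of this pipeline might look as follows. It is a sketch under stated assumptions, not the paper's implementation: the preprocessing steps (spell check, abbreviation and synonym replacement, part-of-speech tagging) and the paper's specific adjustment are omitted, the smoothed idf formula is one common convention, and the note and symptom texts are invented for illustration.

```python
import math
import re
from collections import Counter


def ngrams(text):
    """Lowercase, keep alphabetic tokens, return uni-grams plus bi-grams."""
    tokens = re.findall(r"[a-z]+", text.lower())
    bigrams = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams


def tfidf_vectors(docs):
    """One {term: weight} dict per document, using a smoothed idf."""
    counts = [Counter(ngrams(d)) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter(term for c in counts for term in c)
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0
           for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical texts standing in for a doctor note and two symptom lists.
note = ("Mother reports the child is easily distracted and forgetful; "
        "teacher says he fidgets and interrupts others.")
inattentive = ("often fails to pay attention, easily distracted, "
               "forgetful in daily activities")
hyperactive = ("often fidgets, leaves seat, talks excessively, "
               "interrupts others")

vecs = tfidf_vectors([note, inattentive, hyperactive])
x1 = cosine(vecs[0], vecs[1])  # inattentive sub-type score
x2 = cosine(vecs[0], vecs[2])  # hyperactive/impulsive sub-type score
x = max(x1, x2)                # overall match signal
print(round(x1, 3), round(x2, 3), round(x, 3))
```

Because tf-idf weights are non-negative, the cosine similarity already falls in [0, 1] here; any further rescaling used in the paper is not reproduced.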

Optimal Decision Rules When Payoffs are Partially Identified

This paper derives asymptotically optimal statistical decision rules for discrete choice problems when the payoffs associated with some choices are only partially identified. The research question is: how should a decision maker who can bound but not point-identify a payoff-relevant parameter θ use data to make optimal policy choices?

The framework separates two parameter types. The reduced-form parameter µ is point-identified and can be estimated from data. The structural parameter θ — such as the average treatment effect (ATE) in a target population — is set-identified, meaning only that θ ∈ Θ0(µ) can be established, where the identified set is indexed by µ. The decision maker confronts both ambiguity (arising from partial identification of θ given µ) and statistical uncertainty (µ must be estimated).

The authors propose a hybrid optimality criterion that applies minimax reasoning to the partially-identified parameter θ — choosing actions that minimize maximum risk over Θ0(µ) — while applying average (integrated) risk minimization over µ, reflecting the asymmetric nature of the two identification problems. This asymmetric treatment follows the generalized Bayes-minimax principle of Hurwicz (1951).

The optimal decision rule is implemented by computing, for each action, the maximum risk (or regret) over θ ∈ Θ0(µ) conditional on µ, then averaging this maximum risk across either (i) a bootstrap distribution for an efficient estimator µ̂, (ii) a posterior distribution for µ in parametric models, or (iii) a quasi-posterior based on a limited-information criterion in semiparametric models. The optimal action is whichever choice has the smallest average maximum risk.
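The averaging step can be sketched for a binary treatment decision. This is an illustration under several assumptions that are not taken from the paper: the bounds take the form µk ∓ C·dk across studies (the upper-bound form is assumed by symmetry with the lower bound), the averaging distribution is an independent Gaussian around µ̂, and all numbers and function names are hypothetical.

```python
import random


def bounds(mu, dist, lipschitz):
    """Intersection bounds: max of study lower bounds, min of upper bounds."""
    lower = max(m - lipschitz * d for m, d in zip(mu, dist))
    upper = min(m + lipschitz * d for m, d in zip(mu, dist))
    return lower, upper


def max_regret(action, mu, dist, lipschitz):
    """Worst-case regret over the identified set for a binary action."""
    lower, upper = bounds(mu, dist, lipschitz)
    # Not treating (0) is worst when the effect sits at its upper bound;
    # treating (1) is worst when the effect sits at its lower bound.
    return max(upper, 0.0) if action == 0 else max(-lower, 0.0)


def plug_in_action(mu_hat, dist, lipschitz):
    """Minimize maximum regret evaluated at the point estimate only."""
    # Listing 1 first makes exact ties resolve in favor of treatment.
    return min((1, 0), key=lambda a: max_regret(a, mu_hat, dist, lipschitz))


def averaged_action(mu_hat, se, dist, lipschitz, n_draws=20000, seed=0):
    """Minimize maximum regret averaged over Gaussian draws around mu_hat."""
    rng = random.Random(seed)
    totals = {0: 0.0, 1: 0.0}
    for _ in range(n_draws):
        draw = [rng.gauss(m, s) for m, s in zip(mu_hat, se)]
        for a in (0, 1):
            totals[a] += max_regret(a, draw, dist, lipschitz)
    avg = {a: t / n_draws for a, t in totals.items()}
    return min((1, 0), key=lambda a: avg[a]), avg


# Hypothetical inputs: two studies whose lower-bound terms nearly tie.
mu_hat = [-0.005, 0.10]
se = [0.034, 0.034]
dist = [1.26, 1.72]
lipschitz = 0.25

plug_in = plug_in_action(mu_hat, dist, lipschitz)
averaged, avg_risk = averaged_action(mu_hat, se, dist, lipschitz)
print("plug-in action:", plug_in)
print("averaged action:", averaged, avg_risk)
```

With these made-up inputs the plug-in rule should decline treatment while the averaged rule should recommend it, since averaging over draws raises the expected value of a maximum of two nearly tied terms.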

A central theoretical result (Theorems 1 and 4) establishes formal asymptotic optimality for both parametric and semiparametric settings: Bayes and quasi-Bayes decisions with any prior whose density is positive, bounded, and continuous are asymptotically equivalent and optimal. Critically, the optimality of these rules is asymptotically independent of the choice of prior for µ. The authors also establish a necessity result (Theorems 2 and 5): any decision rule not asymptotically equivalent to the Bayes or bootstrap rule is strictly sub-optimal.

A key finding is that “plug-in” rules, which substitute an efficient point estimate µ̂ directly into the oracle decision rule, can be sub-optimal. This failure occurs generically under partial identification because the maximum risk function R(d,µ) is typically only directionally differentiable (not fully differentiable) in µ, owing to max and min operators in intersection bounds, linear program value functions, or other bound constructions. When full differentiability holds, Corollary 1 confirms plug-in rules are optimal; otherwise they can be sub-optimal. The empirical illustration demonstrates the practical consequence. In deciding whether to adopt a job-training program for German male youths, based on 14 RCT studies from Card, Kluve, and Weber (2017), the optimal rule recommends treatment (average quasi-posterior robust welfare contrast b̄n > 0) while the plug-in rule recommends against treatment (plug-in value b(µ̂) < 0). The largest lower-bound term µ̂k − C‖x0 − xk‖ is −0.3190 (US study) and the second-largest is −0.3298 (Brazilian study). Because the gap between these two values (0.0108) is small relative to the average standard error of 0.034 across studies, the lower bound behaves like the maximum of two near-tied Gaussians and its distribution is right-skewed, pushing b̄n positive even though b(µ̂) is negative.

The paper extends optimality theory to semiparametric models via a least favorable parametric submodel, introduces the concept of σ-optimality for cases where the average maximum risk criterion is infinite (relevant when the dimension K of µ exceeds 1), and provides detailed implementation guides for treatment assignment under intersection bounds, IV-like estimands, and non-separable panel data, as well as for optimal pricing decisions where revealed-preference demand theory bounds counterfactual demand responses via linear programming.

Scope conditions: optimality results apply to discrete action spaces, require efficient estimation of µ, require the identified set Θ0(µ) to be known as a set-valued mapping, and assume no “first-order ties” (the oracle decision is unique at µ0). The asymptotic framework is local, mimicking the finite-sample problem where µ is not known with certainty.

Q: What is the core decision problem this paper addresses?

A: A decision maker must choose from a finite set of actions D = {0, 1, …, D}. Payoffs depend on a structural parameter θ that is only set-identified — the data can establish θ ∈ Θ0(µ) but not pin down θ exactly. The reduced-form parameter µ is point-identified and estimated from data. The decision maker faces both ambiguity (which θ in Θ0(µ) is true?) and sampling uncertainty (what is µ?). The paper asks how to construct decision rules that are optimal in large samples under this dual uncertainty.

Q: What is the proposed optimality criterion, and why is it asymmetric across parameters?

A: The criterion applies minimax reasoning to the partially-identified θ — the maximum risk over Θ0(µ) given µ is the relevant loss — and integrates this maximum risk over µ using Lebesgue measure on local perturbations h = √n(µ − µ0) of a fixed µ0. The asymmetry reflects the fact that θ is not updated by the data (the prior for θ is not identified), while µ can be learned efficiently from the data. Full minimax over both (θ, µ) is rarely tractable even for simple binary treatment problems; the asymmetric approach yields tractable optimal rules for a broad empirically relevant class of settings.

Q: What are the Bayes, bootstrap, and quasi-Bayes implementations of the optimal rule?

A: In all three cases, the decision maker computes R̄n(d), the average maximum risk for action d, and chooses the action that minimizes it. The Bayes rule averages R(d, µ) over the posterior πn(µ|Xn) for µ, obtained via Bayes’ theorem from a prior π on M. The bootstrap rule averages R(d, µ̂*) over bootstrap redraws µ̂* of the efficient estimator µ̂. The quasi-Bayes rule (for semiparametric models) uses a limited-information quasi-posterior that combines the Gaussian quasi-likelihood N(µ̂, (nÎ)^{−1}) with a prior for µ. All three implementations are asymptotically equivalent and optimal under the regularity conditions of Theorems 1 and 4.

Q: What do Theorems 1 and 2 (and their semiparametric analogues Theorems 4 and 5) establish?

A: Theorem 1 establishes sufficiency: Bayes decisions with any prior in the class Π are asymptotically equivalent to each other and are optimal; any rule asymptotically equivalent to such a Bayes decision is also optimal. Theorem 2 establishes necessity: any rule in the admissible class D that is not asymptotically equivalent to the Bayes rule has strictly higher average excess risk at any µ0 where asymptotic equivalence fails. Together, these theorems fully characterize the class of asymptotically optimal rules and show that the Bayes/bootstrap class is not merely sufficient but also necessary for optimality.

Q: When are plug-in rules sub-optimal, and when are they optimal?

A: Plug-in rules substitute an efficient point estimate µ̂ directly into the oracle decision δo(µ̂). If R(d, µ) is fully differentiable at µ0 for all oracle-optimal actions d, then the directional derivative is linear and plug-in and Bayes rules are asymptotically equivalent; Corollary 1 confirms plug-in rules are then optimal. However, under partial identification, max and min operators in bound constructions — intersection bounds, linear program value functions, revealed-preference bounds — generically induce only directional (non-linear) differentiability of R(d, µ). In these cases asymptotic equivalence can fail, and Theorem 2 implies plug-in rules are sub-optimal. Manski (2021, 2023) documents poor finite-sample performance of plug-in rules numerically; the authors’ necessity result provides a general theoretical explanation under the asymptotic average risk criterion.

Q: How does the treatment assignment empirical illustration demonstrate the difference between optimal and plug-in rules?

A: Using data from Ishihara and Kitagawa (2021) with K = 14 RCT studies from Card, Kluve, and Weber (2017) and Lipschitz constant C = 0.25, the decision is whether to adopt a job-training program for German male youths or female youths in 2010 (GDP growth 3.48%, unemployment 9.45%). For male youths, the largest lower bound value µ̂k − C‖x0 − xk‖ is −0.3190 (US study) and the second-largest is −0.3298 (Brazilian study), separated by only 0.0108 against an average standard error of 0.034 across studies, so the lower bound distribution is right-skewed (maximum of two near-tied Gaussians). This right-skew pushes the quasi-posterior mean b̄n positive, yielding a treatment recommendation, while the plug-in value b(µ̂) is negative, yielding a non-treatment recommendation — a concrete reversal of the policy decision. For female youths, the minima and maxima are better separated, the distribution is near-Gaussian, and b̄n ≈ b(µ̂), so both rules agree on treatment.
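The effect of a near-tie on the lower bound can be checked with the standard closed form for the mean of the maximum of two independent normal variables. The sketch below plugs in the two lower-bound values quoted above; treating the two terms as independent and giving each a standard deviation of 0.034 (the reported average standard error) are simplifying assumptions, not figures from the paper.

```python
import math


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def mean_of_max(m1, s1, m2, s2):
    """E[max(X1, X2)] for independent X1 ~ N(m1, s1^2), X2 ~ N(m2, s2^2)."""
    theta = math.sqrt(s1 * s1 + s2 * s2)
    a = (m1 - m2) / theta
    return m1 * normal_cdf(a) + m2 * normal_cdf(-a) + theta * normal_pdf(a)


plug_in_lower = max(-0.3190, -0.3298)  # max of the point estimates
averaged_lower = mean_of_max(-0.3190, 0.034, -0.3298, 0.034)
print(plug_in_lower, round(averaged_lower, 4))
```

Under these assumptions the averaged lower bound sits above the plug-in lower bound, which is the direction of the shift that moves b̄n above b(µ̂) in the male-youth case.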

Q: What are intersection bounds and why do they generate directional differentiability?

A: Intersection bounds arise when the ATE is bounded in K separate observational studies by lower bounds bL,k(µk) and upper bounds bU,k(µk). The combined identified set uses bL(µ) = max_{1≤k≤K} bL,k(µk) and bU(µ) = min_{1≤k≤K} bU,k(µk). Even if each component bound is smooth in µk, the max and min operators make bL and bU only directionally differentiable (not fully differentiable) in µ. The directional derivative is positively homogeneous of degree one but non-linear, which is the property that drives the wedge between Bayes and plug-in rules.

Q: How does the paper extend to semiparametric models, and what technical tool does it use?

A: In semiparametric models, the data distribution depends on both µ ∈ R^K and an infinite-dimensional nuisance parameter η. Integrating over local perturbations of η as well as µ raises measure-theoretic problems in infinite-dimensional spaces. The authors instead restrict attention to local perturbations of µ0 within a least favorable parametric submodel, which is the direction that makes the problem hardest. The quasi-posterior, which combines the Gaussian quasi-likelihood N(µ̂, (nÎ)^{−1}) with a prior for µ, is then used as the averaging distribution. Theorem 4 establishes optimality and Theorem 5 establishes necessity under these semiparametric conditions, mirroring the parametric Theorems 1 and 2.

Q: What is σ-optimality and why is it needed?

A: When the dimension K of µ exceeds 1, the integrated average excess risk criterion R({δn}; µ0) — which integrates over Lebesgue measure on R^K — may be infinite for all decision sequences in D, making the criterion uninformative. σ-optimality approximates the improper Lebesgue prior on h by a sequence of proper priors indexed by σ, and requires that the decision rule minimize the resulting criterion for all σ. Theorem 3 shows that the limiting behavior of σ-optimal rules coincides with that of the Bayes rule δ*n(·; π), preserving the practical implementation.

Q: How is the optimal pricing application structured and what role do revealed-preference bounds play?

A: A monopolist observes repeated cross-sections of individual demands across B budget sets and must choose a price vector from D = O ∪ C, where O contains observed prices and C contains counterfactual prices. For observed prices, average demand is identified; for counterfactual prices, only bounds are available. Following Kitamura and Stoye (2019), the space of goods is partitioned into GARP-compatible regions, and sharp bounds on counterfactual demand are computed by solving linear programs over the mass allocated to each region subject to GARP consistency constraints. The reduced-form parameter µ collects empirical choice probabilities across observed budget-region cells, estimated consistently by sample frequencies. The optimal pricing decision averages the linear-program bound solutions across quasi-posterior draws of µ.

Q: How does this approach relate to minimax and conditional Γ-minimax approaches?

A: Full minimax over (θ, µ) requires strong distributional assumptions and tractable finite-sample distributions; the authors note that no minimax treatment rule exists even for binary treatment with binary outcomes and estimated bounds. Conditional Γ-minimax (DasGupta and Studden, 1989; Giacomini, Kitagawa, and Read, 2021) fixes a prior for µ and takes minimax over the set of priors for θ conditional on µ; this is closely related to the authors’ approach but can be conservative when the marginal prior for µ varies. The authors’ framework fixes the marginal prior for µ and takes minimax over θ ∈ Θ0(µ) conditional on µ, which is shown to arise as the equilibrium of a two-player zero-sum game where adversarial nature chooses a prior for θ ∈ Θ0(µ) conditional on µ and the available data for µ.

Q: What is the technical contribution regarding directionally differentiable functions?

A: Hirano and Porter (2009) derived asymptotic optimality for treatment rules under fully differentiable welfare contrasts. This paper extends that theory to settings with directional (but not full) differentiability — a generic feature whenever bounds involve max/min operators or linear program values. The key technical building block is the asymptotic distribution of the quasi-posterior mean of directionally differentiable functions (Propositions 2 and 3 in Appendix C). While Kitagawa, Montiel Olea, Payne, and Velez (2020) characterized the asymptotic behavior of the posterior distribution of such functions, this paper instead characterizes the frequentist distribution of the posterior mean — a distinct and novel contribution to the literature on asymptotics for non-smooth functions (Dümbgen, 1993; Fang and Santos, 2019).

Q: What are the key scope conditions and limitations of the optimality results?

A: The action space D must be finite and discrete (continuous pricing must be approximated by a grid of whole-currency units, as noted in the introduction). The identified set mapping Θ0(·) must be known. Efficient estimation of µ is required, along with a consistent estimator of its asymptotic variance for quasi-Bayes implementation. The optimality criterion assumes “no first-order ties”: the oracle decision must be unique at µ0. The framework is asymptotic (local perturbations around a fixed µ0), and the theory is designed for settings where deriving exact finite-sample optimal rules is intractable. The results do not cover cases in which θ affects the data distribution: partial identification enters only through the payoffs, and µ itself remains point-identified.

Partially-identified parameter (θ): A structural parameter — such as the ATE in a target population — about which the data can establish only set membership θ ∈ Θ0(µ), not a point value. The identified set Θ0(µ) is indexed by the point-identified reduced-form parameter µ.

Oracle decision (δo(µ)): The infeasible first-best decision that minimizes maximum risk over the identified set Θ0(µ) for a known value of µ. It serves as the benchmark against which practical rules are evaluated; any data-dependent rule can only do weakly worse.

Maximum risk (R(d, µ)): The supremum of risk r(d, θ, µ) = Eθ[l(d, Y, θ, µ)] over all θ ∈ Θ0(µ) conditional on µ. Under the regret criterion for binary treatment, R(0, µ) = (bU(µ))_+ and R(1, µ) = −(bL(µ))_−, where (x)_+ = max{x, 0} and (x)_− = min{x, 0}.

Robust welfare contrast (b(µ)): In the treatment assignment application, b(µ) = (bU(µ))_+ + (bL(µ))_−, which equals R(0, µ) − R(1, µ) with (x)_+ and (x)_− the positive and negative parts defined under maximum risk. Its sign determines the oracle decision: treat if b(µ) ≥ 0. The optimal rule replaces b(µ) with its quasi-posterior mean b̄n.

Directional differentiability: A function f : M → R^k is directionally differentiable at µ0 if limits of (f(µ0 + tn hn) − f(µ0))/tn exist for all sequences tn ↓ 0 and hn → h, yielding a directional derivative ḟµ0[·] that is positively homogeneous but not necessarily linear. Max/min operators and linear program value functions are generically only directionally differentiable, not fully differentiable. This property is what causes plug-in rules to fail.
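A numerical check makes the non-linearity visible. The sketch below uses the max of two components as a stand-in for an intersection-bound lower limit; the function name and evaluation points are illustrative choices, not taken from the paper.

```python
def directional_derivative(f, mu0, h, t=1e-6):
    """One-sided finite-difference approximation of the derivative of f
    at mu0 in direction h."""
    shifted = [m + t * hi for m, hi in zip(mu0, h)]
    return (f(shifted) - f(mu0)) / t


h = [1.0, -1.0]
neg_h = [-hi for hi in h]

# At a tie the two components of mu0 are equal, so the max has a kink.
tie = [0.0, 0.0]
d_plus = directional_derivative(max, tie, h)
d_minus = directional_derivative(max, tie, neg_h)
d_double = directional_derivative(max, tie, [2.0 * hi for hi in h])

# Away from a tie the first component always attains the max locally.
no_tie = [1.0, 0.0]
s_plus = directional_derivative(max, no_tie, h)
s_minus = directional_derivative(max, no_tie, neg_h)

print(d_plus, d_minus, d_double)
print(s_plus, s_minus)
```

A linear derivative would satisfy f'(−h) = −f'(h). At the tie both directions give roughly +1, so linearity fails even though doubling h doubles the derivative (positive homogeneity). Away from the tie the two values are roughly +1 and −1, as full differentiability requires.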

Quasi-posterior: In semiparametric models, a posterior-like distribution for µ formed by combining a limited-information Gaussian quasi-likelihood N(µ̂, (nÎ)^{−1}) with a prior π, yielding πn(µ|Xn) ∝ exp(−½(µ − µ̂)^T (nÎ)(µ − µ̂))π(µ). Used in place of a full Bayesian posterior when the exact likelihood of the data-generating process is unavailable.

σ-optimality: An optimality concept that replaces the improper Lebesgue prior on local perturbations h ∈ R^K with a sequence of proper priors indexed by σ, used when the average excess risk criterion is infinite for K > 1. Theorem 3 establishes that the σ-optimal decision rule converges to the Bayes rule as σ → ∞.

Plug-in rule (δplug_n): A decision rule formed by substituting an efficient point estimate µ̂ directly into the oracle decision: δplug_n = δo(µ̂). Optimal when R(d, µ) is fully differentiable (Corollary 1), but generically sub-optimal under partial identification because directional differentiability of R(d, µ) breaks the asymptotic equivalence between the plug-in and Bayes rules.