D84 | Macro Paper Warehouse

Biased expectations and labor market outcomes: Evidence from German survey data and implications for the East–West wage gap

Layer 1 — Overview

Research question. The paper asks two questions: (1) How do workers’ biased expectations about job finding and job separation shape the labor market equilibrium and wages? (2) Are differences in expectation biases across workers a quantitatively important driver of wage differentials, specifically the East–West German wage gap?

Data. The empirical analysis uses the German Socio-Economic Panel (SOEP), a nationally representative longitudinal survey of approximately 30,000 participants per wave. The working-age sample (ages 25–65) covers nine biennial survey waves from 1999 to 2015, yielding 67,772 observations for job separation expectations and 6,423 for job finding expectations. Perceived transition probabilities are reported on a 0–100 scale in steps of 10 percentage points. Actual (statistical) transition probabilities are constructed by estimating probit models that predict realized transitions within 24 months using a rich set of individual, job, and employer characteristics, and are rounded to the nearest decile for consistency with the survey scale.

Main empirical findings. Employed workers in Germany overestimate their job separation probability by 6.4 percentage points on average (perceived: 19.8%; actual: 13.3%), a pessimistic bias significant at the 1% level. Unemployed workers overestimate their job finding probability by 8.2 percentage points on average (perceived: 57.0%; actual: 48.8%), an optimistic bias also significant at the 1% level. The East–West divergence is striking. East German workers exhibit a pessimistic job separation bias of 12.1 percentage points, compared to only 4.7 percentage points in the West, despite broadly similar actual separation rates (15.1% vs. 12.8%). For job finding, West Germans overestimate their probability by 12.9 percentage points, while East Germans overestimate by only 2.0 percentage points — meaning East Germans are also substantially less optimistic about re-employment. These East–West differences survive controls for compositional differences and alternative definitions of job separation (dismissals only; selected reasons; spell-based) and job finding (including those out of the labor force). The biases are stable over the 1999–2015 sample period with no discernible trend. A cohort analysis shows that the excess pessimism in East Germany is concentrated among cohorts who were already in the labor market at the time of German reunification (born in the 1950s and 1960s), consistent with persistent effects of the communist GDR experience. Individuals do not systematically learn over time: mean changes in individual-level absolute deviations between consecutive waves are close to zero. Individual deviations between perceived and actual rates have statistically significant but quantitatively negligible predictive power for subsequent transitions (a 1 pp higher perceived job separation is associated with only a 0.001 pp higher realized separation rate), ruling out private information as a first-order explanation for the biases.

Model. The authors extend the Diamond–Mortensen–Pissarides (DMP) frictional labor market framework by (i) allowing workers to hold biased perceived transition rates (λw for job finding, σw for job separation) while firms have rational expectations, and (ii) introducing wage contracts of explicit length T periods after which parties re-bargain. Common knowledge of each party’s perceived values is assumed, and generalized Nash bargaining is applied. The contract length T is a key parameter: there exists a critical threshold T* such that a pessimistic job separation bias raises the equilibrium wage for T < T* (the continuation-value effect dominates) and lowers it for T ≥ T* (the within-contract discounting effect dominates). An optimistic job finding bias unambiguously raises the equilibrium wage by inflating the perceived value of unemployment and hence the reservation wage.

Quantitative results. The model is calibrated to East Germany. The job separation bias (∆σ = 0.0194) and job finding bias (∆λ = 0.0044) are set to SOEP-based estimates. The critical threshold implied by calibrated parameter values is T* = 10 quarters. The baseline contract length, constructed from the share of permanent (88%) and temporary (12%) contracts in SOEP and average remaining tenure until retirement, is T = 67 quarters (a lower bound). This exceeds T*, so the pessimistic separation bias depresses wages in the baseline. A counterfactual experiment assigns West German bias levels to East German workers, while holding all other parameters fixed. For the preferred calibration range (γ ∈ {0.35, 0.50}, T ∈ {67, 106, 159}), East German wages rise by 1.07 to 2.36 percent. This corresponds to a reduction in the conditional East–West German wage gap (23 percent) of 4.6 to 10.6 percent, and a reduction in the unconditional gap (30 percent) of 3.6 to 7.9 percent. Although wages rise, equilibrium unemployment increases by 0.70 to 1.01 percentage points, widening the already large East–West unemployment gap (approximately 7 percentage points). Net of the unemployment effect, expected lifetime income (computed at actual, unbiased transition rates) rises by 0.7 to 1.88 percent for East German workers under West German biases, implying an unambiguous welfare gain. Under a biennial calibration (robustness), wages increase by up to 3.3 percent and expected lifetime income rises by up to 2.23 percent.

Scope conditions. Results apply to a stationary environment (no aggregate fluctuations). Firms are assumed to have rational expectations; an extension shows results hold provided firm bias is smaller than worker bias. Workers are assumed homogeneous in their bias levels, and the model abstracts from learning. The quantitative magnitudes are sensitive to the workers’ bargaining power γ and the contract length T, both of which are subject to uncertainty in calibration.

Layer 2 — Q&A

Q1: How are actual (statistical) transition probabilities constructed, and why are probit-predicted probabilities preferred over realized sample means? A: Realized transition rates in the sample mix transitions for various idiosyncratic reasons that vary substantially across population groups, so raw sample means do not reflect the probability a given individual faces at interview time. The authors estimate probit models separately for job separation (employed sample) and job finding (unemployed sample), including a rich set of covariates — age, gender, education, tenure, firm size, unemployment experience, industry, survey year, and East Germany indicator, among others — and predict individual-level probabilities at the time of the interview. For consistency with the survey’s discrete response format, probit-predicted probabilities are rounded to the nearest decile (0%, 10%, …, 100%). The bias is computed as the individual-level difference between perceived and probit-predicted actual probabilities, averaged over the sample.
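
The rounding and averaging step described in this answer can be sketched in a few lines. This is a minimal illustration, not the authors' code: the probit stage is skipped, the predicted probabilities are made-up numbers, and the function names and the half-up rounding rule are assumptions.

```python
import math
from statistics import mean


def round_to_decile(prob):
    """Map a probability in [0, 1] to the nearest 10-point step on a 0-100 scale.

    Half-up rounding is an assumption; the summary only says "nearest decile".
    """
    return 10 * math.floor(prob * 10 + 0.5)


def mean_bias(perceived, predicted):
    """Average of (perceived - rounded predicted), in percentage points.

    perceived: survey answers on the 0-100 scale (steps of 10).
    predicted: model-predicted probabilities in [0, 1] (hypothetical here).
    """
    return mean(p - round_to_decile(q) for p, q in zip(perceived, predicted))


# Hypothetical respondents: perceived separation risk vs. predicted probability.
perceived = [20, 30, 10, 0, 40]
predicted = [0.13, 0.18, 0.12, 0.04, 0.26]

print(mean_bias(perceived, predicted))
```

A positive value for employed workers corresponds to a pessimistic separation bias; a positive value for unemployed workers corresponds to an optimistic finding bias.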

Q2: What is the magnitude and direction of the aggregate expectation biases in Germany? A: Employed workers overestimate job separation by 6.4 percentage points on average (perceived 19.8% vs. actual 13.3%), a pessimistic bias significant at the 1% level. Unemployed workers overestimate job finding by 8.2 percentage points (perceived 57.0% vs. actual 48.8%), an optimistic bias also significant at the 1% level. Both biases are robust to alternative definitions of separation and finding, to trimming extreme responses (0% and 100% answers), and to adjusting for directional rounding.

Q3: How large are the East–West differences in expectation biases, and do they survive controls for compositional differences? A: East German workers exhibit a pessimistic job separation bias of 12.1 percentage points, more than 2.5 times the West German level of 4.7 percentage points, despite actual separation rates being broadly comparable (15.1% vs. 12.8%). For job finding, West Germans are optimistic by 12.9 percentage points while East Germans are optimistic by only 2.0 percentage points, a difference of 10.9 percentage points. The paper states these differences persist after accounting for compositional differences between regions, and are robust across all alternative definitions of job separation (Dismissals, Selected, Spell) and job finding (out of U or O). The table of robustness results (Table 2) confirms that in all specifications, the pessimistic separation bias is substantially larger in the East and the optimistic finding bias is substantially smaller.

Q4: What cohort analysis is conducted to explore the origins of greater East German pessimism? A: The authors conduct a regression of the individual-level bias on birth-cohort indicators, controlling for age, demographic, and economic characteristics. They find that the pessimistic job separation bias is most pronounced among cohorts born in the 1950s and 1960s — those who experienced adult working life in the communist GDR and lived through reunification — and is smaller for cohorts born before 1950 and substantially smaller for cohorts born after 1970. For job finding, the optimistic bias is comparably low among cohorts born in the 1960s and earlier, but rises significantly for later-born East German cohorts. This cohort pattern is consistent with a long-lasting “experience effect” of communist institutions and the reunification shock on beliefs, analogous to findings in the broader literature on the persistent effects of communism.

Q5: Is there evidence that individuals update their biased expectations over time? A: To assess learning, the authors use the panel dimension and compute for each individual in two consecutive survey waves the absolute value of the deviation between perceived and actual transition probabilities, then examine the change in this absolute deviation between waves. The histograms of individual-level changes show substantial dispersion but means close to zero in all four sub-groups (East/West, job separation/finding), indicating no systematic convergence of beliefs toward actual rates. Biases are also stable in the time-series dimension, with perceived and actual rates moving largely in parallel across survey waves from 1999 to 2015, leaving the aggregate bias level roughly constant.
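
The learning check can be illustrated with a short sketch. The panel layout, the function name, and the toy numbers below are assumptions made for illustration; only the statistic itself (the change in the absolute deviation between consecutive waves) comes from the description above.

```python
from statistics import mean


def abs_deviation_changes(panel, wave_gap=2):
    """Change in |perceived - actual| between consecutive survey waves.

    panel: dict mapping a person id to a list of (year, perceived, actual),
           with both probabilities on the 0-100 scale.
    wave_gap: years between consecutive waves (2 for biennial waves).
    Pairs of observations that are not exactly one wave apart are skipped.
    """
    changes = []
    for observations in panel.values():
        ordered = sorted(observations)
        for (y0, p0, a0), (y1, p1, a1) in zip(ordered, ordered[1:]):
            if y1 - y0 == wave_gap:
                changes.append(abs(p1 - a1) - abs(p0 - a0))
    return changes


# Hypothetical panel with three respondents.
panel = {
    "a": [(1999, 30, 10), (2001, 20, 10), (2003, 40, 10)],
    "b": [(1999, 10, 20), (2003, 10, 20)],  # missed a wave, so no valid pair
    "c": [(2001, 50, 20), (2003, 60, 20)],
}

changes = abs_deviation_changes(panel)
print(changes, mean(changes))
```

A negative change means the respondent moved closer to the actual rate. Systematic learning would show up as a clearly negative mean, whereas the paper reports means close to zero.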

Q6: How do the authors rule out private information as an alternative explanation for the biases? A: If biases reflected private information about idiosyncratic risk not captured by observable characteristics, individual-level deviations between perceived and actual rates should predict subsequent realized transitions. The authors add the individual-level deviation as an additional regressor in the probit transition models. The estimated coefficients are statistically significant and positive, but quantitatively negligible: a 1 percentage point higher expected job separation probability is associated with only a 0.001 percentage point higher realized separation probability, and a 1 percentage point higher expected job finding probability with a 0.002 percentage point higher realized finding probability. These magnitudes are too small to materially alter the interpretation of the biases as reflecting systematic expectation errors rather than private information.

Q7: What is the role of contract length T in the model, and what is the critical threshold T*? A: The wage contract length T determines which of two opposing effects of pessimistic job separation expectations dominates in bargaining. The first (negative wage) effect: a pessimistic worker discounts future wages within the current contract more heavily than the firm does, so the worker values the contract less and accepts a lower wage. The second (positive wage) effect: a pessimistic worker also discounts the continuation value of future contracts more heavily, making it less attractive to remain in the match, so the firm must offer a higher wage to retain the worker. For short contract lengths (T < T*), the second (positive) effect dominates, so the pessimistic bias raises wages. For long contracts (T ≥ T*), the first (negative) effect dominates, so the pessimistic bias depresses wages. The critical threshold T* is the smallest positive integer such that T*/λw(θ) < β times a weighted sum involving σw and T*. Using calibrated parameter values for East Germany, T* = 10 quarters (2.5 years). The baseline contract length is T = 67 quarters (approximately 16.8 years), well above T*, placing the economy in the regime where pessimism depresses wages.

Q8: How does the optimistic job finding bias affect equilibrium wages and unemployment? A: An optimistic job finding bias (λw > p(θ)) raises the perceived value of unemployment U because workers expect to escape unemployment sooner. A higher value of unemployment raises the worker’s outside option in bargaining, increases the reservation wage, and thereby pushes up the bargained wage. In general equilibrium, the job creation condition (which is unaffected by worker expectations) is unchanged, so the upward rotation of the wage curve reduces labor market tightness θ, raises equilibrium unemployment, and extends average unemployment duration. This comparative static holds unambiguously for any contract length T.
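
The link from lower tightness to higher unemployment can be made concrete with the textbook DMP flow-balance condition u = σ / (σ + p(θ)). The summary does not state this condition, so treat it as an assumption, along with the Cobb-Douglas form of p(θ) and every parameter value below.

```python
def job_finding_rate(theta, efficiency=0.5, elasticity=0.5):
    """Job finding rate p(theta) = A * theta**(1 - alpha).

    The Cobb-Douglas matching function and both defaults are hypothetical.
    """
    return efficiency * theta ** (1.0 - elasticity)


def steady_state_unemployment(theta, sep_rate, **matching):
    """Unemployment rate at which flows into and out of unemployment balance."""
    p = job_finding_rate(theta, **matching)
    return sep_rate / (sep_rate + p)


def expected_duration(theta, **matching):
    """Expected unemployment duration in periods, 1 / p(theta)."""
    return 1.0 / job_finding_rate(theta, **matching)


SEP_RATE = 0.03  # hypothetical actual separation rate per period

for theta in (1.0, 0.8):
    u = steady_state_unemployment(theta, SEP_RATE)
    print(theta, round(u, 4), round(expected_duration(theta), 2))
```

Because the flow condition uses actual transition rates, worker beliefs affect unemployment in this sketch only through their effect on tightness θ.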

Q9: What are the quantitative results of the counterfactual experiment assigning West German biases to East German workers? A: The counterfactual assigns West German bias levels (smaller pessimistic separation bias, larger optimistic finding bias) to East German workers while holding all other parameters at East German calibrated values. For the preferred calibration with γ ∈ {0.35, 0.50} and T ∈ {67, 106, 159}, wages in East Germany rise by 1.07 to 2.36 percent. This implies a reduction in the conditional East–West wage gap (23 percent) of 4.6 to 10.6 percent and a reduction in the unconditional gap (30 percent) of 3.6 to 7.9 percent. Equilibrium unemployment in East Germany rises by 0.70 to 1.01 percentage points as a side effect. Net of the unemployment effect, ex-ante unbiased expected lifetime income rises by 0.7 to 1.88 percent, confirming a positive welfare effect of reducing East German pessimism to West German levels. Under the biennial calibration robustness check, wage increases reach up to 3.3 percent, the conditional wage gap narrows by up to 11 percent, and lifetime income rises by up to 2.23 percent.

Q10: How is the bargaining power parameter γ calibrated and why does it matter for the results? A: The paper considers a range γ ∈ {0.35, 0.50, 0.65}, rather than a single calibrated value, because γ plays a crucial role in the sensitivity of wages to expectation biases. Lower bargaining power reduces the equilibrium wage directly; however, because lower wages spur job creation, the model requires a higher vacancy cost κ to match the empirical job finding rate, which in turn increases the elasticity of wages with respect to the bias (see the wage equation, which shows that the bias effect scales with κθ/p(θ)). The paper argues that γ = 0.65 is inconsistent with the empirical wage–bias relationship estimated in SOEP data (which is negative and about twice as negative in East Germany as in the West), while γ ∈ {0.35, 0.50} is consistent. Lower bargaining power is also argued to be realistic for East Germany given weaker union representation there relative to the West.

Q11: How does the empirical relationship between the job separation bias and wages serve as a model validation target? A: Using SOEP data, the authors regress log hourly wages on the individual-level difference between perceived and actual job separation rates, controlling for individual fixed effects and other covariates, and allow the slope to differ between East and West Germany. They find a statistically significant and negative relationship in both regions, with the effect approximately twice as large in East Germany as in the West. The estimate implies that if East German workers’ job separation pessimism were reduced to West German levels, hourly wages in the East would be about 1 percent higher. This empirical gradient is used as an external validation check — not a calibration target — to assess which combinations of (γ, T) in the model are quantitatively plausible.

Q12: What does the model predict about the general equilibrium effects on unemployment from reducing East German pessimism? A: Reducing East German pessimism — both the pessimistic separation bias and the low optimistic finding bias — shifts the wage curve upward in equilibrium. Because the job creation condition is unaffected by worker beliefs (firms have rational expectations), higher wages reduce the firm’s incentive to post vacancies, lowering labor market tightness θ. This leads to higher equilibrium unemployment and longer average unemployment duration. The counterfactual with West German biases implies that East German unemployment would rise by 0.70 to 1.01 percentage points, further widening the approximately 7 percentage point East–West unemployment gap. The authors note this is a welfare-relevant trade-off, but show that the wage gain dominates the unemployment cost in terms of expected lifetime income.

Q13: What robustness checks are performed on the quantitative results? A: The paper considers (i) a narrower definition of job separation (dismissals only) to match the most likely interpretation of the survey question; (ii) targeting the officially reported East German unemployment rate (14.5% average from the Federal Employment Agency) rather than the SOEP-implied rate of 8.6% as a calibration target; (iii) a biennial calibration frequency instead of quarterly. The main results — wage increases and narrowing of the wage gap — are quantitatively similar across these alternatives, with one exception: the biennial calibration yields substantially larger wage increases (up to 3.3%), a larger reduction in the conditional wage gap (up to 11%), and larger lifetime income gains (up to 2.23%).

Key Concepts

Expectation bias (job separation / job finding). In this paper, a bias in expectations is defined as a systematic average difference between an individual’s perceived transition probability and the actual (statistically predicted) transition probability for their demographic and job group. A pessimistic job separation bias means workers overestimate the probability of losing their job (σw > σ); an optimistic job finding bias means unemployed workers overestimate the probability of re-employment (λw > p(θ)). Biases are not attributed to private information but to systematic expectation errors.

Actual (statistical) transition probability. The paper defines actual transition probabilities not as raw sample transition rates but as individual-level predicted probabilities from probit models estimated on realized transitions within 24 months, conditional on a comprehensive set of individual, job, and employer characteristics observed at interview time. These are rounded to the nearest decile for comparability with the survey’s discrete response format.

Wage contract length (T). The contract length T is the number of periods for which a bargained wage is fixed before the match parties re-bargain. A job match consists of a sequence of consecutive wage contracts of length T. The paper departs from the standard DMP assumption of period-by-period bargaining (T = 1) and shows that T is central to how job separation expectations feed into the bargained wage. A permanent job approximates T → ∞.

Critical contract length (T*). A theoretically derived threshold: the pessimistic job separation bias raises equilibrium wages for contract lengths T < T* and depresses wages for T ≥ T*. Specifically, T* is the smallest positive integer such that T*/λw(θ) < β times a weighted sum involving β, σw, and T*. In the East German calibration, T* = 10 quarters.

Generalized Nash bargaining with common knowledge / agree to disagree. The model assumes that both the worker and the firm know each other’s perceived values of the job match and outside options and accept them as the basis for bargaining, even though they differ. Workers use their biased perceived transition rates to value employment and unemployment; firms use actual rates. There is no private information. The paper refers to this as workers and firms “agreeing to disagree.”

Ex-ante unbiased expected lifetime income (EI_{W,U}). A welfare measure defined as the present discounted value of income for an individual entering the economy, computed at actual (unbiased) job separation and job finding probabilities rather than at workers’ perceived (biased) rates. This measure captures the net welfare effect of changing expectation biases because it correctly accounts for actual employment transitions, even though the behavioral responses in equilibrium are driven by biased perceptions.
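
To illustrate what evaluating income at actual rates means, the sketch below solves a standard two-state pair of value equations, W = w + β[(1 − σ)W + σU] and U = b + β[pW + (1 − p)U], using actual transition rates. The timing convention, the income b while unemployed, and all numbers are assumptions. The paper's exact definition of EI_{W,U}, including how the two states are weighted for an entrant, is not given in this summary and is not reproduced here.

```python
def lifetime_values(wage, benefit, beta, sep_rate, find_rate):
    """Solve the two linear value equations for (W, U) by Cramer's rule.

    W: discounted income of a currently employed worker.
    U: discounted income of a currently unemployed worker.
    sep_rate and find_rate are the ACTUAL per-period transition rates.
    """
    # Rearranged system: a11*W + a12*U = wage ; a21*W + a22*U = benefit
    a11 = 1.0 - beta * (1.0 - sep_rate)
    a12 = -beta * sep_rate
    a21 = -beta * find_rate
    a22 = 1.0 - beta * (1.0 - find_rate)
    det = a11 * a22 - a12 * a21
    W = (wage * a22 - a12 * benefit) / det
    U = (a11 * benefit - a21 * wage) / det
    return W, U


# Hypothetical parameter values, chosen only for illustration.
W, U = lifetime_values(wage=1.0, benefit=0.6, beta=0.99,
                       sep_rate=0.03, find_rate=0.30)
print(round(W, 2), round(U, 2))
```

In a counterfactual of the kind the paper runs, the wage and the job finding rate would both change with the bias levels, and the same calculation would be repeated at the new equilibrium values.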

Effective discount factor (β(1 − σw)). When a worker holds pessimistic job separation expectations, future payoffs within the current contract are discounted not at the pure time discount factor β but at β(1 − σw), which is smaller when σw is larger. A more pessimistic worker therefore effectively discounts future wage payments more steeply, and this differential discounting relative to the firm (which uses β(1 − σ)) is the key mechanism generating the contract-length dependence of the wage effect.
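
A small numerical sketch shows why the differential discounting matters more as contracts get longer. It values a constant wage stream over T periods as a geometric sum with effective discount factor β(1 − σ), once at the worker's perceived separation rate and once at the actual rate. Only the bias of 0.0194 comes from the summary; β, the actual separation rate, and the function names are hypothetical. The paper's full wage equation also involves continuation values, which are not modeled here.

```python
def contract_value(wage, beta, sep_rate, T):
    """Present value of a constant wage paid for T periods.

    Each additional period is discounted by d = beta * (1 - sep_rate),
    so the value is wage * (1 + d + ... + d**(T-1)).
    """
    d = beta * (1.0 - sep_rate)
    return wage * (1.0 - d ** T) / (1.0 - d)


BETA = 0.99                 # hypothetical quarterly discount factor
SIGMA = 0.03                # hypothetical actual quarterly separation rate
SIGMA_W = SIGMA + 0.0194    # perceived rate: actual rate plus the reported bias


def worker_to_firm_ratio(T):
    """Worker's valuation of the wage stream relative to the firm's."""
    worker = contract_value(1.0, BETA, SIGMA_W, T)
    firm = contract_value(1.0, BETA, SIGMA, T)
    return worker / firm


for T in (1, 10, 67):
    print(T, round(worker_to_firm_ratio(T), 3))
```

With T = 1 the two valuations coincide, and the ratio falls as T grows, which is the within-contract channel through which a pessimistic bias lowers the bargained wage.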

De Gustibus and Disputes about Reference Dependence

This paper examines whether heterogeneity in individual gain-loss attitudes — the degree to which people weigh losses more or less severely than equivalent gains — contaminates prior tests of expectations-based reference dependence (EBRD). The central question is: do prior experiments that appear to yield mixed or null evidence against EBRD actually reflect a failure of the expectations-based reference point, or instead reflect a methodological flaw — the implicit assumption that all individuals are uniformly loss averse?

All prior tests of EBRD models (e.g., Kőszegi and Rabin 2006, 2007) have proceeded under what the authors call “universal loss aversion,” the assumption that every individual weighs losses more heavily than commensurate gains (λ > 1). The authors argue that this assumption — a form of the classic De Gustibus conjecture — is empirically incorrect and theoretically distorting: within EBRD designs, loss-averse and gain-seeking subjects are predicted to respond in opposite directions to expectations manipulations, so aggregating across them suppresses or reverses treatment effects.

The authors run two pre-registered laboratory experiments totaling 1,524 subjects. The labor supply experiment (N = 500, UC San Diego) uses a two-stage design. Stage 1 elicits each subject’s gain-loss attitude parameter λ_i from their effort responses to fixed versus uncertain piece rates in a real-effort transcription task, exploiting the prediction that loss-averse workers reduce effort under wage uncertainty while gain-seeking workers increase it. Stage 2 manipulates expectations by varying the probability of a high outside payment (p = 0.05 in Condition Low vs. p = 0.45 in Condition High), holding the piece-rate probability constant at 50%; under EBRD, this shifts the reference point and should change effort in a direction governed by λ_i.

The exchange experiment (N = 1,024, University of Bonn, with a pre-registered 2018 replication of N = 417) uses Stage 1 preference statements over randomly endowed objects to estimate λ_i, and Stage 2 manipulates expectations via a 0% vs. 50% probability of forced exchange. Under EBRD, loss-averse subjects should become more willing to exchange in the High condition; gain-seeking subjects should become less willing.

Both experiments document substantial heterogeneity in gain-loss attitudes. In the labor supply study, 70.6% of subjects exhibit loss aversion (λ̂ > 1) and 29.4% exhibit gain-seeking (λ̂ < 1), with an average structural estimate of λ̂ = 1.65 and median 1.66. In the exchange study, 76% are loss averse and 24% are gain-seeking, with mean λ̂ = 1.49 and median 1.34. Lottery-based elicitation in the labor supply experiment yields 28% gain-seeking, consistent with prior literature estimates of roughly 22% gain-seeking from Chapman et al. (2018).

Crucially, Stage 1 gain-loss attitudes are strongly predictive of Stage 2 treatment effects in both experiments. In the labor supply study, the aggregate treatment effect of approximately 26% greater effort in Condition High — reproducing Abeler et al. (2011) — masks strongly heterogeneous responses: higher λ̂ predicts larger positive treatment effects (raw correlation ρ = 0.18, p < 0.01), and controlling for heterogeneous gain-loss attitudes raises R² by more than a factor of 10. In the exchange study, the aggregate treatment effect is precisely zero (coefficient = 0.00, clustered s.e. = 0.03), a result that prior literature would interpret as contradicting EBRD; but once gain-loss heterogeneity is accounted for, treatment effects are strongly positive for loss-averse subjects and negative for gain-seeking subjects, again raising R² by more than a factor of 10.

Gain-seeking subjects exhibit negative treatment effects in the exchange study, consistent with EBRD predictions, but in the labor supply study the average treatment effect for gain-seeking subjects remains slightly positive, representing a partial deviation from the model’s quantitative predictions. The authors interpret this as evidence that expectations-based reference points are an important but likely incomplete determinant of behavior, with attention-based, status-quo-based, or anchoring-based reference points potentially playing supplementary roles.

Q: What is the central methodological problem with prior tests of expectations-based reference dependence?

A: All prior tests assumed universal loss aversion — that every individual has λ > 1, i.e., weighs losses more severely than equivalent gains. The authors show this is both empirically wrong (roughly 24–29% of subjects are gain-seeking across both studies) and theoretically distorting: within EBRD designs, gain-seeking individuals are predicted to respond in the opposite direction from loss-averse individuals, so averaging across heterogeneous types can suppress, zero out, or even reverse the true treatment effect. This makes standard aggregate tests of EBRD unreliable.

Q: How do the authors measure gain-loss attitudes in the labor supply experiment?

A: In Stage 1, subjects make 30 effort decisions across fixed piece rates and uncertain piece rates with the same mean. Under the Kőszegi-Rabin CPE model, a loss-averse individual reduces effort when the wage is uncertain (because outcomes can fall below the reference point), while a gain-seeking individual increases effort under uncertainty. The authors estimate individual-level parameters by regressing log(e_i + 10) on log(w) and Δw/w in a random-coefficients framework; the coefficient l̂_i on Δw/w is the reduced-form measure of gain-loss attitudes, with λ̂_i = 1 + 4·(l̂_i/ĝ_i) as the structural estimate. The correlation between the two measures is ρ = 0.85 (p < 0.01).

Q: How do the authors measure gain-loss attitudes in the exchange experiment?

A: In Stage 1, subjects are randomly endowed with one of two objects and provide three unincentivized preference statements (relative liking, relative wanting, and hypothetical choice) before any possibility of exchange is introduced. Under CPE, an individual endowed with object X will prefer X to the extent that (1 + λ_i) − 2(Y/X) > 0, so subjects with higher λ_i should more strongly favor their endowment. A principal components analysis reduces the three statements to one factor (capturing ~70% of variation), and residuals from regressing that factor on object assignment constitute the reduced-form measure l̂_i. The structural estimate λ̂_i is obtained via a mixed logit using a log-normal distribution for λ_i; the reduced form and structural measures are correlated at r = 0.95 (p < 0.01).

Q: What does the distribution of gain-loss attitudes look like across the two experiments?

A: In the labor supply experiment (N = 453 estimable subjects), 70.6% are loss averse and 29.4% are gain-seeking, with mean λ̂ = 1.65 and median λ̂ = 1.66. In the exchange experiment (N = 1,024), 76% are loss averse and 24% are gain-seeking, with mean λ̂ = 1.49 and median λ̂ = 1.34. A separate lottery-based elicitation in the labor supply study finds 28% gain-seeking subjects. These proportions are consistent with the weighted average of 22% gain-seeking found by Chapman et al. (2018) across seven prior lottery-choice studies.

Q: What is the aggregate treatment effect in the labor supply experiment, and what does it look like once heterogeneity is accounted for?

A: Without accounting for gain-loss heterogeneity, Condition High is associated with roughly a 26% increase in effort relative to Condition Low (individual-clustered s.e. = 0.03, p < 0.01), reproducing the Abeler et al. (2011) result and consistent with EBRD under universal loss aversion. However, R² = 0.03. Once interactions of Condition High with l̂_i and λ̂_i are included, R² rises to 0.40 and 0.39 respectively — more than a tenfold increase. Higher λ̂_i predicts larger positive treatment effects (raw correlation ρ = 0.18, p < 0.01), and the interaction of Condition High with λ̂_i is highly significant (F(1,452) = 49.14, p < 0.01).

Q: What is the aggregate treatment effect in the exchange experiment, and what does it look like once heterogeneity is accounted for?

A: Without heterogeneity, the treatment effect of Condition High on the probability of exchanging is precisely 0.00 (clustered s.e. = 0.03), which prior literature would read as a failure of EBRD. Once heterogeneity is introduced via interactions with l̂_i and λ̂_i, the pattern changes markedly: loss-averse subjects show positive treatment effects (greater willingness to exchange in High), while gain-seeking subjects show negative treatment effects (less willingness to exchange in High), consistent with Predictions 4–6. R² again rises by more than a factor of 10. In Condition Low, 38% of subjects exchange, reflecting a significant endowment effect (F(1,1022) = 25.66, p < 0.01).

Q: Why does the aggregate treatment effect in the exchange experiment equal zero?

A: The authors show in Appendix B.4 that the relationship between λ_i and exchange probability treatment effects can be concave — negative effects for gain-seeking subjects can be of greater absolute magnitude than positive effects for loss-averse subjects. With roughly 24% gain-seeking and 76% loss-averse subjects, aggregation can yield a near-zero average even when heterogeneous effects are substantial and directionally consistent with EBRD. This aggregation problem, not a failure of the expectations-based reference point mechanism, explains the null aggregate result.
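
The aggregation argument is simple arithmetic, sketched below. The population shares are the ones reported for the exchange study; the size of the loss-averse treatment effect is a made-up number, and the function names are illustrative.

```python
def aggregate_effect(shares, effects):
    """Population-average treatment effect from type shares and type effects."""
    return sum(s * e for s, e in zip(shares, effects))


def offsetting_effect(share_loss_averse, effect_loss_averse):
    """Gain-seeker effect that would make the aggregate effect exactly zero."""
    share_gain_seeking = 1.0 - share_loss_averse
    return -share_loss_averse * effect_loss_averse / share_gain_seeking


SHARE_LA = 0.76      # share of loss-averse subjects (exchange study)
EFFECT_LA = 0.06     # hypothetical effect on exchange probability

effect_gs = offsetting_effect(SHARE_LA, EFFECT_LA)
total = aggregate_effect([SHARE_LA, 1.0 - SHARE_LA], [EFFECT_LA, effect_gs])
print(effect_gs, total)
```

With a 76/24 split, the gain-seekers' negative effect has to be roughly three times the size of the loss-averse subjects' positive effect for the average to vanish, which is the kind of asymmetry the concavity argument allows.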

Q: Do gain-loss attitudes measured in one domain predict behavior in another domain?

A: The lottery-based measure of gain-loss attitudes (from Multiple Price Lists administered after the real-effort task in the labor supply experiment) has mean λ̂ = 1.48 and median 1.42, with 28% gain-seeking subjects — proportions similar to the labor supply estimates. However, the correlation between the lottery-based and labor-supply-based structural estimates of λ̂ is only Pearson’s r = 0.091 (p = 0.03) and Spearman’s ρ = 0.084 (p = 0.075). Furthermore, the lottery measure has no predictive power for Stage 2 treatment effects. This suggests that while the prevalence of gain-seeking is similar across domains, gain-loss attitudes at the individual level are more domain-specific than prior work has appreciated.

Q: How do the authors address the “generated regressor problem” when using estimated λ̂_i as a regressor?

A: Since λ̂_i is itself estimated from Stage 1 data, using it directly as a regressor in Stage 2 regressions treats imprecise preference estimates as ideal data, which can distort inference (the Murphy-Topel problem). The authors address this by bootstrapping the entire pipeline — re-estimating gain-loss attitudes from Stage 1 in each of 500 bootstrap iterations and re-running the Stage 2 regressions — then reporting the average bootstrap coefficient and its standard deviation. The bootstrapped conclusions are qualitatively identical to the original regression results in both experiments.
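A minimal sketch of a whole-pipeline bootstrap of this kind follows. It is not the authors' code: the data are simulated, the Stage 1 estimator is a simple mean, and the Stage 2 regression is reduced to a single regressor. All names (`pipeline_slope`, `bootstrap_pipeline`, the `subjects` structure) are hypothetical.

```python
import random
import statistics


def ols_slope(x, y):
    """Slope from a one-regressor OLS of y on x (with intercept)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx


def pipeline_slope(subjects):
    """Stage 1: estimate each subject's attitude; Stage 2: regress the outcome on it."""
    attitude_hat = [statistics.fmean(s["stage1"]) for s in subjects]
    outcome = [s["stage2"] for s in subjects]
    return ols_slope(attitude_hat, outcome)


def bootstrap_pipeline(subjects, n_boot=500, seed=0):
    """Re-run both stages on each bootstrap sample; return mean and sd of the slope."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(n_boot):
        resampled = []
        for s in rng.choices(subjects, k=len(subjects)):  # resample subjects
            # Resample the subject's Stage 1 observations so the noise in the
            # generated regressor is carried into the Stage 2 coefficient.
            stage1 = rng.choices(s["stage1"], k=len(s["stage1"]))
            resampled.append({"stage1": stage1, "stage2": s["stage2"]})
        slopes.append(pipeline_slope(resampled))
    return statistics.fmean(slopes), statistics.stdev(slopes)


# Simulated, purely illustrative data: 60 subjects, 8 noisy Stage 1 observations each.
data_rng = random.Random(1)
subjects = []
for _ in range(60):
    true_attitude = data_rng.gauss(1.5, 0.5)
    stage1 = [true_attitude + data_rng.gauss(0, 1.0) for _ in range(8)]
    stage2 = 0.4 * true_attitude + data_rng.gauss(0, 0.5)
    subjects.append({"stage1": stage1, "stage2": stage2})

boot_mean, boot_sd = bootstrap_pipeline(subjects)
print(f"bootstrap coefficient: {boot_mean:.3f} (sd {boot_sd:.3f})")
```

The design point is that the Stage 1 estimate is recomputed inside every iteration, so the reported standard deviation reflects estimation noise in the regressor as well as sampling noise in Stage 2.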

Q: What limitations do the authors acknowledge in the EBRD model’s fit?

A: Even after accounting for heterogeneity, the EBRD model does not provide a complete quantitative account of behavior. In the labor supply experiment, gain-seeking subjects exhibit slightly positive average treatment effects (not negative as predicted), and loss-averse subjects’ empirical treatment effects fall short of theoretical predictions, despite a significant correlation between predicted and empirical treatment effects (ρ = 0.25, p < 0.01). The authors attribute these deviations to potential measurement error (which would attenuate estimated relationships), and to the possibility that reference points have multiple determinants — including status quo-based, attention-based, and anchoring-based factors — beyond expectations alone.

Q: What are the broader implications for other applications of gain-loss attitudes?

A: The paper’s findings have implications for any application that relies on universal loss aversion as a maintained assumption, including Rabin’s (2000) calibration argument for risk aversion at small and large stakes, insurance demand for small losses (Slovic et al., 1977), and preferences for bunched resolution of uncertainty (Kőszegi and Rabin, 2009). Admitting heterogeneity in gain-loss attitudes will require more nuanced predictions in each of these settings. The paper provides a methodology — measuring individual-level gain-loss attitudes within the experimental context of interest — for investigating and controlling for such heterogeneity.

Q: What design features prevent confounds between Stage 1 measurement and Stage 2 treatment in the exchange experiment?

A: Stage 1 uses a different pair of objects (USB stick and pens) than Stage 2 (picnic mat and thermos), or vice versa — each subject encounters each pair exactly once, with counterbalancing at the session level. Stage 1 preference statements are unincentivized and made before any possibility of exchange is introduced, so they do not contaminate the Stage 2 expectations manipulation. The random reassignment of objects at the end of Stage 1 generates exogenous variation in endowments, preventing mechanical confounds. The authors also verify that interpreting Stage 1 variation as reflecting heterogeneity in object valuations (rather than gain-loss attitudes) would predict zero heterogeneous treatment effects in Stage 2 — a prediction rejected by the data.

Expectations-Based Reference Dependence (EBRD): The formulation, due to Kőszegi and Rabin (2006, 2007), in which an individual’s reference point is the entire distribution of outcomes they rationally expected, rather than a fixed status quo. Behavior is governed by a Choice-Acclimating Personal Equilibrium (CPE) in which the chosen action is optimal given that the expectation of that action serves as the reference.

Gain-Loss Attitudes (λ_i): The individual-specific parameter governing how outcomes above versus below the reference point affect utility. Under piecewise-linear gain-loss utility, an outcome that falls short of the reference by z reduces utility by η·λ_i·z, while an outcome above it raises utility by η·z. Loss aversion is λ_i > 1; gain-seeking is λ_i < 1; loss neutrality is λ_i = 1. In this paper, λ_i is treated as heterogeneous across individuals rather than assumed uniform.

Universal Loss Aversion: The implicit homogeneity assumption maintained in all prior tests of EBRD — that every individual has λ > 1. The authors characterize this as a form of the De Gustibus Non Est Disputandum conjecture applied to gain-loss attitudes, and document that it fails empirically in both experimental settings.

Choice-Acclimating Personal Equilibrium (CPE): The rational expectations equilibrium concept from Kőszegi and Rabin (2006, 2007) used throughout the paper to derive comparative statics. A choice is a CPE if its expected utility, evaluated with its own induced distribution of outcomes as the reference, is at least as high as the expected utility of every alternative evaluated with that alternative’s induced distribution as the reference.

Reduced-Form Gain-Loss Measure (l̂_i): In the labor supply context, the individual-level OLS coefficient on Δw/w in a log-effort regression — capturing how strongly a subject reduces (or increases) effort under wage uncertainty relative to a fixed wage of equal mean. A positive l̂_i identifies loss aversion; negative identifies gain-seeking. In the exchange context, the analogous measure is the residual from regressing the first principal component of Stage 1 preference statements on object assignment.

Aggregation Problem: The paper’s central methodological contribution. When gain-loss attitudes are heterogeneous and the EBRD treatment effect is non-linear in λ_i, the average treatment effect across a heterogeneous population need not equal the treatment effect at the average λ. In the exchange experiment, the aggregate treatment effect is a precisely estimated zero even though loss-averse and gain-seeking subjects each respond in the theoretically predicted (opposite) direction: because the relationship between λ_i and the exchange-probability treatment effect is concave, the larger negative effects among the gain-seeking minority offset the smaller positive effects among the loss-averse majority.
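The gain-loss and CPE definitions above can be made concrete with a small numerical sketch. It assumes linear consumption utility and a discrete lottery; the function names and the 50/50 lottery are hypothetical choices for illustration, not taken from the paper.

```python
def gain_loss(z, eta, lam):
    """Piecewise-linear gain-loss utility: gains scaled by eta, losses by eta * lam."""
    return eta * z if z >= 0 else eta * lam * z


def cpe_utility(outcomes, probs, eta=1.0, lam=2.0):
    """Expected utility of a lottery when the lottery itself is the reference.

    Consumption utility is assumed linear. Every outcome is compared with every
    possible reference outcome, weighted by the product of their probabilities.
    """
    consumption = sum(p * x for x, p in zip(outcomes, probs))
    comparison = sum(
        p * q * gain_loss(x - r, eta, lam)
        for x, p in zip(outcomes, probs)
        for r, q in zip(outcomes, probs)
    )
    return consumption + comparison


lottery = ([0.0, 10.0], [0.5, 0.5])  # hypothetical 50/50 lottery over 0 and 10
for lam in (0.5, 1.0, 2.0):
    print(lam, cpe_utility(*lottery, eta=1.0, lam=lam))
```

Under these assumptions a sure payment of 5 is worth exactly 5 (there is nothing to compare it against), so a loss-averse subject (λ = 2) ranks the lottery below its mean, a loss-neutral subject (λ = 1) is indifferent, and a gain-seeking subject (λ = 0.5) ranks it above. That sign flip is what drives the opposite-direction predictions.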

Eliciting Multiple Prior Beliefs

Multiple prior decision models—in which beliefs are represented by a set of probability measures rather than a single measure, generating a probability interval for each event—have become increasingly important in economics, but choice-based incentive-compatible elicitation of probability intervals remains an open problem: existing scoring rules and matching-probability methods cannot recover probability intervals without assuming probabilistic sophistication that is precisely least warranted in settings where multiple priors are most relevant. This paper develops a preference-based identification of a subject’s probability interval for an event, and a method for eliciting it under weak decision-theoretic assumptions with no need for probabilistic sophistication. Three incentivized experiments on artificial and natural sources of uncertainty demonstrate that the elicited intervals are sensitive to the direction and amount of information, are typically consistent with objective probabilities where available, and exhibit a predominance of non-degenerate probability intervals that are wider when there is less information or predictability. On aggregate, the choice-based intervals are similar to stated probability intervals, providing behavioral foundations for the use of stated interval techniques in the field.

Summary of a forthcoming paper, AI-assisted and human-reviewed. See the linked original for the authoritative claims and full conditions.

In depth

Q1. What is the key identification challenge for multiple prior elicitation?

The key challenge is that existing incentive-compatible elicitation methods—scoring rules and matching-probability approaches—confound a subject’s probability interval with their ambiguity attitude, so they cannot separately identify the probability interval without assuming probabilistic sophistication. Under the popular α-maxmin EU model, the matching probability of an event depends on both the subject’s probability interval and their ambiguity attitude parameter α; even eliciting both the event and its complement’s matching probabilities yields two equations in three unknowns. Probabilistic sophistication is least warranted precisely in settings with deep uncertainty where multiple priors are most relevant, making precision-laden methods unsuitable.
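The "two equations in three unknowns" point can be checked numerically. The sketch below assumes the common convention in which α is the weight on the worst-case prior and the matching probability of an event equals its α-weighted mix of lower and upper probabilities; the function name and the parameter values are hypothetical.

```python
def matching_probabilities(p_low, p_high, alpha):
    """Matching probabilities of an event and its complement under alpha-maxmin EU.

    alpha is the weight on the worst-case prior (a convention assumed here).
    """
    m_event = alpha * p_low + (1 - alpha) * p_high
    # For the complement, the worst case is 1 - p_high and the best case is 1 - p_low.
    m_complement = alpha * (1 - p_high) + (1 - alpha) * (1 - p_low)
    return m_event, m_complement


# Two different (interval, attitude) combinations, both hypothetical.
narrow = matching_probabilities(p_low=0.30, p_high=0.60, alpha=1.00)
wide = matching_probabilities(p_low=0.15, p_high=0.75, alpha=0.75)
print(narrow, wide)
```

Both parameterizations imply the same pair of matching probabilities (up to floating-point rounding), so observing that pair cannot tell a narrow interval held by a very ambiguity-averse subject apart from a wide interval held by a less averse one.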

Q2. What is the paper’s elicitation solution?

The paper develops a preference-based method that identifies a subject’s probability interval under weak decision-theoretic assumptions—with no need for probabilistic sophistication—using a series of incentivized choices, and demonstrates its feasibility in three laboratory experiments. The approach comprises two components: (i) a preference-based identification theorem establishing the conditions under which the probability interval can be recovered from observable choices; and (ii) a concrete elicitation procedure that is incentive compatible and does not impose the precision-laden assumption of probabilistic sophistication.

Q3. What do the experiments show?

Three incentivized experiments on artificial and natural sources of uncertainty demonstrate that probability intervals elicited by the method are sensitive to the direction and amount of information, are typically consistent with objective probabilities where available, and predominantly non-degenerate—with intervals wider when there is less information or predictability. The sensitivity to information and consistency with objective probabilities provide external validation that the elicited intervals capture real beliefs rather than noise or confusion. The predominance of non-degenerate intervals (rather than point probabilities) indicates that subjects genuinely hold imprecise beliefs in the relevant settings.

Q4. What is the relationship between choice-based and stated probability intervals?

On aggregate, probability intervals elicited with the choice-based method are similar to those stated by subjects, suggesting that the new method can provide behavioral foundations for the use of stated probability-interval techniques that are widely used in field surveys but previously lacked incentive-compatible grounding. This convergence is informative because stated intervals are cognitively simpler and can be collected at large scale in surveys, while the choice-based intervals are theoretically grounded; the consistency between them justifies the use of simpler stated methods in field applications.

Key concepts

multiple priors: a model of beliefs in which a decision maker’s uncertainty is represented by a set of probability measures rather than a single measure; associated with the Gilboa-Schmeidler (1989) maxmin expected utility model and its generalizations; generates a probability interval for each event.
probability interval: the interval [p̲(E), p̄(E)] of probability values a subject’s set of priors assigns to event E; non-degenerate (with width > 0) when the subject’s beliefs are genuinely imprecise.
incentive-compatible elicitation: an elicitation procedure in which subjects’ optimal strategy is to report their true beliefs; for Bayesian single-prior beliefs, achieved by scoring rules and matching-probability methods, but these fail for multiple priors.
probabilistic sophistication: the assumption that a multiple-prior agent’s set of priors is generated by precise probabilistic beliefs; existing methods require this assumption to disentangle the probability interval from ambiguity attitude, but the paper’s method does not.

Growth Experiences and Trust in Government

This paper investigates whether individuals who have experienced stronger GDP growth over their lifetimes are more likely to trust their national government. The authors — Besley, Dann, and Dray — assemble a newly harmonized global dataset comprising approximately 3.3 million respondents across 166 countries since 1990, drawn from 11 major opinion surveys (Afrobarometer, Americasbarometer, Arabarometer, Asiabarometer, European Social Survey, Gallup World Poll, Integrated Values Survey, Latinobarometer, Life in Transition Survey, South Asia Barometer, and World Justice Project). They supplement this with longer-run U.S. evidence from the American National Election Studies (ANES) going back to 1958, covering respondents born as early as the 1880s, and longitudinal Swiss evidence from the Swiss Household Panel (SHP) which allows individual fixed-effects estimation.

The core methodological contribution is the exploitation of country-cohort variation in lifetime GDP growth experiences. Following Malmendier and Nagel (2011), the authors construct a weighted average of past growth realizations across an individual’s lifetime, with weights decaying linearly over time (lambda = 1), so that more recent growth receives greater weight. The baseline specification includes country fixed effects, cohort-by-subcontinent fixed effects, survey-by-survey-year fixed effects, controls for log GDP per capita at year of birth, and individual characteristics (sex, marital status, education, religious denomination). More demanding specifications add country-by-survey-year and country-by-age fixed effects. For Switzerland, individual fixed effects are included, fully absorbing time-invariant personal characteristics.

The main finding is that a one standard deviation increase in lifetime GDP growth experience — corresponding to approximately 2 percentage points of additional growth — is associated with a 2.1 percentage point increase in the probability of trusting the national government, significant at the 1 percent level. This corresponds to roughly 0.042 standard deviations of the trust outcome and approximately 5 percent of the global mean trust in government. The effect is quantitatively meaningful: it approximates between one-quarter and one-half of the difference in average trust between older and younger cohorts in India and Italy, respectively. For the U.S. ANES sample, a one standard deviation increase in growth experience (about 0.2 percentage points) increases trust in the federal government by 2.4 percentage points, explaining more than two-thirds of the average trust gap between Baby Boomers (born 1946–1964) and Millennials (born 1981–1996).
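The three ways of stating the global effect size can be cross-checked with back-of-envelope arithmetic. The sketch assumes the trust outcome is a binary indicator, so its standard deviation is sqrt(p(1 − p)); the implied mean trust level is derived here from the reported figures and is not a number quoted from the paper.

```python
import math

effect_pp = 2.1        # percentage-point effect reported in the summary
share_of_mean = 0.05   # "approximately 5 percent" of mean trust, from the summary

# Implied mean of the binary trust indicator (derived, not reported).
implied_mean_trust = (effect_pp / 100) / share_of_mean

# Standard deviation of a binary variable with that mean.
binary_sd = math.sqrt(implied_mean_trust * (1 - implied_mean_trust))

# Effect expressed in standard deviations of the outcome.
effect_in_sd = (effect_pp / 100) / binary_sd
print(round(implied_mean_trust, 2), round(binary_sd, 3), round(effect_in_sd, 3))
```

Under the binary-outcome assumption the implied effect is a little over 0.04 standard deviations, in line with the "roughly 0.042" stated above.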

Several scope conditions and heterogeneity findings sharpen the interpretation. First, the growth-trust link is specific to government institutions: there is no statistically significant effect of growth experience on interpersonal trust or trust in religious organizations, indicating the channel runs through perceptions of state performance rather than generalized social capital. Second, a recency heuristic operates: the linearly decaying weighting function (lambda = 1) outperforms both an unweighted lifetime average (lambda = 0) and a formative-years weighting. Growth experienced during formative years (ages 18–25) or before birth has no detectable effect on trust in government; the pre-birth result serves as a placebo test. Third, the positive growth-trust relationship is stronger in democracies than in autocracies, which the authors interpret as democracies producing citizens more responsive to government performance signals. Fourth, a “trust paradox” emerges: unconditionally, average trust in government is lower in democracies than in autocracies, and longer democratic experience is associated with lower trust, which the authors attribute to democratic institutions generating greater citizen skepticism about government performance. Fifth, core results are robust to controlling for other lifetime politico-economic experiences including inflation, banking and currency crises, epidemics, political unrest, executive turnover, stock market returns, and income inequality. The Swiss evidence further shows that private income growth experience does not drive the result — only aggregate macroeconomic growth does.

Q: What is the paper’s core quantitative finding on the growth-trust relationship? A: Using the global harmonized dataset of 3.3 million respondents across 166 countries, a one standard deviation increase in lifetime GDP growth experience (corresponding to approximately 2 percentage points of additional growth) is associated with a 2.1 percentage point increase in the probability of trusting the national government, significant at the 1 percent level. Using only the Gallup World Poll subsample (roughly half the observations), the estimated effect is somewhat larger at 3.6 percentage points per standard deviation increase. These estimates remain statistically significant under more demanding specifications with country-by-survey-year and country-by-age fixed effects, though the magnitudes decrease as these interacted fixed effects absorb variation in recent growth experiences.

Q: How do the authors measure individual lifetime growth experience? A: The growth experience variable is a weighted average of all past annual GDP per capita growth rates since an individual’s birth, with weights that decay linearly over time (lambda = 1 in the Malmendier-Nagel framework). Under this parameterization, the measure simplifies to how much recent economic performance (in the year prior to the survey) exceeds the long-run mean over the respondent’s lifetime, scaled by the respondent’s midpoint of life. This implies younger individuals are more sensitive to recent growth outcomes because their shorter life histories give recent events relatively greater weight. The authors validate this lambda = 1 choice via a grid search over alternative weighting structures using minimum residual sum of squares as the criterion.
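A minimal sketch of a Malmendier-Nagel style weighting follows. The indexing (weight (age − k)^lambda on growth k years back, with k starting at 0) and the use of log differences for growth are assumptions made for illustration; the paper's exact conventions may differ, and the data are hypothetical.

```python
def growth_experience(growth, lam=1.0):
    """Weighted average of past growth rates; `growth` is ordered oldest to newest.

    The weight on the observation k years back is (age - k) ** lam, where
    age = len(growth). lam = 1 gives linear decay, lam = 0 gives equal weights.
    """
    age = len(growth)
    weights = [(age - k) ** lam for k in range(age)]  # k = 0 is the latest year
    recent_first = growth[::-1]
    return sum(w * g for w, g in zip(weights, recent_first)) / sum(weights)


def midpoint_form(log_levels):
    """lam = 1 shortcut: latest log level minus the mean of earlier levels,
    divided by the midpoint of life, (age + 1) / 2."""
    age = len(log_levels) - 1
    past_mean = sum(log_levels[:-1]) / age
    return (log_levels[-1] - past_mean) / ((age + 1) / 2)


# Hypothetical log GDP per capita from birth to the latest year.
log_levels = [10.00, 10.01, 10.03, 10.06, 10.12]
growth = [b - a for a, b in zip(log_levels, log_levels[1:])]

print(growth_experience(growth, lam=1.0))  # linear decay
print(growth_experience(growth, lam=0.0))  # equal weights
print(midpoint_form(log_levels))           # matches the lam = 1 value
```

With linear weights the sum telescopes: the numerator becomes age times the latest log level minus the sum of the earlier log levels, and the weights sum to age(age + 1)/2. Dividing one by the other leaves the gap between the latest log level and its lifetime mean, scaled by (age + 1)/2. Because that divisor is smaller for the young, the same gap moves their measure by more.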

Q: How is reverse causality addressed? A: The empirical strategy identifies the relationship using past, cumulative growth experiences measured prior to the survey, so current trust in government cannot cause past growth. Survey-year fixed effects absorb all aggregate time trends simultaneously affecting trust and growth. The authors also conduct a placebo test showing that GDP growth occurring before an individual’s birth has a precisely estimated null effect on their trust in government, which would not be the case if unobserved societal trends were jointly driving both growth histories and political perceptions.

Q: Does growth experience affect interpersonal trust or trust in non-state institutions? A: No. The estimated coefficient on lifetime growth experience is statistically insignificant at conventional levels when interpersonal trust replaces trust in government as the dependent variable, with narrow confidence intervals indicating a precisely estimated null. Similarly, growth experience has no systematic effect on trust in religious organizations such as churches or mosques. The authors interpret these null results as evidence against the alternative explanation that broad modernizing social changes are jointly driving both growth experiences and political trust.

Q: What do the U.S. ANES results add? A: The ANES data, which extend back to 1958 and capture cohorts born as early as the 1880s, provide a within-country test controlling for state fixed effects, generation dummies, and rich individual characteristics including partisan affiliation and partisan strength. A one standard deviation increase in U.S. growth experience (approximately 0.2 percentage points) raises trust in the federal government by 2.4 percentage points, significant at the 1 percent level. This estimate is large enough to explain more than two-thirds of the average trust gap between Baby Boomers and Millennials. Results are robust to adding state-by-survey-year fixed effects and birth-state-by-generation fixed effects, and hold for a broader “trust in government index” covering beliefs about waste, corruption, and responsiveness of the federal government.

Q: What do the Swiss Household Panel results contribute? A: The SHP allows individual fixed-effects estimation, exploiting within-person changes in growth experience and trust over time from 1999 onward, which absorbs all time-invariant individual characteristics that could confound the global and U.S. cross-cohort results. The growth experience coefficient remains positive and significant, with a one standard deviation increase yielding a 1.9 percentage point increase in trust in the Swiss federal government (significant at the 1 percent level). The Swiss data also uniquely allow the authors to test whether personal income growth experience drives the result: they find no significant effect of private income growth experience on trust in government, so only aggregate macroeconomic growth matters.

Q: Does the recency heuristic hold, and does growth in formative years matter? A: The recency heuristic holds, but growth in formative years does not matter. The authors find no detectable effect of growth experienced specifically during formative years (ages 18–25) on trust in government. Additionally, in a grid-search exercise assessing model fit across different lambda values, the linearly decaying weighting scheme (lambda = 1, giving more weight to recent growth) outperforms both equal-weighted lifetime averages (lambda = 0) and weighting schemes that emphasize earlier life experiences (lambda less than 0). The pre-birth placebo result (null effect) and the absence of a formative-years effect together indicate that the operative mechanism is about evaluating current government performance based on recent macroeconomic experience, not the imprinting of long-lasting political dispositions during youth.

Q: What is the “trust paradox” and how is it documented? A: The trust paradox refers to the empirical finding that average trust in government is lower in democracies than in autocracies at the cross-country level, and that longer experience with democratic institutions within countries is associated with lower levels of trust in government in the micro data. This is counterintuitive given the standard view that good institutions should foster confidence in government. The authors suggest the paradox likely reflects democracies cultivating greater citizen skepticism and more critical judgment of government performance, rather than indicating that democratic governance actually performs worse. Importantly, the positive effect of growth experience on trust remains present in democracies, and the growth-trust relationship is actually stronger in democratic regimes, consistent with citizens in democracies being more responsive to government performance signals.

Q: How is the growth-trust finding related to corruption perceptions and living standards? A: Using the Gallup World Poll, the authors find that stronger lifetime growth experience is associated with lower perceived corruption in government, greater satisfaction with personal living standards, and higher likelihood of feeling one lives comfortably on one’s present income. These results are consistent with citizens attributing economic success to government competence and integrity, and with growth translating into perceptions of improved personal circumstances through both direct income effects and indirect public goods provision.

Q: Are the results robust to controlling for other lifetime politico-economic experiences? A: Yes. When the authors include lifetime experience measures for political unrest, executive turnover, epidemic exposure, banking crises, currency crises, and inflation (both levels and volatility) simultaneously in equation (3), the growth experience coefficient remains consistently positive, stable, and significant across all specifications. Among the other experience variables, only lifetime unrest and epidemic exposure are independently negative and statistically significant at conventional levels. F-tests reject the null hypothesis that the crisis and growth experience coefficients are equal in magnitude. The U.S. results are also robust to adding lifetime experiences with S&P 500 returns, unemployment, and top-income-share inequality measures.

Q: What are the policy implications of the findings? A: The authors note that sustained economic growth may itself be a mechanism for building political trust, with positive downstream effects for policy compliance. They point to the COVID-19 pandemic, when higher-trust societies showed lower mobility during lockdowns and higher vaccine acceptance. The growth-trust channel could therefore matter for compliance across a range of policy domains, including climate action and tax morale. Governments that deliver sustained economic growth can expect citizens to update their trust upward, particularly in democracies where citizens are more performance-responsive, while governments that preside over stagnation or contraction face predictable erosion of political legitimacy across cohorts.

Growth experience: A weighted average of all past annual GDP per capita growth realizations since an individual’s birth, with weights that decay linearly over time following Malmendier and Nagel (2011), so that more recent growth receives greater weight. Under the paper’s preferred parameterization (lambda = 1), the measure equals how much last year’s GDP per capita exceeds the respondent’s lifetime mean, scaled by the respondent’s midpoint of life.

Trust in government: A binary dummy variable equal to one if a survey respondent expresses “a great deal” or “quite a lot” of trust or confidence in the national government, constructed from harmonized responses across 11 major opinion surveys. The paper treats this as reflecting respondents’ perceptions of government performance rather than a deep interpersonal trust relationship.

Trust paradox: The empirical regularity documented in the paper whereby average trust in government is unconditionally lower in democracies than in autocracies at the cross-country level, and whereby longer democratic experience within countries is associated with lower individual trust in government. The authors attribute this to democratic institutions generating more critical citizen judgment of government performance.

Recency heuristic: The finding that more recent growth experiences carry greater weight in forming trust in government, as captured by the linear decay weighting scheme (lambda = 1) outperforming equal-weighted or early-life-weighted alternatives. Growth before birth and growth during formative years (ages 18–25) have no detectable effect, while recent macroeconomic performance is the operative signal.

Cohort-level variation: The within-country differences in lifetime growth experiences across birth cohorts that form the paper’s primary identification strategy. Because different cohorts in the same country have lived through different sequences of growth episodes, differences in trust across cohorts within a country can be attributed to differential growth exposure rather than time-invariant country characteristics.

Formative years effect: The hypothesis, tested and rejected in the paper, that economic experiences during ages 18–25 have a lasting imprint on political attitudes analogous to formative-years effects found in other political behavior literatures. The paper finds no statistically significant association between growth experienced during these years and trust in government.

Source text origin: In the pipeline context relevant to this paper’s acquisition, this refers to whether a summary was generated from full working paper text (“pdf” or “oa-html”) versus abstract only (which is hard-blocked). The working paper was obtained from LSE Research Online (eprint 129614), classified as published version under CC BY 4.0.

Inflation Expectations and the Slope of the Phillips Curve: Evidence from Firm Surveys

Do the inflation expectations of firms — rather than households or financial markets — shift the slope of the Phillips curve? Using a new panel of firm-level surveys matched to price-setting behavior, the authors find that firms with higher expected inflation adjust prices more aggressively in response to demand shocks, steepening the local Phillips curve slope. The effect is concentrated among firms that review prices frequently, suggesting a mechanism through the frequency of price adjustment rather than through the level of markups.

Summary of a forthcoming paper, AI-assisted and human-reviewed. See the linked original for the authoritative claims and full conditions.

In depth

Q1. What is the main empirical finding on expectations and the Phillips curve slope?

Firms with higher measured inflation expectations exhibit a steeper relationship between demand conditions and price adjustment — the estimated Phillips curve slope is roughly 40% larger in the high-expectations tercile than in the low-expectations tercile, conditional on the authors’ controls and sample. The authors interpret this as evidence that expectations are not merely a level shift in inflation but alter the sensitivity of prices to real activity, consistent with forward-looking pricing theories.

Q2. What is the mechanism, and how do the authors identify it?

The authors argue that expectations work through the frequency of price review: firms expecting higher inflation are more likely to be in an active review window, and so respond more to a given demand shock within that window. Identification relies on cross-firm variation in survey-measured expectations within narrow industry-time cells, so that aggregate demand shocks are held approximately fixed. The authors acknowledge this strategy absorbs industry-specific inflation trends and may understate the full expectational effect.

Q3. What does this imply for monetary policy?

If the Phillips curve slope varies with expectations, then a credible disinflation — by lowering expected inflation — flattens the curve and makes the output cost of reducing inflation larger, not smaller. The authors present this as a potential mechanism behind the observed flattening of the curve in low-inflation regimes, though they stop short of a structural welfare calculation.
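The direction of this argument can be illustrated with a stylized calculation. The sketch assumes a linear Phillips curve of the form inflation minus expected inflation equals slope times the output gap; the baseline slope is a hypothetical number, and only the "roughly 40% larger" ratio comes from the summary.

```python
def output_gap_needed(disinflation_pp, slope):
    """Size of the output gap (pp) needed to lower inflation by `disinflation_pp`
    in a linear Phillips curve: pi - pi_expected = slope * gap."""
    return disinflation_pp / slope


low_slope = 0.10               # hypothetical slope in the low-expectations tercile
high_slope = 1.4 * low_slope   # "roughly 40% larger" in the high-expectations tercile

cost_low = output_gap_needed(1.0, low_slope)
cost_high = output_gap_needed(1.0, high_slope)
print(cost_low / cost_high)
```

In this stylized setting the output gap required for one percentage point of disinflation is about 40% larger on the flatter curve, which is the sense in which lower expected inflation raises the output cost of further disinflation.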

Key concepts

Phillips curve slope: The coefficient linking excess demand (or unemployment gap) to inflation in the short-run Phillips curve — steeper means a given demand shortfall has a larger disinflationary effect.
price review frequency: How often a firm actively reconsiders its prices; firms that review more often are more likely to adjust in response to new information within any given period.
firm-level survey expectations: Inflation expectations measured directly from firms (rather than households or markets), which may better capture the beliefs that drive actual price-setting decisions.

Misspecified Expectations among Professional Forecasters

Analyzing panel data from the U.S. Survey of Professional Forecasters (SPF; 1992Q1–2019Q4, 77 forecasters, 1,520 forecaster-quarter observations), Julio Ortiz finds that a “misspecified expectations” model tends to outperform a noisy-information rational benchmark and two leading non-FIRE alternatives (overconfident and diagnostic expectations) when fit to forecast errors and revisions. In this model, forecasters perceive an AR(2) data-generating process to be an AR(1) and therefore misperceive its underlying persistence. The models are estimated by maximum likelihood and ranked using forecast-encompassing weights; for the baseline real GDP growth case, misspecified expectations earns the largest encompassing weight (0.539 vs. 0.462 for diagnostic and approximately 0 for rational and overconfident) and the highest log-likelihood. Across 14 macroeconomic variables, misspecified expectations provides the best fit for most series both in-sample and out-of-sample, though diagnostic expectations fits better for some (e.g., GDP deflator, industrial production, real residential investment) and rational expectations fits the unemployment rate best. The author argues that misspecified expectations succeeds in part because its bias enters both the prediction and updating equations, producing overreaction to new information plus overextrapolation across horizons, which makes forecast errors longer-lived. He concludes that it can serve as a “suitable approach” to modeling professional forecasters’ expectation formation and as a useful benchmark, while emphasizing that the results are specific to professional forecasting and may not carry over to household or firm expectations.

Summary of a forthcoming paper, AI-assisted and human-reviewed. See the linked original for the authoritative claims and full conditions.

In depth

Q1. What question does the paper address?

The paper undertakes a formal comparison of competing non-FIRE theories of expectation formation to move toward establishing a benchmark non-FIRE model in the context of professional forecasting. Ortiz motivates this with the observation that survey forecast errors are predictably correlated with real-time information — a violation of full-information rational expectations (FIRE) — but that, as noted in Reis (2020), the literature “has not yet settled on a benchmark non-FIRE model.” The paper offers “a partial answer to this question.”

Q2. What models are compared?

Four models are estimated: a noisy-information rational expectations baseline plus three biased non-FIRE models — overconfident expectations (Daniel et al., 1998), diagnostic expectations (Bordalo et al., 2020), and misspecified expectations (in the spirit of Fuster et al., 2010). All are embedded in a common noisy-information environment where the latent variable is unobservable and forecasters update via a Kalman filter from a noisy private signal. Overconfidence has forecasters misperceive their signal noise as smaller than it is; diagnostic expectations introduces a representativeness distortion ϕ > 0 generating overreaction to recent news; misspecified expectations has forecasters treat an AR(2) process as an AR(1).
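
A rough sketch of how two of these distortions change the updating step, using the baseline estimates reported in Q6. Several things here are assumptions made for illustration and are not taken from the paper: the state is simplified to a scalar AR(1) instead of the AR(2) process, the reported overconfidence estimate is read as α_v ≈ 0.72, the diagnostic rule is taken to scale the rational gain by (1 + ϕ) as in Bordalo et al. (2020), and the function name `steady_state_gain` is hypothetical.

```python
def steady_state_gain(rho, sigma_w, sigma_v, n_iter=500):
    """Steady-state Kalman gain for x_t = rho*x_{t-1} + w_t, s_t = x_t + v_t.

    Iterates the scalar Riccati recursion on the prior variance p.
    """
    p = sigma_w ** 2  # initial guess for the prior (one-step-ahead) variance
    for _ in range(n_iter):
        k = p / (p + sigma_v ** 2)                 # gain given current prior variance
        p = rho ** 2 * (1.0 - k) * p + sigma_w ** 2  # next-period prior variance
    return p / (p + sigma_v ** 2)

# Illustrative inputs: sigma_w / sigma_v = 1.09 and rho = 0.434 (simplified to AR(1)).
rho, sigma_w, sigma_v = 0.434, 1.09, 1.0
alpha_v, phi = 0.72, 0.23  # assumed mapping of the reported bias estimates

k_rational = steady_state_gain(rho, sigma_w, sigma_v)
# Overconfidence: the forecaster believes the signal noise is alpha_v * sigma_v.
k_overconfident = steady_state_gain(rho, sigma_w, alpha_v * sigma_v)
# Diagnostic expectations (assumed form): weight on news is (1 + phi) * rational gain.
w_diagnostic = (1.0 + phi) * k_rational

print(round(k_rational, 3), round(k_overconfident, 3), round(w_diagnostic, 3))
```

In this simplified setting both distortions raise the weight placed on the private signal relative to the rational gain, which is the sense in which they generate overreaction.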

Q3. What exactly is “misspecified expectations” in this paper?

Misspecified expectations is a model in which the underlying state follows an AR(2) process but forecasters treat it as an AR(1), so they misperceive the true persistence of the data-generating process. The author notes this version is “closest to natural expectations as modeled in Fuster et al. (2010),” with forecasters neglecting longer lags. Importantly, forecasters still understand the information structure. If the perceived persistence loads excessively onto the first lag, forecasters overextrapolate. The author flags three technical differences from Fuster et al. (2010): he does not model an AR(2) in levels with AR(1)-in-growth-rates forecasting; the perceived persistence is estimated from the data rather than defined as a function of the true autocorrelation parameters; and he does not define expectations as a weighted average of rational and naive AR(1) expectations.

Q4. What data and sample are used?

The estimation uses U.S. SPF panel data from 1992Q1 to 2019Q4, yielding 77 unique forecasters and 1,520 forecaster-quarter observations for the baseline. The 1992 start is chosen to avoid spanning different regimes and because the survey redefined output from GNP to GDP in 1992. The procedure requires unbroken observation sequences, so only each forecaster’s longest spell is kept, with a minimum spell length of eight quarters (because entry/exit may be non-random, per Engelberg et al., 2011). Real GDP growth is the baseline variable; 13 other macroeconomic variables are also estimated. Real-time forecast errors (not errors based on revised figures) are used, following the literature.
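
The spell-selection rule can be sketched as follows. This is a minimal illustration, assuming quarters are encoded as consecutive integers (for example `year * 4 + quarter - 1`); the function name `longest_spell` and the tie-breaking rule (keep the earliest of equally long spells) are assumptions, since the summary does not say how ties are handled.

```python
def longest_spell(quarters, min_len=8):
    """Return the longest run of consecutive quarter indices.

    Returns an empty list if the longest run is shorter than min_len.
    """
    best, run = [], []
    for q in sorted(set(quarters)):
        if run and q == run[-1] + 1:
            run.append(q)      # extends the current unbroken spell
        else:
            run = [q]          # a gap starts a new spell
        if len(run) > len(best):
            best = list(run)   # strict '>' keeps the earliest spell on ties
    return best if len(best) >= min_len else []

# Hypothetical forecaster: a 3-quarter spell, a 9-quarter spell, one stray quarter.
observed = [0, 1, 2, 5, 6, 7, 8, 9, 10, 11, 12, 13, 20]
kept = longest_spell(observed)
dropped = longest_spell([0, 1, 2, 3])  # shorter than eight quarters
```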

Q5. How are the models estimated and compared?

The models are estimated via a three-step maximum likelihood procedure, and their relative fit is compared using forecast-encompassing weights (West, 2001; Harvey et al., 1998; West, 2006), supplemented by AIC and a Vuong (1989) non-nested likelihood-ratio test. Step 1 estimates the fundamental process parameters (ρ₁, ρ₂, σ_w) from the macro time series and fixes them across models; step 2 estimates the signal-noise dispersion σ_v from the rational model and calibrates it across the other three; step 3 estimates each bias parameter (α_v, ϕ, ρ̂) by MLE on SPF data. This keeps fundamental and information parameters consistent across biased models so they are evaluated solely on the biases they generate, and makes identification transparent (notably, σ_v and α_v cannot be jointly identified in the overconfidence model). Encompassing weights are obtained from a constrained linear regression of realizations on model-based one-quarter-ahead forecasts, with weights summing to 1.
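
The encompassing regression can be sketched as a least-squares problem with a sum-to-one restriction. This is an illustration under stated assumptions, not the paper's code: the function name `encompassing_weights` is hypothetical, the data are synthetic, and only the sum-to-one constraint is imposed (no non-negativity bounds, which the paper may or may not use).

```python
import numpy as np

def encompassing_weights(y, forecasts):
    """Least-squares weights on competing forecasts, constrained to sum to one.

    y         : (T,) realized values
    forecasts : (T, K) one-quarter-ahead forecasts, one column per model
    The constraint is imposed by substitution: regress (y - f_K) on
    (f_k - f_K) for k < K, then set w_K = 1 - sum of the other weights.
    """
    y = np.asarray(y, dtype=float)
    f = np.asarray(forecasts, dtype=float)
    base = f[:, -1]
    x = f[:, :-1] - base[:, None]
    w_head, *_ = np.linalg.lstsq(x, y - base, rcond=None)
    return np.append(w_head, 1.0 - w_head.sum())

# Synthetic check: realizations are built as 0.6*model0 + 0.4*model1 plus small noise.
rng = np.random.default_rng(0)
f = rng.normal(size=(200, 3))
y = 0.6 * f[:, 0] + 0.4 * f[:, 1] + 0.01 * rng.normal(size=200)
w = encompassing_weights(y, f)
```

A model whose forecasts add nothing beyond the others receives a weight near zero, which is how the near-zero weights on some models in the results would arise.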

Q6. What are the baseline real GDP growth results?

For real GDP growth, the misspecified expectations model produces the highest log-likelihood and the largest encompassing weight, 0.539, versus 0.462 for diagnostic expectations and approximately 0.000 for both rational and overconfident expectations. The fundamental process estimates imply relatively low persistence (first-order autocorrelation ρ₁ ≈ 0.434, second-order ρ₂ ≈ −0.006). The estimated bias parameters are: overconfidence α_v ≈ 0.72, diagnosticity ϕ ≈ 0.23, and perceived persistence ρ̂ ≈ 0.564. Because ρ̂ ≈ 0.56 exceeds the estimated ρ₁ ≈ 0.43, the misspecified model implies forecasters overestimate the first-order autocorrelation and neglect the partial reversal in the second lag, generating overreactions. The signal-to-noise ratio implied by the estimated private noise dispersion is σ_w/σ_v ≈ 1.09. AIC rankings (and BIC) do not change the ordering relative to the maximized likelihoods.
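
To make the overextrapolation concrete, the reported estimates can be plugged into the two laws of motion. The sketch below assumes, for illustration only, that the current state is observed without noise (in the paper it is latent) and starts from a unit value with a zero lag; the function names are hypothetical.

```python
rho1, rho2 = 0.434, -0.006  # fundamental AR(2) estimates reported above
rho_hat = 0.564             # perceived AR(1) persistence reported above

def ar2_forecasts(x_t, x_tm1, horizons):
    """Iterate the AR(2) recursion forward to get E[x_{t+h}], h = 1..horizons."""
    path, prev, cur = [], x_tm1, x_t
    for _ in range(horizons):
        prev, cur = cur, rho1 * cur + rho2 * prev
        path.append(cur)
    return path

def perceived_ar1_forecasts(x_t, horizons):
    """Forecasts of an agent who believes x follows an AR(1) with rho_hat."""
    return [rho_hat ** h * x_t for h in range(1, horizons + 1)]

true_path = ar2_forecasts(1.0, 0.0, 4)
perceived_path = perceived_ar1_forecasts(1.0, 4)
bias = rho_hat - rho1  # persistence bias, about 0.13

for h, (t, p) in enumerate(zip(true_path, perceived_path), start=1):
    print(h, round(t, 3), round(p, 3))
```

Under these assumptions the perceived forecast lies above the AR(2) forecast at every horizon shown, so a shock is extrapolated too far into the future.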

Q7. Does the result hold across other macroeconomic variables?

Across the 14 SPF macroeconomic variables, misspecified expectations provides the best in-sample fit for most series, but not all. Diagnostic expectations registers larger encompassing weights for certain series — the GDP deflator (0.771), industrial production (1.000), and real residential investment (0.624). Rational expectations provides the best fit for the unemployment rate (0.745) and housing starts (in-sample). For the bulk of the remaining variables (e.g., CPI 0.859, payroll employment 1.000, real consumption 0.777, real federal spending 1.000, real GDP 0.539, real nonresidential investment 1.000, real state/local spending 1.000, 3-month Treasury bill 0.713, 10-year bond 0.746), misspecified expectations carries the largest weight. Overconfident expectations “does not yield particularly large encompassing weights for any variable.”

Q8. Why does misspecified expectations fit better, and for which variables especially?

The author finds that, among variables exhibiting overreactions, misspecified expectations tends to offer a better fit for less persistent series, because the scope for it to generate overreaction (ρ̂ − ρ₁) is greater when ρ₁ is low. Unlike the alternatives, the persistence bias ρ̂ − ρ₁ can be positive or negative, allowing the model to account for both overreacting and underreacting variables; the alternative models cannot generate forecaster-level underreaction. Figure 2 plots the encompassing weight on misspecified expectations against the sum of autoregressive coefficients and suggests (with some exceptions) that less persistent variables have higher weight on misspecified expectations.

Q9. Does the model perform out of sample?

When the models are estimated on 1992Q1–2005Q4 and evaluated on the latter half of the sample, the misspecified expectations model also provides a better out-of-sample fit for more of the variables. Out of sample, however, diagnostic expectations outperforms for the GDP deflator (0.987), industrial production (0.959), payroll employment (0.813), and real federal government expenditures (0.591); overconfident expectations outperforms for the 10-year government bond (0.653); and rational expectations outperforms for housing starts (0.502) and the unemployment rate (1.000). The author cautions that these results do not imply forecasters could improve their forecasts in real time, because the MLE observations include contemporaneous individual and consensus forecast errors that are not known to forecasters when they issue forecasts; for the same reason, the results are “not inconsistent with” Eva and Winkler (2023) on the poor out-of-sample performance of error-predictability regressions.

Q10. Could the apparent advantage of misspecified expectations just reflect learning?

The author argues that learning about the data-generating process does not appear to drive the relative model rankings in favor of misspecified expectations, based on two exercises. First, using the full pre-COVID sample (1968Q4–2019Q4) over 25-year rolling windows (three-year roll), the misspecified model outperforms diagnostic expectations in six of ten sub-samples and all models in five of ten, while diagnostic expectations wins four of ten — patterns that “do not indicate that learning over time favors misspecified expectations.” Second, splitting forecasters by “age”/tenure (a proxy for experience), misspecified expectations outperforms the others among experienced (above-median age) forecasters (encompassing weight 0.766, with overconfidence 0.234) and is dominant among inexperienced ones (1.000). The author concedes learning “is likely reflected in professional forecasts” but does not appear to drive the rankings.

Q11. What additional moments does misspecified expectations match?

Beyond overall fit, the author shows in the appendix that misspecified expectations matches five features of the data — overreaction, underreaction, overshooting, persistent disagreement, and updating behavior — and is the only model generating delayed overshooting. All three non-rational models generate individual-level overreaction (Bordalo et al., 2020 errors-on-revisions regression) and aggregate underreaction (Coibion-Gorodnichenko, 2015 consensus regression). But when simulating impulse responses, “only the misspecified expectations model generates a sign switch in the forecast error,” indicating delayed overshooting (Angeletos et al., 2020). The author reports “stronger evidence” favoring misspecified expectations on two further moments: it better generates persistent disagreement across horizons, and it better matches the relative weights forecasters place on priors versus news — because its bias also enters the prediction equation (not just the update equation), producing longer-lived errors.
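
Both reaction diagnostics come down to the sign of an OLS slope of forecast errors on forecast revisions: a negative slope in the pooled individual-level regression indicates overreaction, and a positive slope in the consensus regression indicates underreaction. The sketch below shows only the slope computation on synthetic numbers (not SPF data); the function name `error_on_revision_slope` is hypothetical.

```python
import numpy as np

def error_on_revision_slope(errors, revisions):
    """OLS slope (with intercept) from regressing forecast errors on revisions."""
    e = np.asarray(errors, dtype=float)
    r = np.asarray(revisions, dtype=float)
    r_centered = r - r.mean()
    return float(r_centered @ (e - e.mean()) / (r_centered @ r_centered))

# Synthetic illustration: errors are built to fall when revisions rise,
# which is the overreaction pattern at the individual level.
rng = np.random.default_rng(0)
revisions = rng.normal(size=500)
errors = -0.3 * revisions + 0.1 * rng.normal(size=500)
beta = error_on_revision_slope(errors, revisions)
```

Applying the same function to cross-sectional averages of errors and revisions would give the consensus-level counterpart.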

Q12. What are the scope conditions and limitations the author stresses?

The author emphasizes that the results are specific to the context of professional forecasting and that the relative model rankings “may be different” for household or firm expectations, or for micro-level expectations rather than aggregate forecasts. He notes professional forecasters are arguably the most well-informed agents, so the literature has treated their predictions as informative about a lower bound on economy-wide information frictions and biases. The paper abstracts away from learning in the model setup and from theories that generate only underreaction. Models excluded from the comparison (e.g., imperfect memory, multi-frequency forecasting, asymmetric attention, learning) are set aside mainly because they cannot be flexibly nested into the common setting and would introduce additional parameters posing identification challenges.

Ortiz concludes that misspecified expectations “can serve as a suitable approach” (a useful benchmark) for modeling expectation formation among professional forecasters for a variety of macroeconomic aggregates, while framing this as only “a partial answer” to the search for a non-FIRE benchmark. He highlights a practical advantage: embedding this form of misspecified expectations into a quantitative model “only requires introducing two parameters into an otherwise standard model.” He also notes misspecification can arise either from a behavioral bias or because adopting parsimonious forecasting models is optimal (Branch and Evans, 2006; Pfajfar, 2013). A promising avenue for future research is whether evidence favors misspecified expectations in other settings.

Key concepts

Full-information rational expectations (FIRE): The benchmark in which forecast errors are uncorrelated with any information in the forecaster’s time-t information set; the orthogonality conditions it implies “tend to be violated in the data,” motivating non-FIRE models.
Misspecified expectations: The paper’s focal bias — the true state follows an AR(2) process, xₜ = ρ₁xₜ₋₁ + ρ₂xₜ₋₂ + wₜ, but forecasters treat it as an AR(1), xₜ = ρ̂xₜ₋₁ + uₜ, misperceiving its persistence; forecasters retain the correct information structure. The bias enters both the predict and update equations.
Persistence bias (ρ̂ − ρ₁): The gap between perceived AR(1) persistence and true first-order autocorrelation; positive values generate overextrapolation/overreaction, negative values generate underreaction, and its overreaction scope is larger when ρ₁ is low.
Overconfident expectations: Forecasters misperceive their private signal noise as smaller (σ̃_v = α_v σ_v, α_v ∈ [0,1]) than it truly is, placing excessive weight on new private information.
Diagnostic expectations: A representativeness-based distortion (Bordalo et al., 2020; Gennaioli-Shleifer, 2010) in which, with diagnosticity ϕ > 0, forecasters overweight outcomes representative relative to a “no news” reference scenario, generating overreaction to recent news.
Encompassing weight: The model-comparison metric, a weight wₖ from a constrained linear regression of realized values on the competing models’ one-quarter-ahead forecasts, with weights summing to one; a larger weight indicates a better-fitting model.
Delayed overshooting: The Angeletos et al. (2020) pattern of initial underreaction followed by later overreaction to a shock; in this paper, only misspecified expectations produces the sign switch in the forecast-error impulse response that signals it.
Overreaction vs. underreaction: Individual-level overreaction is measured via the Bordalo et al. (2020) errors-on-revisions regression; aggregate/consensus-level underreaction via the Coibion-Gorodnichenko (2015) regression — the data exhibit both, and a successful non-FIRE model must reproduce both.

Narratives about the Macroeconomy

Layer 1 — Overview

Research Question

This paper investigates two related empirical questions in the context of the historic surge in US inflation in late 2021 and 2022: (1) What narratives—causal stories—do people invoke to explain why inflation increased? (2) How do those narratives shape economic expectations? A companion theoretical component asks how narrative heterogeneity affects aggregate macroeconomic outcomes.

Data and Methodology

The authors recruit more than 10,000 US households across five descriptive survey waves (November 2021, December 2021, January 2022, March 2022, May 2022) via Lucid, plus a separate expert survey of 111 academic economists with JEL-E publications in top journals, recruited simultaneously with the November 2021 household wave. Household samples are broadly representative of the US population in terms of gender, age, region, and income. The expert sample is highly credentialed: on average 18.6 years post-PhD, 2.7 top-five publications, and 5,534 Google Scholar citations.

Narratives are elicited through open-ended questions asking respondents to explain in their own words why inflation increased. Each text response is coded by two independent, blinded research assistants as a Directed Acyclic Graph (DAG) — a network of causal nodes representing factors (demand-side: government spending, monetary policy, pent-up demand, demand shift; supply-side: supply chain disruptions, labor shortage, energy crisis; miscellaneous: pandemic, government mismanagement, price gouging, Russia-Ukraine war) connected by directed causal edges. Inter-rater reliability is high: if one coder identifies a factor, the other does so 88% of the time; for specific causal connections between factors, agreement is 77%.
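
One way to compute an agreement rate of the form "if one coder assigns a factor, how often does the other" is sketched below. The exact formula used by the authors is not given in this summary, so the symmetric pooling over both coders, the function name `conditional_agreement`, and the toy responses are all assumptions.

```python
def conditional_agreement(coder_a, coder_b):
    """Share of factor assignments by either coder that the other coder also made.

    coder_a, coder_b: lists of sets of factor labels, one set per response.
    Each shared label counts once from each coder's side.
    """
    matched = total = 0
    for a, b in zip(coder_a, coder_b):
        matched += 2 * len(a & b)
        total += len(a) + len(b)
    return matched / total if total else float("nan")

# Two hypothetical responses coded by two hypothetical coders.
coder_a = [{"supply chain", "pandemic"}, {"energy"}]
coder_b = [{"supply chain"}, {"energy", "war"}]
rate = conditional_agreement(coder_a, coder_b)  # 4 matched out of 6 assignments
```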

Three experiments study the causal effect of narratives on expectations: (1) A pent-up demand vs. energy crisis narrative provision experiment (April 2022, n=2,397 baseline, n=1,329 follow-up); (2) A monetary policy vs. energy crisis narrative provision experiment (June 2022, n=1,069 baseline, n=736 follow-up); (3) A 2×2 belief-updating experiment crossing narrative type (government spending vs. energy crisis) with information type (low vs. high government spending forecast) (April 2022, n=997).

Main Findings with Quantitative Magnitudes

Households’ narratives are substantially coarser than experts’: expert DAGs contain on average 4.3 factors and 3.6 causal links, while household DAGs contain only 3.5 factors and 2.8 links (both differences p < 0.01). Households focus predominantly on supply-side explanations: 57% invoke at least one supply-side factor vs. only 32% invoking any demand-side factor. The most common household narrative factors are supply chain disruptions (30%), labor shortage (27%), and general supply-side factors (22%); the leading demand-side factor is government spending, appearing in only 17% of household narratives, while loose monetary policy appears in just 5%. By contrast, 90% of experts invoke at least one supply-side factor and 84% at least one demand-side factor, with government spending mentioned by 50% of experts and monetary policy by 38%.

Among households who invoke at least one supply or demand narrative, only 34% mention both supply and demand factors; among the corresponding subsample of experts, 77% mention both. Government mismanagement—a politicized judgment of policy failure—appears in 32% of household narratives but only 1% of expert narratives. Price gouging appears in 8% of household narratives and 0% among experts.

Partisan polarization is large: Democrat-leaning respondents are 26 pp more likely to attribute inflation to the pandemic as a root cause (p < 0.01); Republican-leaning respondents are 38 pp more likely to blame government mismanagement (p < 0.01), and 19 pp more likely to mention high government spending (p < 0.01) and 14 pp more likely to mention high energy prices (p < 0.01).

Narratives are correlated with inflation expectations in OLS regressions controlling for demographics and survey wave fixed effects (n=2,951): invoking government mismanagement is associated with 1.155 pp higher 1-year-ahead inflation expectations (p < 0.01) and 0.805 pp higher 5-year-ahead expectations (p < 0.01). Energy crisis narratives are associated with 0.661 pp higher 1-year-ahead expectations (p < 0.01), and pent-up demand narratives with 0.640 pp lower 5-year-ahead expectations (p < 0.05). Narrative variables explain approximately 10% of the out-of-sample variation in 1-year-ahead inflation expectations via LASSO, comparable to or exceeding the explanatory power of demographics and inflation experiences found in prior work.

In Experiment 1 (pent-up demand vs. energy crisis), providing the pent-up demand narrative reduces 12-month inflation expectations by 0.71 pp relative to the energy crisis treatment (p < 0.01, in the main survey), corresponding to 24% of a standard deviation. This effect persists in the follow-up survey one day later (−0.63 pp, p < 0.01).

In Experiment 2 (monetary policy vs. energy crisis), the monetary policy narrative reduces 12-month inflation expectations by 0.40 pp at the time of the main survey (p < 0.01) and by 0.62 pp in the follow-up (p < 0.01).

In Experiment 3 (information updating), respondents exposed to the government spending narrative increase 12-month inflation expectations by 1.79 pp in response to a high-spending forecast (p < 0.01), while those exposed to the energy crisis narrative show no significant reaction (0.34 pp, p = 0.205). In IV regressions instrumenting government spending expectations with the high/low forecast treatment, a 1 pp increase in perceived government spending growth raises inflation expectations by 0.378 pp among those holding the government spending narrative (p < 0.01) versus only 0.051 pp among those holding the energy narrative (p = 0.184; difference p < 0.01).

The New Keynesian DSGE model shows that a modest shift in perceived importance of monetary policy relative to productivity (raising ω_ν from 0.1 to 0.2, holding ω_g fixed) raises equilibrium consumption by 27 basis points and reduces equilibrium inflation by 27 basis points in the calibrated model with φ = 1.5; with a less reactive central bank (φ = 1.25), the same shift raises consumption by 30 basis points and reduces inflation by 62 basis points.

Scope Conditions

All empirical results are drawn from the US context during the 2021–2022 inflation surge. The authors note that the extent of partisan polarization in US narratives may not generalize to less politically polarized countries. The test-retest correlation of narrative factors across a three-day interval is 0.63 (p < 0.01), indicating significant but not perfect stability. The experiment results may partly reflect that narratives were especially malleable because the inflation surge was a relatively recent and salient phenomenon at the time of data collection.

Layer 2 — Q&A

Q1: How do the authors define and operationalize “narratives”?

A: The paper defines economic narratives as causal accounts for why an economic event occurred — agents’ assessments of cause-effect relationships across events. Each text response is coded as a Directed Acyclic Graph (DAG) where nodes are economic factors and directed edges represent perceived causal links. DAGs can represent both simple mono-causal accounts and complex multi-factor chains. The authors use a predefined coding scheme of 16+ factor categories spanning demand-side, supply-side, and miscellaneous nodes, with inflation as the terminal node.

Q2: What is the inter-rater reliability of the DAG coding, and what does it imply for the quality of the narrative data?

A: Two independent, blinded coders annotate each response. If one coder assigns a given factor, the other does so 88% of the time; for specific causal connections between factors, agreement is 77%. Approximately 95% of assigned factors and 89% of assigned connections make it to the final coded version. At the coarser level of “any demand-side factor,” agreement rises to 94%; for “any supply-side factor,” to 93%. Test-retest reliability across a three-day interval averages a correlation of 0.63 across all narrative factors (p < 0.01), comparable in magnitude to the measured persistence of economic preferences in prior work.

Q3: How do expert and household narratives differ in their structural complexity?

A: Expert DAGs contain on average 4.3 factors and 3.6 causal links, compared to 3.5 factors and 2.8 links for households (both p < 0.01). These differences persist even after controlling for response time and word count, indicating genuine differences in economic understanding rather than effort. Among agents who invoke at least one supply or demand factor, 77% of experts mention both, compared to only 34% of households.

Q4: What are the most prevalent factors in household narratives versus expert narratives, and why does this matter?

A: Supply chain disruptions (30%), labor shortage (27%), and general supply-side factors (22%) top household narratives, while monetary policy appears in only 5% of household DAGs. Expert narratives are more balanced: 90% cite supply-side factors and 84% cite demand-side factors, with government spending mentioned by 50% and monetary policy by 38%. This matters because factors with different persistence imply different trajectories for future inflation; households’ supply-side emphasis, combined with low awareness of monetary policy, shapes their inflation expectations in systematically different ways than experts.

Q5: What is the structure of household narrative clusters, and how fragmented are they?

A: Agglomerative hierarchical clustering using the Jaccard distance between DAG edge lists reveals 15 optimal clusters (Silhouette criterion), of which eight have at least 30 members. Four supply-side clusters account for 55% of households: pandemic-related supply chain disruptions (20%), general supply-side causes (18%), energy crisis often attributed to government mismanagement (11%), and labor shortages attributed to the pandemic or government spending (7%). The only clear demand-side cluster—combining government spending and loose monetary policy—captures just 8%. Simple mono-causal clusters attributing inflation to the pandemic alone (15%), government mismanagement alone (11%), and price gouging alone (4%) are collectively prominent, underscoring how fragmented and often single-factor household reasoning is.
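
The distance that feeds the clustering can be sketched directly from its definition: one minus the ratio of shared edges to all edges appearing in either DAG. The example narratives and the function name `jaccard_distance` below are hypothetical; the convention that two empty edge lists have distance zero is also an assumption.

```python
def jaccard_distance(edges_a, edges_b):
    """Jaccard distance between two DAGs represented as lists of (cause, effect) edges."""
    a, b = set(edges_a), set(edges_b)
    union = a | b
    if not union:
        return 0.0  # assumed convention for two empty narratives
    return 1.0 - len(a & b) / len(union)

# Three hypothetical coded narratives.
d1 = [("pandemic", "supply chain"), ("supply chain", "inflation")]
d2 = d1 + [("energy", "inflation")]
d3 = [("government spending", "inflation")]

near = jaccard_distance(d1, d2)  # shares 2 of 3 edges
far = jaccard_distance(d1, d3)   # shares no edges
```

The full matrix of pairwise distances would then be passed to an agglomerative clustering routine, with the number of clusters chosen by a criterion such as the silhouette score.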

Q6: How do partisan affiliations correlate with narrative content?

A: Republicans are 38 pp more likely than Democrats to attribute inflation to government mismanagement (p < 0.01), 19 pp more likely to mention high government spending (p < 0.01), and 14 pp more likely to mention high energy prices (p < 0.01). Democrats are 26 pp more likely to cite the pandemic as a root cause of inflation (p < 0.01) and more frequently cite pandemic-related supply chain issues and corporate greed. Government mismanagement appears in 32% of all household narratives (and is often portrayed as a root cause of spending, monetary policy, and energy prices) but in only 1% of expert narratives.

Q7: How did the composition of household narratives shift over time (November 2021 to May 2022)?

A: The energy crisis narrative rose sharply from 12% in January 2022 to 28% in March 2022, coinciding with Russia’s invasion of Ukraine in late February 2022. The Russia-Ukraine war narrative went from virtually zero before February 2022 to 28% in March 2022. By contrast, pandemic references, which climbed from 44% in November 2021 to 55% in January 2022, fell back to 47% in March 2022 and 39% in May 2022. Labor shortage references fell sharply from 32% in January 2022 to 15% in May 2022. These abrupt shifts suggest household narratives respond to major news events and, by extension, could drive rapid revisions in inflation expectations around such events.

Q8: What is the correlational evidence that narratives predict inflation expectations, and how large is the explanatory power?

A: OLS regressions on pooled data from November 2021–January 2022 (n=2,951), controlling for survey wave fixed effects and sociodemographics, show: government mismanagement narratives predict 1.155 pp higher 1-year inflation expectations (p < 0.01) and 0.805 pp higher 5-year expectations (p < 0.01); energy crisis narratives predict 0.661 pp higher 1-year expectations (p < 0.01); monetary policy narratives predict 1.005 pp higher 1-year expectations (p < 0.01); pent-up demand narratives predict 0.640 pp lower 5-year expectations (p < 0.05). LASSO out-of-sample prediction using DAG factor dummies and connection dummies explains approximately 10% of variation in 1-year-ahead inflation expectations — comparable to the 10% within-sample R² found by D’Acunto et al. (2021) for grocery price exposure, and substantially above the 2–7% found by Giglio et al. (2021) for investor characteristics explaining stock return expectations.

Q9: What does Experiment 1 (pent-up demand vs. energy crisis) show about the causal effect of narratives?

A: Providing the pent-up demand narrative (relative to the energy crisis narrative) increases the fraction of respondents invoking pent-up demand by 37.8 pp in the follow-up survey (baseline: 2.8%, p < 0.01) and reduces the fraction invoking the energy crisis by 7.9 pp (p < 0.01), establishing successful first-stage uptake. In the main survey (n=2,397), the pent-up demand treatment reduces 12-month inflation expectations by 0.71 pp relative to the energy treatment (p < 0.01), equivalent to 24% of a standard deviation; the effect persists at −0.63 pp in the follow-up one day later (p < 0.01). The energy crisis treatment has no significant effect on expectations relative to a pure control (−0.02 pp, p = 0.911), suggesting that energy crisis implications were already salient at the time.

Q10: What does Experiment 2 (monetary policy vs. energy crisis) add, given it was conducted after significant Fed tightening?

A: The experiment was run in June 2022, when 61% of respondents were already aware the Fed had raised rates. The monetary policy narrative increases the fraction invoking monetary policy by 39 pp and reduces the fraction invoking the energy crisis by 50 pp relative to the energy group (both p < 0.01). The monetary policy narrative reduces 12-month inflation expectations by 0.40 pp in the main survey (p < 0.01) and 0.62 pp in the follow-up (p < 0.01). The mechanism is that attributing past inflation to loose monetary policy, which has since been tightened, leads respondents to infer lower future inflation; narratives thus matter through the perceived persistence of the underlying cause.

Q11: What does Experiment 3 demonstrate about how narratives filter the interpretation of new information?

A: In the 2×2 design, all respondents first receive either a government spending narrative or an energy crisis narrative, then either a low (−4%) or high (+6%) government spending forecast from the Survey of Professional Forecasters. Among those with the government spending narrative, the high-spending forecast raises 12-month inflation expectations by 1.79 pp (p < 0.01); among those with the energy crisis narrative, the high-spending forecast raises inflation expectations by a non-significant 0.34 pp (p = 0.205). The IV estimate shows that a 1 pp increase in expected government spending growth raises inflation expectations by 0.378 pp for those holding the spending narrative (p < 0.01) vs. 0.051 pp for those holding the energy narrative (p = 0.184); this difference is highly significant (p < 0.01). Importantly, the first-stage effect on expected government spending growth is similar across narrative groups (4.7 pp vs. 6.8 pp, difference not significant), ruling out differential interpretation of the forecast itself as the mechanism.
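
As a rough consistency check, with a binary instrument and no covariates the IV estimate equals the ratio of the reduced-form effect to the first-stage effect. The sketch assumes that the first-stage figures are listed in the order spending narrative (4.7 pp), energy narrative (6.8 pp); the reported IV coefficients may also include controls, so only approximate agreement should be expected. The function name `wald_ratio` is hypothetical.

```python
def wald_ratio(reduced_form_pp, first_stage_pp):
    """IV estimate with a single binary instrument: reduced form / first stage."""
    return reduced_form_pp / first_stage_pp

# Effect of the high-spending forecast on inflation expectations, divided by
# its effect on expected government spending growth (assumed group ordering).
spending_narrative = wald_ratio(1.79, 4.7)
energy_narrative = wald_ratio(0.34, 6.8)

print(round(spending_narrative, 3), round(energy_narrative, 3))
```

The two ratios come out close to the reported 0.378 and 0.051, which supports the assumed ordering of the first-stage figures.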

Q12: How do the authors formalize narratives in the DSGE model, and what is the key mapping result?

A: Narratives are formalized as subjective causal models (SCMs): linear mappings from N observable factors to inflation, π_t = ψ_1(i)z_{1,t} + … + ψ_N(i)z_{N,t}, combined with perceived AR(1) processes for each factor. The “subjective inflation narrative” of agent i is summarized by perceived contribution shares ω_z(i). The paper’s Proposition 2 gives closed-form expressions for equilibrium inflation and consumption as functions of these perceived shares, without imposing that they be correct or identical across agents. The key result is that subjective causal models always affect equilibrium outcomes so long as the perceived persistence parameters differ across factors — the mechanism being that different narratives produce different inflation expectations, which feed back into consumption and pricing decisions.
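
A partial-equilibrium sketch of why perceived shares matter when persistence differs across factors. It assumes that the shares ω are each factor's perceived contribution to current inflation and sum to one, so that the one-step expectation is current inflation times the share-weighted average persistence. The persistence values are the calibrated ones reported in Q13; the share values, including ω_g = 0.1, and the function name are hypothetical. Equilibrium feedback is ignored.

```python
def expected_inflation(pi_now, shares, persistence, horizon=1):
    """E_i[pi_{t+h}] = pi_t * sum_n omega_n * rho_n**h under perceived AR(1) factors."""
    return pi_now * sum(shares[k] * persistence[k] ** horizon for k in shares)

# Perceived persistence: productivity (a), government spending (g), monetary policy (nu).
persistence = {"a": 0.9, "g": 0.8, "nu": 0.5}

baseline = {"a": 0.8, "g": 0.1, "nu": 0.1}  # hypothetical shares
shifted = {"a": 0.7, "g": 0.1, "nu": 0.2}   # more weight on monetary policy

e_base = expected_inflation(1.0, baseline, persistence)
e_shift = expected_inflation(1.0, shifted, persistence)
```

Moving weight from the most persistent factor to the least persistent one lowers expected inflation for the same current inflation, which is the direction of the narrative effect described in the answer above. If all factors were perceived as equally persistent, the shares would drop out.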

Q13: What are the quantitative implications of narrative shifts in the calibrated DSGE model?

A: The baseline calibration uses standard New Keynesian parameters (β=0.99, γ=1, ς=5, Calvo price duration=4 quarters, φ=1.5, ρ_a=0.9, ρ_g=0.8, ρ_ν=0.5) with a scenario of a 10% productivity decline, 10% government spending increase, and policy rate 2 pp below the Taylor rule. Under rational expectations, π_t=3.68% and c_t=−11.79%. Raising the perceived importance of monetary policy in household and firm inflation narratives from ω_ν=0.1 to ω_ν=0.2 (lowering ω_a by the same amount, holding ω_g fixed) increases equilibrium consumption by 27 basis points and reduces equilibrium inflation by 27 basis points. With a less reactive central bank (φ=1.25), the same narrative shift raises consumption by 30 basis points and reduces inflation by 62 basis points. The paper notes that these effects are approximately linear in the narrative shift, meaning the directional implication holds across a wide range of narrative configurations.

Q14: How does narrative heterogeneity across households affect aggregate outcomes in the model?

A: When households hold heterogeneous narratives, aggregate outcomes depend on the joint distribution of perceived factor importance (ω_z(i)) and perceived factor persistence (ρ_z(i)) across agents, rather than on average values alone. Specifically, the model shows that if households who assign higher importance to a given factor also perceive that factor as more persistent, the aggregate effect on expectations and consumption is amplified beyond what the average narrative predicts. Additionally, narrative heterogeneity generates consumption heterogeneity even when the efficient allocation requires all households to consume the same amount, representing a welfare-relevant distortion absent under rational expectations.

Q15: What is the practical implication for central bank communication?

A: Under full-information rational expectations, central bank narrative communication about the drivers of inflation is irrelevant because agents already hold the correct model. Once subjective causal models can deviate from the truth, central bank narrative provision shifts aggregate equilibrium outcomes (inflation and consumption) in a benchmark New Keynesian model. The paper argues that central banks need to measure the distribution of household narratives to know whether their communication shifts agents toward or away from the rational expectations equilibrium — moving agents in the direction of the correct narrative produces better aggregate outcomes from the central bank’s perspective, conditional on inflation being above target and output below first-best.

Key Concepts

Economic Narrative (as used in this paper): An agent’s causal account for why a given economic event occurred — specifically, an assessment of cause-effect relationships that explains the drivers of an economic outcome. Distinguished from more general notions of “story” in that causality is the core; the paper does not count descriptions of correlation or simple statements of fact as narratives.

Directed Acyclic Graph (DAG) representation of narratives: Each narrative is coded as a network of factor nodes connected by directed edges indicating perceived causation. Acyclicity rules out feedback loops in a respondent’s causal account. Factors with nonzero ψ(i) are included; the direction of edges indicates causal flow. This representation allows quantitative comparison across respondents via adjacency matrices or Jaccard distances between edge lists.
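A minimal sketch of the two operations this definition relies on: checking that a coded narrative is acyclic and computing the Jaccard distance between two edge lists. The factor labels and respondents are hypothetical, and treating two empty graphs as identical is a convention chosen here rather than something stated in the source.

```python
def jaccard_distance(edges_a, edges_b):
    """1 - |A & B| / |A | B| for two collections of directed edges."""
    a, b = set(edges_a), set(edges_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here: two empty graphs are identical
    return 1.0 - len(a & b) / len(union)

def is_acyclic(edges):
    """Kahn's algorithm: True if the directed graph contains no cycle."""
    edges = set(edges)
    nodes = {n for edge in edges for n in edge}
    indegree = {n: 0 for n in nodes}
    children = {n: [] for n in nodes}
    for src, dst in edges:
        children[src].append(dst)
        indegree[dst] += 1
    queue = [n for n in nodes if indegree[n] == 0]
    visited = 0
    while queue:
        node = queue.pop()
        visited += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    # Nodes on a cycle never reach indegree zero, so they are never visited.
    return visited == len(nodes)

# Hypothetical respondents; each edge is (cause, effect).
resp_1 = [("pandemic", "supply_chain"), ("supply_chain", "inflation")]
resp_2 = [("pandemic", "supply_chain"), ("gov_mismanagement", "inflation")]
d12 = jaccard_distance(resp_1, resp_2)
```

The two hypothetical respondents share one of three distinct edges, so their distance is 2/3.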

Subjective Causal Model (SCM) of inflation: The paper’s formal theoretical counterpart to a narrative: a linear mapping π_t = Σ_n ψ_n(i) z_{n,t} in which individual i assigns perceived marginal effect ψ_n(i) to each factor z_n, combined with a perceived AR(1) law of motion for each factor. The SCM does not need to be correct or shared across agents. The rational expectations equilibrium is the special case where all agents’ SCMs match the true data-generating process.

Perceived contribution share (ω_z): The ratio ψ_z(i)·z_t / π_t — agent i’s perceived percentage contribution of factor z to current inflation. This is the sufficient statistic for the effect of household narratives on inflation expectations and, through the NK model, on equilibrium aggregate outcomes. The aggregate distribution of ω_z(i) and perceived persistence ρ_z(i) determines the consumption Euler equation at the aggregate level.
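The share definition can be made concrete with a short sketch. The one-step expectation below is derived here from the stated ingredients (a linear SCM plus perceived AR(1) factors), not quoted from the paper, and the values of psi, z, and rho are hypothetical.

```python
def contribution_shares(psi, z):
    """Return (omega, pi) with omega_n = psi_n * z_n / pi and pi = sum_n psi_n * z_n."""
    pi = sum(psi[n] * z[n] for n in psi)
    if pi == 0:
        raise ValueError("shares are undefined when perceived inflation is zero")
    return {n: psi[n] * z[n] / pi for n in psi}, pi

def expected_inflation(pi, omega, rho):
    """One-step-ahead expectation implied by perceived AR(1) factors.

    With E[z_{n,t+1}] = rho_n * z_{n,t}, the SCM gives
    E[pi_{t+1}] = sum_n psi_n * rho_n * z_{n,t} = pi_t * sum_n omega_n * rho_n.
    """
    return pi * sum(omega[n] * rho[n] for n in omega)

# Hypothetical agent: perceived marginal effects, factor realizations, persistence.
psi = {"a": -0.5, "g": 0.25, "nu": -1.0}
z = {"a": -4.0, "g": 4.0, "nu": -1.0}
rho = {"a": 0.9, "g": 0.8, "nu": 0.5}

omega, pi_now = contribution_shares(psi, z)
pi_next = expected_inflation(pi_now, omega, rho)
```

In this sketch the shares sum to one by construction, and shifting weight toward a factor perceived as less persistent lowers the implied expectation.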

Government mismanagement (as a narrative factor): A coding category that captures explicit reference to policy failure or low-quality decision-making by policymakers in a politicized sense — distinct from the economic factors of government spending or monetary policy. It represents households’ attribution of inflation to the incompetence or malfeasance of officials, rather than to any specific economic mechanism. This factor appears in 32% of household narratives but only 1% of expert narratives.

Narrative cluster: A group of respondents whose DAGs are mutually similar (measured by Jaccard distance between edge lists) and whose typical DAG differs from other clusters. Identified via agglomerative hierarchical clustering. The paper identifies eight substantively meaningful clusters, ranging from supply-chain-focused to mono-causal pandemic or mismanagement narratives, with no single cluster capturing more than 20% of households.
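A rough sketch of agglomerative clustering on Jaccard distances between edge sets. The source does not state the linkage rule, so average linkage is an assumption made here, and the four DAGs are hypothetical.

```python
def jaccard(a, b):
    """Jaccard distance between two sets of directed edges."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

def agglomerate(edge_sets, n_clusters):
    """Merge the closest pair of clusters until n_clusters remain.

    Average linkage is an assumption; the source only names
    agglomerative hierarchical clustering.
    """
    clusters = [[i] for i in range(len(edge_sets))]

    def linkage(c1, c2):
        pairs = [(i, j) for i in c1 for j in c2]
        return sum(jaccard(edge_sets[i], edge_sets[j]) for i, j in pairs) / len(pairs)

    while len(clusters) > n_clusters:
        # Pick the pair of clusters with the smallest average distance.
        _, a, b = min(
            (linkage(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

# Hypothetical respondents, each a set of (cause, effect) edges.
dags = [
    {("pandemic", "inflation")},
    {("pandemic", "inflation"), ("pandemic", "supply_chain")},
    {("gov_mismanagement", "inflation")},
    {("gov_mismanagement", "inflation"), ("gov_spending", "inflation")},
]
groups = agglomerate(dags, n_clusters=2)
```

With these inputs the two pandemic narratives end up in one group and the two government narratives in the other.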

Test-retest reliability of narratives: The correlation between the same respondent’s narrative elicited on two occasions three days apart. The paper estimates an average correlation of 0.63 across all narrative factors (p < 0.01), interpreted as indicating significant stability in households’ causal beliefs rather than survey noise. Comparable in magnitude to test-retest correlations of economic preferences in other studies.

Professional survey forecasts and expectations in DSGE models

This paper asks whether Survey of Professional Forecasters (SPF) data can be efficiently integrated into medium-scale DSGE models, and whether models with imperfectly rational expectations based on Adaptive Learning (AL) outperform the standard Rational Expectations (RE) hypothesis when survey forecasts are used as observables. The authors work with quarterly US data spanning 1981q2–2019q2, using the Philadelphia Fed Real-Time Data Set (first and second releases) alongside SPF nowcasts for inflation, consumption, investment, and output growth. The SPF nowcast is defined as a prediction formed in the middle of period t+1 for period t+1 given information for period t, making it a suitable proxy for the model-based expectation E_t y_{t+1}.

The core methodological contribution is a re-specification of structural shocks into persistent (AR) and transitory (i.i.d.) components. For the risk premium, investment-specific technology, government spending, and markup shocks, each shock is decomposed into two independent innovations, yielding 12 total structural innovations. A reduced-form VAR exercise motivates this: SPF nowcast innovations explain 19–33% of the 5-year forecast error variance of the macro variables and 44–71% of the variance of the nowcasts themselves. The 1-quarter RMSFE of the baseline RE model without SPF, expressed as a ratio to the SPF RMSFE, is 1.10 for inflation, 1.26 for consumption, 1.19 for investment, and 1.26 for GDP, all significantly above 1; the SPF RMSFEs themselves are 0.21, 0.43, 1.49, and 0.35.

Log marginal likelihood improves monotonically as shocks are progressively re-specified: baseline RE (–577.37), RE with two-component markups (RE_mu, –536.63), adding real shocks stepwise (–473.29, –410.84), and finally all shocks (RE_all, –385.07). RE_all matches or beats SPF 1-quarter forecast accuracy (RMSFE ratio to SPF of 1.00 for inflation and investment; beats SPF for consumption growth), and Diebold-Mariano tests show no significant difference from SPF up to 5 quarters ahead. The paper further shows that once this two-component structure is imposed, exogenous sentiment shocks become unnecessary: RE_all (–385.07), which has all shocks re-specified and no sentiment shocks, outperforms RES_all (–388.17), its counterpart with sentiment shocks.

Three AL belief specifications are then estimated: MSVflex (full RE information set with an independently and rapidly updating constant, posterior autocorrelation 0.9937 — nearly a random walk), RBflex (restricted information set augmented with shock innovations, with meaningful time-variation of belief coefficients at rho_AL = 0.87), and HBflex (agents switch between MSV and RB based on past forecasting performance; average RB weight 0.34, weight sensitivity delta = 4.77). All AL models outperform RE_all: MSVflex (–381.38), HBflex (–355.09), RBflex (–351.59), with RB and HB yielding the largest gains particularly during and after the Great Financial Crisis.

AL models address three specific RE limitations. First, trend breaks: the ALM constant tracks persistent deviations, with ALM constants for consumption and investment successfully picking up rising macroeconomic trends in earlier sub-periods, yielding superior long-term forecasts. Second, time-varying transmission: the RB model generates cyclical volatility that stays lower in normal times and rises during distress, reducing reliance on large persistent investment-technology shocks relative to RE. Third, predictability of forecast errors: the RE model’s investment forecast inherits the SPF underreaction (b-coefficient 0.72, p < 0.001), while RBflex and HBflex reduce this to 0.17 and 0.34 respectively, both statistically insignificant.

On an extended sample including the Covid recession, the RBflex model underperforms because its restricted information set cannot handle abrupt complex dynamics; MSVflex and HBflex continue to perform well, with the MSV regime dominating in the HB model during Covid and post-Covid periods. Scope conditions: the dataset is US, 1981q2–2019q2 for baseline estimation; the predictability (underreaction) problem is confirmed only for investment SPF, not for inflation, consumption, or GDP growth in this sample.

Q1: What is the SPF nowcast, and why do the authors treat it as a proxy for model-based expectations? The SPF nowcast is defined as a prediction formed in the middle of quarter t+1 for the value of a variable in quarter t+1, conditional on information available through quarter t. Because agents are assumed to make decisions for period t and form expectations for t+1 based on information through t, this timing aligns precisely with the model-based conditional expectation E_t y_{t+1}. The authors use first-release data (r1) and the SPF nowcast (f0) both published in the course of t+1 as measurement variables, with the Kalman filter recovering implied structural shocks.

Q2: How large is the informational content of SPF nowcasts in reduced-form analysis? A 7-variable Cholesky VAR places each SPF series last, so the survey innovation is orthogonal to standard macro variables by construction. The 5-year forecast error variance decompositions show SPF nowcast shocks explain 19% of inflation variance, 33% of consumption variance, 33% of investment variance, and 29% of GDP variance (Table 1). The nowcasts themselves are explained 44–71% by their own innovations. SPF nowcasts also substantially outperform the baseline RE model: the RE model without SPF produces RMSFE ratios of 1.10 for inflation, 1.26 for consumption, 1.19 for investment, and 1.26 for GDP relative to SPF (all statistically significant by Diebold-Mariano test).
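The forecast comparison described here rests on two standard statistics, sketched below: the RMSFE ratio relative to a benchmark and a Diebold-Mariano statistic for 1-step forecasts under squared-error loss. The data are hypothetical and chosen only so the arithmetic is easy to follow; no long-run variance correction is applied, which is appropriate only at the 1-step horizon.

```python
import math

def rmsfe(forecasts, actuals):
    """Root mean squared forecast error."""
    errors = [f - a for f, a in zip(forecasts, actuals)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))

def dm_statistic(errors_model, errors_benchmark):
    """Diebold-Mariano statistic for 1-step forecasts, squared-error loss.

    Positive values mean the model's loss exceeds the benchmark's.
    """
    d = [em ** 2 - eb ** 2 for em, eb in zip(errors_model, errors_benchmark)]
    n = len(d)
    mean_d = sum(d) / n
    var_d = sum((x - mean_d) ** 2 for x in d) / n
    return mean_d / math.sqrt(var_d / n)

# Hypothetical realized values and two sets of 1-quarter forecasts.
actual = [1.0, 2.0, 3.0, 4.0]
model = [1.5, 1.0, 3.5, 5.0]
spf = [1.25, 1.5, 3.25, 4.5]

ratio = rmsfe(model, actual) / rmsfe(spf, actual)
dm = dm_statistic(
    [f - a for f, a in zip(model, actual)],
    [f - a for f, a in zip(spf, actual)],
)
```

A ratio above 1 together with a large positive statistic corresponds to the pattern reported for the baseline RE model without SPF observables.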

Q3: What is the shock re-specification, and why is it necessary to exploit survey data? The Smets-Wouters (2007) ARMA(1,1) shock structure conflates the transitory and persistent innovation into a single disturbance, making it impossible for the Kalman filter to separately attribute high-frequency and low-frequency movements. The re-specification splits each shock b_t into a persistent component b_t^ar (driven by the innovation epsilon^{b,ar}, with persistence rho_b) and an i.i.d. transitory component b_t^iid (driven by the innovation epsilon^{b,iid}), yielding 12 total structural innovations. This allows survey nowcasts, which are forward-looking, to identify the persistent component separately from the transitory one. Without the re-specification the fit is far worse (log marginal likelihood of –577 for RE vs. –385 for RE_all).
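A minimal sketch of the two-component shock, assuming Gaussian innovations and hypothetical parameter values. It simulates the sum of an AR(1) component and an i.i.d. component and computes the autocorrelation of that sum, which shows why a single persistence parameter cannot summarize both sources of variation.

```python
import random

def simulate_shock(rho, sigma_ar, sigma_iid, periods, seed=0):
    """Simulate b_t = b_t^ar + b_t^iid with independent Gaussian innovations."""
    rng = random.Random(seed)
    persistent, path = 0.0, []
    for _ in range(periods):
        persistent = rho * persistent + rng.gauss(0.0, sigma_ar)
        path.append(persistent + rng.gauss(0.0, sigma_iid))
    return path

def two_component_autocorr(rho, sigma_ar, sigma_iid, lag):
    """Theoretical autocorrelation of b_t at a given lag (requires |rho| < 1)."""
    if lag == 0:
        return 1.0
    var_ar = sigma_ar ** 2 / (1.0 - rho ** 2)   # variance of the AR(1) part
    total_var = var_ar + sigma_iid ** 2          # the i.i.d. part adds variance only
    return rho ** lag * var_ar / total_var

# Hypothetical parameters.
path = simulate_shock(rho=0.9, sigma_ar=1.0, sigma_iid=1.0, periods=200)
acf1_mixed = two_component_autocorr(0.9, 1.0, 1.0, lag=1)
acf1_pure = two_component_autocorr(0.9, 1.0, 0.0, lag=1)
```

The transitory component lowers the first autocorrelation below rho while leaving the decay rate at longer lags unchanged, so the ratio of autocorrelations at lags 2 and 1 still equals rho.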

Q4: Does re-specification of real shocks render exogenous sentiment shocks redundant? Yes. Models with standard real shock processes but exogenous sentiment shocks (RES: –477.88; RES_mu: –488.96) do fit substantially better than models without sentiment (RE: –577.37; RE_mu: –536.63), confirming Milani’s (2017) result. However, once the two-component real shock structure is introduced, RE_all (–385.07) outperforms RES_all (–388.17) and the estimated sentiment shocks become small and explain little of the business cycle. The fundamental shock re-specification subsumes what sentiment shocks were previously capturing.

Q5: How do AL models compare to RE in terms of model fit? All three AL models outperform RE_all: MSVflex (–381.38, improvement of 3.69 log-likelihood units), HBflex (–355.09, improvement of 29.98 units), RBflex (–351.59, improvement of 33.48 units). The RB and HB specifications, which assume more severe deviation from RE with restricted information sets and time-varying transmission, achieve the largest gains. The MSV improvement accumulates gradually, concentrating in the late 1990s and 2000s, while RB shows sustained improvement in the 1980s and mid-1990s and performs exceptionally well during and after the GFC.
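The improvements quoted above are differences in log marginal likelihood, which can be read as log Bayes factors relative to RE_all. The short sketch below reproduces that arithmetic from the values reported in the text; the dictionary and function names are illustrative.

```python
# Log marginal likelihoods as reported in the summary.
log_ml = {
    "RE": -577.37,
    "RE_mu": -536.63,
    "RES_all": -388.17,
    "RE_all": -385.07,
    "MSVflex": -381.38,
    "HBflex": -355.09,
    "RBflex": -351.59,
}

def log_bayes_factor(model, baseline="RE_all"):
    """Difference in log marginal likelihood; positive favors `model`."""
    return log_ml[model] - log_ml[baseline]

gains = {m: round(log_bayes_factor(m), 2) for m in ("MSVflex", "HBflex", "RBflex")}
ranking = sorted(log_ml, key=log_ml.get, reverse=True)
```

Sorting by log marginal likelihood places RBflex first and the baseline RE model last.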

Q6: How does the AL mechanism handle macroeconomic trend shifts? Under RE with fixed coefficients, expectations anchor around a constant steady state, so persistent deviations from trend generate systematic forecast errors. Under AL, the ALM constant mu_t in the Actual Law of Motion evolves over the business cycle. In the MSVflex model, the autocorrelation parameter for the constant is estimated at 0.9937 (posterior mean), making it nearly a random walk that can track long-lasting trends. ALM constants for consumption and investment in the MSV setup successfully pick up rising macroeconomic trends in earlier sub-periods, translating into superior longer-term forecast performance relative to RE.
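How a near-random-walk constant tracks a trend shift can be sketched with a scalar Kalman filter. This is a deliberately simplified stand-in, not the paper's belief-updating system: only the autocorrelation 0.9937 comes from the text, while the noise variances q and r, the initial values, and the data are hypothetical.

```python
def track_constant(observations, rho=0.9937, q=0.01, r=1.0, mu0=0.0, p0=1.0):
    """Scalar Kalman filter for mu_t = rho * mu_{t-1} + eta_t, y_t = mu_t + eps_t.

    q and r are the (assumed) variances of eta_t and eps_t.
    Returns the filtered estimates of mu_t.
    """
    mu, p, path = mu0, p0, []
    for y in observations:
        # Prediction step.
        mu_pred = rho * mu
        p_pred = rho ** 2 * p + q
        # Update step.
        gain = p_pred / (p_pred + r)
        mu = mu_pred + gain * (y - mu_pred)
        p = (1.0 - gain) * p_pred
        path.append(mu)
    return path

# Hypothetical data: the series sits at 0, then shifts permanently to 1.
observations = [0.0] * 20 + [1.0] * 200
estimates = track_constant(observations)
```

The estimate stays at zero before the shift and then drifts toward the new level, settling slightly below it because rho is marginally less than one.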

Q7: How does the RB model generate time-varying volatility, and why does this matter for investment dynamics? In RBflex, as beliefs are revised via the Kalman filter, the sensitivity of expectations and realized variables to shocks changes over the business cycle. The model generates cyclical volatility that remains lower in normal times and rises during distress — a realistic pattern absent from RE models. Consequently, RB does not need to rely as heavily on large persistent risk premium and investment-specific technology shocks: average volatility of these processes in the RB model does not increase in the last sub-period and remains generally lower across the whole sample, in contrast to RE’s behavior during the GFC. The RB model also shows a 3-times-smaller estimated measurement error in the investment SPF equation relative to the AL specification without restricted beliefs.

Q8: What happens to predictability of model-based forecast errors under AL versus RE? Using the Coibion-Gorodnichenko (2015) regression of forecast errors on forecast revisions, the RE model’s investment forecast shows a b-coefficient of 0.72 (p < 0.001), inheriting the underreaction documented in SPF investment data (b = 0.49, p = 0.006). AL models break this inheritance: RBflex ALM b-coefficient for investment is 0.17 (not statistically significant) and HBflex is 0.34 (not statistically significant). AL models achieve this because they relax the RE constraint of internal consistency between agents’ and model forecasts, allowing the ALM to generate efficient forecasts even when agent PLMs display sluggish adjustment.
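The regression in question is an ordinary least squares fit of forecast errors on forecast revisions. A minimal sketch with hypothetical data follows; the synthetic errors are constructed with a slope of 0.5 so the estimate can be checked by hand.

```python
def cg_regression(forecast_errors, forecast_revisions):
    """OLS of forecast errors on forecast revisions: error = a + b * revision.

    A positive b indicates underreaction; b = 0 is the FIRE benchmark.
    """
    n = len(forecast_errors)
    mean_x = sum(forecast_revisions) / n
    mean_y = sum(forecast_errors) / n
    sxx = sum((x - mean_x) ** 2 for x in forecast_revisions)
    sxy = sum(
        (x - mean_x) * (y - mean_y)
        for x, y in zip(forecast_revisions, forecast_errors)
    )
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

# Hypothetical revisions and errors with a built-in slope of 0.5.
revisions = [-1.0, -0.5, 0.0, 0.5, 1.0]
errors = [0.1 + 0.5 * rev for rev in revisions]
a_hat, b_hat = cg_regression(errors, revisions)
```

In applied work the slope would be accompanied by a standard error (typically heteroskedasticity- and autocorrelation-robust), which this sketch omits.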

Q9: How do the models perform during the Covid recession? The RBflex model does not perform optimally on the extended sample including the Covid recession. The authors attribute this to the restricted information set in the RB PLM being insufficient to describe the abrupt, complex macroeconomic dynamics of the Covid crisis. The MSVflex and HBflex models continue to perform well. In the HBflex model, the MSV regime naturally dominates during the Covid and post-Covid periods, while the RB regime had been more prominent between recessions in the pre-Covid sample.

Q10: What is the role of heterogeneous beliefs, and how do agents switch between PLMs? In HBflex, expectations are a weighted average of MSV and RB predictions with weights evolving as a function of past belief forecast errors. The weight sensitivity parameter is estimated at delta = 4.77, indicating weights are relatively sensitive to fitness. The average estimated weight on the RB PLM is 0.34 (MSV receives 0.66 on average). The RB weight tends to increase and reach its highest values between recessions, consistent with the restricted model being more parsimonious and useful in stable periods, while the fuller MSV model dominates in high-volatility episodes such as the Covid recession.
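One way to picture fitness-based weights is a logit rule in which the weight on a PLM falls with its recent mean squared forecast error. The summary states only that weights depend on past forecast errors with sensitivity delta; the logit functional form below is an assumption, and the error values are hypothetical.

```python
import math

def rb_weight(mse_rb, mse_msv, delta=4.77):
    """Weight on the RB forecast under an assumed logit rule.

    Equivalent to exp(-delta*mse_rb) / (exp(-delta*mse_rb) + exp(-delta*mse_msv)).
    """
    return 1.0 / (1.0 + math.exp(delta * (mse_rb - mse_msv)))

def combined_expectation(forecast_rb, forecast_msv, weight_rb):
    """Weighted average of the two PLM predictions."""
    return weight_rb * forecast_rb + (1.0 - weight_rb) * forecast_msv

# Hypothetical past performance: RB slightly worse than MSV.
w_equal = rb_weight(mse_rb=0.30, mse_msv=0.30)
w_worse = rb_weight(mse_rb=0.44, mse_msv=0.30)
expectation = combined_expectation(forecast_rb=2.0, forecast_msv=1.0, weight_rb=w_worse)
```

Under this assumed rule, a larger delta makes the weights react more sharply to a given gap in forecast performance.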

Q11: What are the out-of-sample forecasting results? The out-of-sample evaluation covers 2008q1–2019q2. The RB model outperforms the RE model in predicting investment and interest rate dynamics, and for investment it also outperforms professional forecasters during this period. At longer horizons (up to 5 quarters ahead), RE model forecasts are generally not statistically significantly different from SPF predictions once SPF nowcasts are included as observables, suggesting that observing the SPF data is sufficient to capture the most informative content from surveys for longer-horizon predictions.

Q12: What is the relationship to Milani (2017) and the prior literature on sentiment shocks? Milani (2017) found that exogenous sentiment shocks orthogonal to fundamentals were needed to fit SPF forecasts alongside an AL model and explained a significant portion of US business cycle fluctuations. The current paper shows this result is not robust to re-specifying fundamental shocks into persistent and transitory components: once the two-component structure is introduced, sentiment shocks become small and economically unimportant (RES_all at –388.17 versus RE_all at –385.07). What Milani attributed to sentiment was largely capturing the inability of single-innovation shocks to separately account for high-frequency and low-frequency variance.

Key Concepts

SPF Nowcast as proxy for model expectations: The Survey of Professional Forecasters’ nowcast is defined as a prediction formed in the middle of quarter t+1 for the value of a variable in that same quarter, conditional on information available through quarter t. This timing makes it directly comparable to the model-based conditional expectation E_t y_{t+1}, so the SPF nowcast can be added to the DSGE model’s observable set with a straightforward measurement equation linking it to model expectations plus i.i.d. measurement error.

Shock re-specification into persistent and transitory components: Each structural shock (risk premium, investment-specific technology, government spending, and markup shocks) is decomposed into an AR(1) persistent component driven by epsilon^{b,ar} and an i.i.d. transitory component driven by epsilon^{b,iid}, replacing the ARMA(1,1) specification in Smets-Wouters (2007) that conflates both into a single innovation. This decomposition is the key technical device enabling survey data to separately identify low-frequency and high-frequency sources of volatility.

Adaptive Learning (AL): An expectation-formation mechanism in which agents do not know true model parameters and instead estimate linear forecasting models (PLMs) that are updated each period via a Kalman filter algorithm. This produces a time-varying Actual Law of Motion (transmission parameters mu_t, T_t, R_t all evolve with beliefs), enabling endogenous trend drift and time-varying shock responses absent from RE models with fixed coefficients.

Minimum State Variable (MSV) beliefs with flexible constant: An AL specification in which agents use the same endogenous state variables and shocks as in the RE solution but with the constant term updated at an independent, more rapid rate. The constant’s autocorrelation is estimated at 0.9937, making it nearly a random walk capable of tracking persistent macroeconomic trend deviations from the deterministic steady state.

Restricted Beliefs (RB): An AL specification in which each agent’s PLM uses a reduced information set (autoregressive terms of the forward-looking variable augmented with selected shock innovations) rather than the full RE state space. This more severe departure from RE yields the largest marginal-likelihood gain over RE_all, generates realistic cyclical volatility amplification, and produces a 3-times-smaller measurement error for investment SPF, but underperforms during the Covid recession due to the restricted set’s inability to handle abrupt complex dynamics.

Heterogeneous Beliefs (HB): An AL specification in which agents may switch between MSV and RB PLMs as a weighted average, with weights evolving as a function of past belief forecast errors. The average weight on RB is 0.34 and the weight sensitivity delta is estimated at 4.77; the RB weight tends to be highest between recessions and lowest during high-volatility episodes such as the Covid recession when the fuller MSV information set dominates.

FIRE predictability test (Coibion-Gorodnichenko regression): Under Full Information Rational Expectations, the regression of forecast errors on forecast revisions should yield a b-coefficient of zero. A positive and significant b indicates systematic underreaction to news. The paper confirms b = 0.49 (p = 0.006) for investment SPF (but not for inflation, consumption, or GDP) and shows the RE model inherits this inefficiency (b = 0.72, p < 0.001 for investment), while AL models reduce it to insignificance (RBflex: 0.17; HBflex: 0.34).
