C5 | Macro Paper Warehouse

Mis(sed) Diagnosis: Physician Decision Making and ADHD

This paper develops and estimates a structural model of ADHD diagnosis to decompose the mechanisms driving the observed 2.3:1 male-to-female diagnostic difference in the United States. The research question is: to what extent does the large gender gap in ADHD diagnosis reflect true differences in symptom prevalence, versus patient-side utilization costs, versus physician decision-making under uncertainty? The setting is particularly well-suited to this question because DSM-V diagnostic guidelines for ADHD are explicitly gender-neutral, making any gender difference in physician thresholds a detectable deviation from uniform clinical rules.

The data come from de-identified electronic health records from a large Arizona healthcare system covering January 2014 through September 2017. The sample encompasses 36,193 unique encounters for approximately 11,070 pediatric patients. The raw male-to-female diagnostic ratio in the data is 2.32:1 (7.2% of males vs. 3.1% of females receive a clinical ADHD diagnosis). This gap persists after controlling for demographics, general healthcare utilization, and mental health utilization in reduced-form regressions, motivating the structural approach.

Because two key variables — whether a patient received a behavioral assessment (Qi) and the ADHD match signal observed by the physician (xi) — are not directly recorded in the EHR, the author constructs them from clinical doctor note text. A random forest machine learning classifier trained on labeled appointments predicts behavioral assessment take-up for unlabeled encounters; approximately 20.8% of children are predicted to have received a behavioral assessment (23.2% of males vs. 18.3% of females). The ADHD match signal is constructed via an adjusted Bag-of-Words cosine similarity measure comparing each patient’s aggregated note text to the DSM-V symptom list, rescaled to [0,1]. The average signal is 0.319 overall, with males averaging 0.326 and females 0.311.

The structural model has three stages. First, patients/caregivers decide whether to schedule a behavioral assessment, a function of underlying latent ADHD risk (vi) and mental healthcare utilization costs (ci). Second, conditional on assessment, the physician receives a noisy signal of vi and updates beliefs via Bayesian learning; signal quality ρ governs diagnostic uncertainty. Third, the physician diagnoses ADHD if posterior risk exceeds a gender-specific diagnostic threshold τ. Population mean ADHD risk (μ) is identified using regression-adjusted initial primary care provider referral rates as a quasi-exogenous cost-shifter: patients of high-referral-rate providers are less selected on risk when they enter assessment, so their average observed signal approaches population mean risk. This extrapolation approach follows Arnold et al. (2022).

The structural parameter estimates reveal that male and female children have similar mean ADHD risk, slightly higher for males (μm = 0.290 vs. μf = 0.262), and similar mean utilization costs (cm = 0.116 vs. cf = 0.109). The most striking differences are in physician parameters: signal quality is lower for male patients (ρm = 0.479 vs. ρf = 0.552), indicating higher diagnostic uncertainty for boys; and diagnostic thresholds are substantially lower for male patients (τm = 0.257 vs. τf = 0.312), meaning physicians are willing to diagnose ADHD in boys with lower posterior risk.

Counterfactual decomposition simulations attribute approximately 20–25% of the 2.32:1 diagnostic gap to underlying differences in ADHD risk, approximately 20% to differences in selection into behavioral assessments, and the remaining majority — approximately 55–60% — to physician decision-making. Within physician decision-making, differences in diagnostic thresholds alone account for roughly two-thirds of the overall diagnostic gap.

The paper offers economic rationales for why gender-specific thresholds may be consistent with physician rationality despite uniform guidelines: higher diagnostic uncertainty for boys justifies lower thresholds under Bayesian updating; hyperactive/impulsive symptoms predominant in boys impose larger classroom externalities (Aizer, 2008); and female patients show higher rates of internalizing co-morbidities (anxiety, depression) that may reduce the marginal benefit of an additional ADHD diagnosis. A type-specific threshold extension finds that for male patients the threshold for hyperactive/impulsive symptoms is significantly lower than for inattentive symptoms, consistent with salience of externally disruptive behaviors. These rationalizations do not vindicate the gap as fully guideline-consistent, but suggest physicians may be responding to real heterogeneity in external costs and co-morbidity patterns.

Q: What is the main research question and why is ADHD a useful setting? A: The paper asks what mechanisms produce the 2.3:1 male-to-female ADHD diagnostic difference: true symptom prevalence, patient utilization costs, or physician decision-making. ADHD is well-suited because (1) clinical guidelines (DSM-V) are explicitly gender-neutral and require the same symptom count threshold regardless of sex; (2) diagnosis is based on subjective behavioral assessment rather than objective testing, creating substantial physician discretion; and (3) both missed and excess diagnosis carry meaningful costs — missed diagnosis limits educational accommodations; excess diagnosis exposes children to Schedule II controlled substances.

Q: What data does the paper use and what are the key descriptive facts? A: The data are de-identified electronic health records from a large Arizona healthcare system, 2014–2017, covering 36,193 encounters for 11,070 pediatric patients aged 5 and above. Overall ADHD diagnosis rate is 5.2%, with males at 7.2% and females at 3.1%, a 2.32:1 ratio that matches national levels. Approximately 49.5% of the sample is Hispanic, which the author notes contributes to a below-national-average overall diagnosis rate. The gender diagnostic gap persists even after controlling for demographics, general healthcare utilization, and mental health utilization in reduced-form regressions.

Q: How does the paper construct the behavioral assessment indicator (Qi) and the ADHD match signal (xi)? A: Qi is constructed using a random forest classifier trained on doctor notes from appointments where assessment status is known with near-certainty (ADHD diagnosis or DSM-V comorbid diagnosis = positive; non-mental-health diagnosis code for patients with no mental health history = negative). The classifier uses 41 features including note length and top-20 word frequencies for each label class. xi is constructed via an adjusted Bag-of-Words cosine similarity between each patient’s combined behavioral assessment notes and the DSM-V symptom list, separately for inattentive and hyperactive/impulsive sub-types, taking xi = max{xi1, xi2}. The average xi is 0.319 (males 0.326, females 0.311) in the behavioral assessment subsample.

Q: What is the identification strategy for recovering population mean ADHD risk (μ)? A: Because xi is observed only for endogenously selected patients, the observed sample mean overestimates population mean risk. The author uses regression-adjusted referral rates of each patient’s initial primary care provider (IPCP) as a quasi-exogenous cost-shifter satisfying (a) relevance — IPCP referral intensity lowers patient scheduling costs — and (b) independence from patient ADHD risk vi, since IPCPs are typically chosen before behavioral symptoms develop and only 28% of IPCPs in the sample ever diagnose ADHD themselves. Population mean risk is then recovered by extrapolating the relationship between IPCP referral propensity and average observed xi to propensity = 1, following Arnold et al. (2022). The maximum observed IPCP referral propensity is only about 0.75, so the estimate requires extrapolation beyond the observed support.
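The extrapolation step can be illustrated with a minimal sketch. It assumes a linear relationship between IPCP referral propensity and the average observed signal, which this summary does not confirm as the paper's functional form. The provider-level numbers and the function names `fit_line` and `extrapolate_mean_risk` are hypothetical, invented only to show the mechanics.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b

def extrapolate_mean_risk(propensities, avg_signals, target=1.0):
    """Predict the average signal at referral propensity = target."""
    a, b = fit_line(propensities, avg_signals)
    return a + b * target

# Hypothetical data: propensities stop at 0.75, as in the observed support.
propensity = [0.15, 0.30, 0.45, 0.60, 0.75]
# Hypothetical average signals: more selected samples show higher signals.
avg_signal = [0.42, 0.39, 0.37, 0.35, 0.32]

mu_hat = extrapolate_mean_risk(propensity, avg_signal)
print(round(mu_hat, 3))
```

Because the fitted line is evaluated at propensity 1, beyond the largest observed value of about 0.75, the estimate depends on the assumed functional form holding outside the data.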

Q: What are the estimated structural parameters and what do they imply? A: Mean ADHD risk is μm = 0.290 vs. μf = 0.262 — males have modestly higher underlying risk. Mean utilization costs are cm = 0.116 vs. cf = 0.109 — nearly identical across genders. Signal quality (diagnostic certainty) is lower for males: ρm = 0.479 vs. ρf = 0.552, indicating physicians face more diagnostic uncertainty when assessing boys. Most importantly, diagnostic thresholds are lower for males: τm = 0.257 vs. τf = 0.312, meaning physicians diagnose ADHD in boys at a lower required posterior risk level, consistent with viewing missed diagnosis as relatively more costly for male patients.
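To see how signal quality and the threshold interact, here is a minimal sketch of a normal-normal updating rule using the estimates above. It rests on two assumptions not confirmed by this summary: that the signal equals true risk plus independent normal noise (so the weight on the signal is ρ squared), and that "posterior risk" is the posterior mean. The function names are hypothetical.

```python
def posterior_mean(x, mu, rho):
    """Posterior mean of latent risk v given signal x.

    Assumes v ~ Normal(mu, s^2) and x = v + independent normal noise,
    so rho = corr(x, v) and the weight placed on the signal is rho**2.
    """
    weight = rho ** 2
    return mu + weight * (x - mu)

def diagnose(x, mu, rho, tau):
    """Diagnose when the posterior mean exceeds the threshold tau."""
    return posterior_mean(x, mu, rho) > tau

def signal_cutoff(mu, rho, tau):
    """Signal value at which the posterior mean equals tau."""
    return mu + (tau - mu) / rho ** 2

# Point estimates reported in the summary.
params = {
    "male": {"mu": 0.290, "rho": 0.479, "tau": 0.257},
    "female": {"mu": 0.262, "rho": 0.552, "tau": 0.312},
}

for group, p in params.items():
    print(group, round(signal_cutoff(**p), 3))
```

Under these simplifying assumptions the implied signal cutoff is lower for boys than for girls, in line with the threshold gap described above. The paper's own definition of posterior risk may differ from this sketch.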

Q: How much of the 2.32:1 diagnostic gap can be attributed to each mechanism? A: Counterfactual simulations decompose the gap as follows: differences in underlying ADHD risk distribution account for approximately 20–25% of the diagnostic difference; differences in selection into behavioral assessments (utilization costs operating through assessment rates) account for approximately 20%; and physician decision-making differences account for the remaining majority, approximately 55–60%. Within physician factors, differences in diagnostic thresholds (τm < τf) are the single largest contributor, explaining roughly two-thirds of the overall male/female diagnostic gap.

Q: What do the type-specific threshold estimates reveal? A: When the baseline model is extended to allow separate diagnostic thresholds for inattentive vs. hyperactive/impulsive symptom sub-types, male patients show significantly lower thresholds for hyperactive/impulsive symptoms relative to inattentive symptoms (τ^HI_m < τ^Inatt_m). This is consistent with the hypothesis that more externally salient and disruptive symptoms carry larger classroom externalities, which physicians may implicitly factor into diagnosis decisions (following Aizer, 2008). For female patients, the threshold differences across symptom types are smaller and less statistically significant.

Q: What economic rationales does the paper offer for gender-specific diagnostic thresholds despite uniform guidelines? A: Three mechanisms are identified. First, higher diagnostic uncertainty for males (lower ρm) implies that under symmetric costs, Bayesian-rational physicians should set lower thresholds when the signal is noisier — this alone partially rationalizes the threshold gap. Second, hyperactive/impulsive symptoms predominant in boys impose greater externalities on classroom peers (Aizer, 2008), increasing the social benefit of diagnosis for boys on the margin. Third, females show substantially higher rates of co-morbid internalizing conditions (anxiety, depression) whose treatment may mitigate ADHD-related behaviors or whose interaction with stimulant medication makes the marginal ADHD diagnosis less beneficial for girls (Currie et al., 2014). These factors together suggest physicians may be responding to genuine heterogeneity in net diagnosis benefits, even if their behavior deviates from gender-neutral clinical guidelines.

Q: What share of the 2.3:1 national diagnostic gap is consistent with genuine symptom prevalence differences? A: Simulations indicate that only about 20–25% of the 2.32:1 male/female diagnostic difference can be explained by the underlying difference in ADHD risk distributions. The majority — roughly 75–80% — reflects factors beyond true prevalence: selection into care and, most substantially, physician decision-making differences including both signal quality and diagnostic thresholds.

Q: What are the policy implications? A: The findings suggest that targeted interventions in physician awareness and clinical training are likely more effective than generic awareness campaigns, since the dominant driver of the diagnostic gap is physician threshold-setting rather than symptom prevalence. Structured decision support tools or updated training that make physicians aware of gender-specific diagnostic patterns could reduce medically unwarranted diagnostic differences. Policies targeting patient-side access barriers (the ~20% explained by selection) remain relevant but secondary. The roughly 20–25% of the gap attributable to genuine symptom prevalence differences is, by construction, guideline-consistent and should not be targeted for elimination.

Q: What are the methodological contributions? A: The paper makes three methodological contributions. First, it develops a structural model of mental health diagnosis that explicitly incorporates endogenous patient selection, a feature absent from standard physician decision-making models and one the estimates show to be empirically important. Second, it applies machine learning and NLP to clinical doctor note text to construct key unobserved clinical variables (behavioral assessment indicator and ADHD match signal) that are unavailable as structured data in EHRs. Third, it identifies population mean health risk from quasi-exogenous variation in IPCP referral rates, analogous to the method Arnold et al. (2022) use to measure racial discrimination in bail decisions, adapted here to a continuous health risk setting with endogenous selection.

Diagnostic threshold (τ_θ): The gender-specific posterior ADHD risk level above which a physician chooses to diagnose ADHD. Set ex-ante, it reflects the physician’s perceived tradeoff between the costs of over-diagnosis (misdiagnosis) and under-diagnosis (missed diagnosis). A lower threshold implies the physician views missed diagnosis as relatively more costly for that patient group. By construction, uniform clinical guidelines imply a single threshold independent of patient gender.

ADHD match signal (x_i): A physician-observed, noisy signal of a patient’s true latent ADHD risk (v_i), observed only conditional on the patient receiving a behavioral assessment. In estimation, it is proxied via a cosine similarity measure between the patient’s aggregated clinical doctor note text and the DSM-V symptom list, constructed separately for inattentive and hyperactive/impulsive sub-types.

Signal quality / diagnostic uncertainty (ρ_θ): The correlation between the physician’s observed ADHD match signal and the patient’s true ADHD risk. Higher ρ means the physician’s signal is more informative and diagnostic uncertainty is lower. In the Bayesian updating framework, higher ρ implies the physician places more weight on the observed signal relative to the prior.

Mental healthcare utilization cost (c_i): The composite of all patient/caregiver factors that affect the decision to schedule a behavioral assessment net of child symptom level. Includes non-monetary barriers such as time constraints, distance, stigma, and information from primary care providers during wellness visits; does not include monetary out-of-pocket costs since insurance typically covers behavioral assessments.

Initial Primary Care Provider (IPCP) referral rate: The regression-adjusted share of a given PCP’s patients who ultimately receive a behavioral assessment at some point in the sample. Used as a quasi-exogenous cost-shifter that influences patient scheduling costs without being correlated with patient ADHD risk, enabling identification of population mean ADHD risk via extrapolation.

Latent ADHD risk (v_i): An unobserved continuous measure of a child’s underlying ADHD-related behavioral symptoms, drawn from a gender-specific normal distribution N(μ_θ, σ²_θ). A child’s true ADHD status is Si = 1(v_i > v̄), where v̄ is the DSM-V minimum symptom threshold, defined identically for boys and girls.

Adjusted Bag-of-Words (BOW) cosine similarity: The NLP method used to construct the ADHD match signal proxy. Patient notes are tokenized into uni-grams and bi-grams after preprocessing (spell check, abbreviation replacement, part-of-speech tagging, synonym replacement), and tf-idf weighted. The cosine similarity between the resulting document vector and the DSM-V symptom text vector is computed separately for each ADHD sub-type and rescaled to [0,1].
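A minimal sketch of the core similarity computation follows, using only the Python standard library. It omits the preprocessing steps (spell check, abbreviation and synonym replacement, part-of-speech tagging) and whatever adjustment and rescaling the paper applies. The smoothed idf formula is one common choice rather than the paper's confirmed weighting, and the example texts are invented stand-ins, not DSM-V wording or real clinical notes.

```python
import math
import re
from collections import Counter

def ngrams(text):
    """Lowercase word uni-grams and bi-grams."""
    words = re.findall(r"[a-z']+", text.lower())
    return words + [" ".join(pair) for pair in zip(words, words[1:])]

def tfidf_vectors(docs):
    """Return one {term: tf-idf weight} dict per document (smoothed idf)."""
    counts = [Counter(ngrams(d)) for d in docs]
    n = len(docs)
    doc_freq = Counter(term for c in counts for term in c)
    idf = {t: math.log((1 + n) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Invented example texts (hypothetical).
symptoms = "often fails to pay close attention, is easily distracted, often fidgets"
note_a = "patient is easily distracted in class and fidgets often"
note_b = "patient presents with sore throat and mild fever"

vec_symptoms, vec_a, vec_b = tfidf_vectors([symptoms, note_a, note_b])
sim_a = cosine(vec_a, vec_symptoms)
sim_b = cosine(vec_b, vec_symptoms)
print(sim_a > sim_b)
```

With non-negative tf-idf weights the cosine already lies between 0 and 1, so a note sharing no terms with the symptom text scores exactly 0.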

Professional survey forecasts and expectations in DSGE models

This paper asks whether Survey of Professional Forecasters (SPF) data can be efficiently integrated into medium-scale DSGE models, and whether models with imperfectly rational expectations based on Adaptive Learning (AL) outperform the standard Rational Expectations (RE) hypothesis when survey forecasts are used as observables. The authors work with quarterly US data spanning 1981q2–2019q2, using the Philadelphia Fed Real-Time Data Set (first and second releases) alongside SPF nowcasts for inflation, consumption, investment, and output growth. The SPF nowcast is defined as a prediction formed in the middle of period t+1 for period t+1 given information for period t, making it a suitable proxy for the model-based expectation E_t y_{t+1}.

The core methodological contribution is a re-specification of structural shocks into persistent (AR) and transitory (i.i.d.) components. For the risk premium, investment-specific technology, government spending, and markup shocks, each shock is decomposed into two independent innovations, yielding 12 total structural innovations. A reduced-form VAR exercise motivates this: SPF nowcast innovations explain 19–33% of the 5-year forecast error variance of the macro variables and 44–71% of the variance of the nowcasts themselves. Relative to the SPF, the 1-quarter RMSFE ratio of the baseline RE model without SPF is 1.10 for inflation, 1.26 for consumption, 1.19 for investment, and 1.26 for GDP, all significantly above one; the SPF RMSFEs themselves are 0.21, 0.43, 1.49, and 0.35.

Log marginal likelihood improves monotonically as shocks are progressively re-specified: baseline RE (–577.37), RE with two-component markups (RE_mu, –536.63), adding real shocks stepwise (–473.29, –410.84), and finally all shocks (RE_all, –385.07). RE_all matches or beats SPF 1-quarter forecast accuracy (RMSFE ratio to SPF of 1.00 for inflation and investment; beats SPF for consumption growth), and Diebold-Mariano tests show no significant difference from SPF up to 5 quarters ahead. The paper further shows that once this two-component structure is imposed, exogenous sentiment shocks become unnecessary: RE_all (–385.07) attains a higher log marginal likelihood than RES_all (–388.17), its counterpart with sentiment shocks.

Three AL belief specifications are then estimated: MSVflex (full RE information set with an independently and rapidly updating constant, posterior autocorrelation 0.9937 — nearly a random walk), RBflex (restricted information set augmented with shock innovations, with meaningful time-variation of belief coefficients at rho_AL = 0.87), and HBflex (agents switch between MSV and RB based on past forecasting performance; average RB weight 0.34, weight sensitivity delta = 4.77). All AL models outperform RE_all: MSVflex (–381.38), HBflex (–355.09), RBflex (–351.59), with RB and HB yielding the largest gains particularly during and after the Great Financial Crisis.

AL models address three specific RE limitations. First, trend breaks: the ALM constant tracks persistent deviations, with ALM constants for consumption and investment successfully picking up rising macroeconomic trends in earlier sub-periods, yielding superior long-term forecasts. Second, time-varying transmission: the RB model generates cyclical volatility that stays lower in normal times and rises during distress, reducing reliance on large persistent investment-technology shocks relative to RE. Third, predictability of forecast errors: the RE model’s investment forecast inherits the SPF underreaction (b-coefficient 0.72, p < 0.001), while RBflex and HBflex reduce this to 0.17 and 0.34 respectively, both statistically insignificant.

On an extended sample including the Covid recession, the RBflex model underperforms because its restricted information set cannot handle abrupt complex dynamics; MSVflex and HBflex continue to perform well, with the MSV regime dominating in the HB model during Covid and post-Covid periods. Scope conditions: the dataset is US, 1981q2–2019q2 for baseline estimation; the predictability (underreaction) problem is confirmed only for investment SPF, not for inflation, consumption, or GDP growth in this sample.

In depth

Q1. What is the SPF nowcast, and why do the authors treat it as a proxy for model-based expectations?

The SPF nowcast is defined as a prediction formed in the middle of quarter t+1 for the value of a variable in quarter t+1, conditional on information available through quarter t. Because agents are assumed to make decisions for period t and form expectations for t+1 based on information through t, this timing aligns precisely with the model-based conditional expectation E_t y_{t+1}. The authors use first-release data (r1) and the SPF nowcast (f0) both published in the course of t+1 as measurement variables, with the Kalman filter recovering implied structural shocks.

Q2. How large is the informational content of SPF nowcasts in reduced-form analysis?

A 7-variable Cholesky VAR places each SPF series last, so the survey innovation is orthogonal to standard macro variables by construction. The 5-year forecast error variance decompositions show SPF nowcast shocks explain 19% of inflation variance, 33% of consumption variance, 33% of investment variance, and 29% of GDP variance (Table 1). Own innovations account for 44–71% of the variance of the nowcasts themselves. SPF nowcasts also substantially outperform the baseline RE model: the RE model without SPF produces RMSFE ratios of 1.10 for inflation, 1.26 for consumption, 1.19 for investment, and 1.26 for GDP relative to SPF (all statistically significant by Diebold-Mariano test).

Q3. What is the shock re-specification, and why is it necessary to exploit survey data?

The Smets-Wouters (2007) ARMA(1,1) shock structure conflates the transitory and persistent innovations into a single disturbance, making it impossible for the Kalman filter to attribute high-frequency and low-frequency movements separately. The re-specification splits each shock b_t into a persistent component b_t^{ar} (driven by the innovation epsilon^{b,ar}, with persistence rho_b) and an i.i.d. transitory component b_t^{iid} (driven by the innovation epsilon^{b,iid}), yielding 12 total structural innovations. This allows the forward-looking survey nowcasts to identify the persistent component separately from the transitory one. Without it, model fit is far worse: the log marginal likelihood is –577 for baseline RE versus –385 for RE_all.
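The two-component structure can be sketched as a simulation of one shock as the sum of an AR(1) part and an i.i.d. part. The parameter values below (persistence 0.9, innovation standard deviations 0.1 and 0.2) are hypothetical and not taken from the paper's estimates; the function names are likewise illustrative.

```python
import random

def simulate_shock(rho, sd_ar, sd_iid, periods, seed=0):
    """Simulate b_t = b_ar_t + b_iid_t, with b_ar_t = rho * b_ar_{t-1} + eps_ar_t.

    Returns a list of (persistent, transitory, total) tuples.
    """
    rng = random.Random(seed)
    b_ar, path = 0.0, []
    for _ in range(periods):
        b_ar = rho * b_ar + rng.gauss(0.0, sd_ar)   # persistent component
        b_iid = rng.gauss(0.0, sd_iid)              # transitory component
        path.append((b_ar, b_iid, b_ar + b_iid))
    return path

def implied_autocorr(rho, sd_ar, sd_iid):
    """First-order autocorrelation of b_t implied by the two components."""
    var_ar = sd_ar ** 2 / (1 - rho ** 2)  # stationary variance of the AR(1) part
    return rho * var_ar / (var_ar + sd_iid ** 2)

path = simulate_shock(rho=0.9, sd_ar=0.1, sd_iid=0.2, periods=200)
print(round(implied_autocorr(0.9, 0.1, 0.2), 3))
```

The total shock is less persistent than its AR(1) part whenever the transitory variance is positive, which is why a single innovation cannot reproduce both the high-frequency and the low-frequency variation at once.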

Q4. Does re-specification of real shocks render exogenous sentiment shocks redundant?

Yes. Models with standard real shock processes but exogenous sentiment shocks (RES: –477.88; RES_mu: –488.96) do fit substantially better than models without sentiment (RE: –577.37; RE_mu: –536.63), confirming Milani’s (2017) result. However, once the two-component real shock structure is introduced, RE_all (–385.07) outperforms RES_all (–388.17) and the estimated sentiment shocks become small and explain little of the business cycle. The fundamental shock re-specification subsumes what sentiment shocks were previously capturing.

Q5. How do AL models compare to RE in terms of model fit?

All three AL models outperform RE_all in log marginal likelihood: MSVflex (–381.38, a gain of 3.69 points), HBflex (–355.09, a gain of 29.98), and RBflex (–351.59, a gain of 33.48). The RB and HB specifications, which assume more severe deviation from RE with restricted information sets and time-varying transmission, achieve the largest gains. The MSV improvement accumulates gradually, concentrating in the late 1990s and 2000s, while RB shows sustained improvement in the 1980s and mid-1990s and performs exceptionally well during and after the GFC.
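The gains quoted above follow directly from the reported log marginal likelihoods, as this short sketch checks. A difference in log marginal likelihoods is a log Bayes factor, so exponentiating it gives the implied Bayes factor; the helper names `gain_over` and `bayes_factor` are hypothetical.

```python
import math

# Log marginal likelihoods as reported in the summary.
log_ml = {
    "RE": -577.37,
    "RE_mu": -536.63,
    "RE_all": -385.07,
    "RES_all": -388.17,
    "MSVflex": -381.38,
    "HBflex": -355.09,
    "RBflex": -351.59,
}

def gain_over(model, baseline="RE_all"):
    """Log marginal likelihood of `model` minus that of `baseline`."""
    return round(log_ml[model] - log_ml[baseline], 2)

def bayes_factor(model, baseline="RE_all"):
    """Bayes factor in favor of `model` against `baseline`."""
    return math.exp(log_ml[model] - log_ml[baseline])

for name in ("MSVflex", "HBflex", "RBflex"):
    print(name, gain_over(name))
```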

Q6. How does the AL mechanism handle macroeconomic trend shifts?

Under RE with fixed coefficients, expectations anchor around a constant steady state, so persistent deviations from trend generate systematic forecast errors. Under AL, the ALM constant mu_t in the Actual Law of Motion evolves over the business cycle. In the MSVflex model, the autocorrelation parameter for the constant is estimated at 0.9937 (posterior mean), making it nearly a random walk that can track long-lasting trends. ALM constants for consumption and investment in the MSV setup successfully pick up rising macroeconomic trends in earlier sub-periods, translating into superior longer-term forecast performance relative to RE.
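To give a sense of what a persistence of 0.9937 means, the sketch below computes the half-life of a deviation under a simple AR(1) process. Treating the belief constant as a plain AR(1) is a simplifying assumption made here for illustration, and `half_life` is a hypothetical helper.

```python
import math

def half_life(rho):
    """Periods needed for an AR(1) deviation to decay to half its size."""
    if not 0.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between 0 and 1")
    return math.log(0.5) / math.log(rho)

# Persistence of the MSVflex constant reported in the summary (quarterly data).
quarters = half_life(0.9937)
print(round(quarters, 1), "quarters, about", round(quarters / 4, 1), "years")
```

Under this assumption a deviation takes on the order of a hundred quarters to halve, which is what allows the constant to follow long-lasting trend movements.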

Q7. How does the RB model generate time-varying volatility, and why does this matter for investment dynamics?

In RBflex, as beliefs are revised via the Kalman filter, the sensitivity of expectations and realized variables to shocks changes over the business cycle. The model generates cyclical volatility that remains lower in normal times and rises during distress — a realistic pattern absent from RE models. Consequently, RB does not need to rely as heavily on large persistent risk premium and investment-specific technology shocks: average volatility of these processes in the RB model does not increase in the last sub-period and remains generally lower across the whole sample, in contrast to RE’s behavior during the GFC. The RB model also shows a 3-times-smaller estimated measurement error in the investment SPF equation relative to the AL specification without restricted beliefs.

Q8. What happens to predictability of model-based forecast errors under AL versus RE?

Using the Coibion-Gorodnichenko (2015) regression of forecast errors on forecast revisions, the RE model’s investment forecast shows a b-coefficient of 0.72 (p < 0.001), inheriting the underreaction documented in SPF investment data (b = 0.49, p = 0.006). AL models break this inheritance: RBflex ALM b-coefficient for investment is 0.17 (not statistically significant) and HBflex is 0.34 (not statistically significant). AL models achieve this because they relax the RE constraint of internal consistency between agents’ and model forecasts, allowing the ALM to generate efficient forecasts even when agent PLMs display sluggish adjustment.
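The regression behind the b-coefficient can be sketched as a bivariate OLS of forecast errors on forecast revisions. The simulated data below are hypothetical: a forecaster is assumed to incorporate only a fraction `k` of each piece of news into the revision, which mechanically produces a positive slope. None of the numbers correspond to the paper's estimates.

```python
import random
from statistics import mean

def cg_slope(errors, revisions):
    """OLS slope b in: forecast_error = a + b * forecast_revision + residual."""
    r_bar, e_bar = mean(revisions), mean(errors)
    sxx = sum((r - r_bar) ** 2 for r in revisions)
    sxy = sum((r - r_bar) * (e - e_bar) for r, e in zip(revisions, errors))
    return sxy / sxx

def simulate(k, n, noise_sd=0.0, seed=0):
    """Hypothetical forecaster who passes a share k of news into revisions."""
    rng = random.Random(seed)
    news = [rng.gauss(0.0, 1.0) for _ in range(n)]
    revisions = [k * x for x in news]
    # The unincorporated share of news shows up in the forecast error.
    errors = [(1 - k) * x + rng.gauss(0.0, noise_sd) for x in news]
    return errors, revisions

errors, revisions = simulate(k=0.6, n=500, noise_sd=0.5)
print(cg_slope(errors, revisions) > 0)
```

In this toy setup full incorporation of news (`k = 1`) gives a slope of zero, the benchmark implied by Full Information Rational Expectations, while partial incorporation gives a positive slope, the underreaction pattern.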

Q9. How do the models perform during the Covid recession?

The RBflex model underperforms on the extended sample that includes the Covid recession. The authors attribute this to the restricted information set in the RB PLM being insufficient to describe the abrupt, complex macroeconomic dynamics of the Covid crisis. The MSVflex and HBflex models continue to perform well. In the HBflex model, the MSV regime naturally dominates during the Covid and post-Covid periods, while the RB regime had been more prominent between recessions in the pre-Covid sample.

Q10. What is the role of heterogeneous beliefs, and how do agents switch between PLMs?

In HBflex, expectations are a weighted average of MSV and RB predictions with weights evolving as a function of past belief forecast errors. The weight sensitivity parameter is estimated at delta = 4.77, indicating weights are relatively sensitive to fitness. The average estimated weight on the RB PLM is 0.34 (MSV receives 0.66 on average). The RB weight tends to increase and reach its highest values between recessions, consistent with the restricted model being more parsimonious and useful in stable periods, while the fuller MSV model dominates in high-volatility episodes such as the Covid recession.
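One way such performance-based weights could work is a logit rule on recent squared forecast errors, sketched below. The logit functional form and the use of mean squared errors as the fitness measure are assumptions made for illustration; the summary only states that weights depend on past belief forecast errors with sensitivity delta = 4.77. The function names and input values are hypothetical.

```python
import math

def rb_weight(mse_rb, mse_msv, delta=4.77):
    """Weight on the RB forecast under an assumed logit rule.

    A lower recent mean squared error for RB raises its weight;
    delta controls how sharply weights react to the error gap.
    """
    return 1.0 / (1.0 + math.exp(delta * (mse_rb - mse_msv)))

def combined_forecast(forecast_rb, forecast_msv, weight_rb):
    """Weighted average of the two PLM forecasts."""
    return weight_rb * forecast_rb + (1.0 - weight_rb) * forecast_msv

# Hypothetical recent errors: RB has been forecasting better than MSV.
w = rb_weight(mse_rb=0.05, mse_msv=0.20)
print(round(w, 3), round(combined_forecast(1.0, 2.0, w), 3))
```

With equal past errors the rule splits the weight evenly, and a larger delta pushes the weights further toward whichever PLM has recently forecast better.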

Q11. What are the out-of-sample forecasting results?

The out-of-sample evaluation covers 2008q1–2019q2. The RB model outperforms the RE model in predicting investment and interest rate dynamics, and for investment it also outperforms professional forecasters during this period. At longer horizons (up to 5 quarters ahead), RE model forecasts are generally not statistically significantly different from SPF predictions once SPF nowcasts are included as observables, suggesting that observing the SPF data is sufficient to capture the most informative content from surveys for longer-horizon predictions.

Q12. What is the relationship to Milani (2017) and the prior literature on sentiment shocks?

Milani (2017) found that exogenous sentiment shocks orthogonal to fundamentals were needed to fit SPF forecasts alongside an AL model and explained a significant portion of US business cycle fluctuations. The current paper shows this result is not robust to re-specifying fundamental shocks into persistent and transitory components: once the two-component structure is introduced, sentiment shocks become small and economically unimportant (RES_all at –388.17 versus RE_all at –385.07). What Milani attributed to sentiment was largely capturing the inability of single-innovation shocks to separately account for high-frequency and low-frequency variance.

Key concepts

SPF Nowcast as proxy for model expectations: The Survey of Professional Forecasters’ nowcast is defined as a prediction formed in the middle of quarter t+1 for the value of a variable in that same quarter, conditional on information available through quarter t. This timing makes it directly comparable to the model-based conditional expectation E_t y_{t+1}, so the SPF nowcast can be added to the DSGE model’s observable set with a straightforward measurement equation linking it to model expectations plus i.i.d. measurement error.
Shock re-specification into persistent and transitory components: Each structural shock (risk premium, investment-specific technology, government spending, and markup shocks) is decomposed into an AR(1) persistent component driven by the innovation epsilon^{b,ar} and an i.i.d. transitory component driven by the innovation epsilon^{b,iid}, replacing the ARMA(1,1) specification in Smets-Wouters (2007) that conflates both into a single innovation. This decomposition is the key technical device enabling survey data to separately identify low-frequency and high-frequency sources of volatility.
Adaptive Learning (AL): An expectation-formation mechanism in which agents do not know true model parameters and instead estimate linear forecasting models (PLMs) that are updated each period via a Kalman filter algorithm. This produces a time-varying Actual Law of Motion — transmission parameters mu_t, T_t, R_t all evolve with beliefs — enabling endogenous trend drift and time-varying shock responses absent from RE models with fixed coefficients.
Minimum State Variable (MSV) beliefs with flexible constant: An AL specification in which agents use the same endogenous state variables and shocks as in the RE solution but with the constant term updated at an independent, more rapid rate. The constant’s autocorrelation is estimated at 0.9937, making it nearly a random walk capable of tracking persistent macroeconomic trend deviations from the deterministic steady state.
Restricted Beliefs (RB): An AL specification in which each agent’s PLM uses a reduced information set — autoregressive terms of the forward-looking variable augmented with selected shock innovations — rather than the full RE state space. This more severe departure from RE yields the largest marginal-likelihood gain over RE_all, generates realistic cyclical volatility amplification, and produces a 3-times-smaller measurement error for investment SPF, but underperforms during the Covid recession due to the restricted set’s inability to handle abrupt complex dynamics.
Heterogeneous Beliefs (HB): An AL specification in which agents may switch between MSV and RB PLMs as a weighted average, with weights evolving as a function of past belief forecast errors. The average weight on RB is 0.34 and the weight sensitivity delta is estimated at 4.77; the RB weight tends to be highest between recessions and lowest during high-volatility episodes such as the Covid recession when the fuller MSV information set dominates.
FIRE predictability test (Coibion-Gorodnichenko regression): Under Full Information Rational Expectations, the regression of forecast errors on forecast revisions should yield a b-coefficient of zero. A positive and significant b indicates systematic underreaction to news. The paper confirms b = 0.49 (p = 0.006) for investment SPF — but not for inflation, consumption, or GDP — and shows the RE model inherits this inefficiency (b = 0.72, p < 0.001 for investment), while AL models reduce it to insignificance (RBflex: 0.17; HBflex: 0.34).