Forthcoming [American Economic Journal: Macroeconomics] doi:10.1257/mac.20240184

Identifying Monetary Policy Shocks: A Natural Language Approach

S. Borağan Aruoba

Thomas Drechsel

What this paper finds — and why it matters

Layer 1: Overview

Research question and motivation: To study how monetary policy affects the economy, macroeconomists must isolate “shocks” — changes in interest rates that are not systematic responses to economic conditions. The paper proposes a new identification method that captures the Federal Reserve’s information set far more comprehensively than prior approaches, using the natural-language text of documents Fed staff prepare for FOMC meetings, not just numerical forecasts.

Method and data: The approach extends Romer and Romer (2004), who regress changes in the Federal Funds Rate (FFR) target on Greenbook forecasts and take the residual as the shock. The authors instead convert the text of FOMC documents into many “aspect-based” sentiment time series and predict the FFR change with both these sentiments and an expanded forecast set. They process 772 PDF files for 276 meetings (630 files for 210 meetings before the zero lower bound), covering Greenbook 1/2, Tealbook A, Redbook, and Beigebook documents, starting October 5, 1982 (when the Fed began targeting the FFR per Thornton 2006). Most documents are released with a 5-year lag, so the latest is from end-2016. They extract the most frequently mentioned economic terms, yielding 296 single/multi-word concepts (e.g., “inflation,” “economic activity”). For each concept they build a sentiment indicator by scoring positive (+1) and negative (-1) words within a 10-word window, using an augmented Loughran-McDonald (2011) dictionary of 2,882 classified words. The empirical model (equation 3) includes 132 forecast series, 296 sentiment indicators with 4 lags, and quadratic terms — 3,226 regressors total — far exceeding the 210 FOMC-meeting observations over October 1982 to October 2008. They estimate it with a ridge regression, choosing the penalty by 10-fold cross-validation; the shock is the residual.
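The windowed scoring step can be sketched as follows. This is a minimal illustration, not the authors' code: the tiny word lists, the tokenizer, the example sentence, and the choice to count the window on each side of the concept are all assumptions made here. The paper uses an augmented Loughran-McDonald dictionary of 2,882 words, and its exact window and normalization conventions are not given in this summary.

```python
import re

# Hypothetical mini-dictionary; stands in for the augmented Loughran-McDonald list.
POSITIVE = {"strong", "improved", "gains"}
NEGATIVE = {"weak", "concern", "declined"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def aspect_sentiment(text, concept, window=10):
    """Sum +1/-1 scores for dictionary words within `window` tokens
    on either side of each mention of `concept` (assumed convention)."""
    tokens = tokenize(text)
    target = concept.lower().split()
    k = len(target)
    score = 0
    for i in range(len(tokens) - k + 1):
        if tokens[i:i + k] != target:
            continue  # not a mention of the concept
        lo = max(0, i - window)
        hi = min(len(tokens), i + k + window)
        for word in tokens[lo:i] + tokens[i + k:hi]:
            if word in POSITIVE:
                score += 1
            elif word in NEGATIVE:
                score -= 1
    return score

# Made-up sentence, for illustration only.
doc = ("Economic activity improved and posted strong gains, "
       "but there was concern that inflation pressures could build.")

activity_score = aspect_sentiment(doc, "economic activity")
inflation_wide = aspect_sentiment(doc, "inflation")
inflation_narrow = aspect_sentiment(doc, "inflation", window=3)
```

In this toy sentence a wide window lets the positive words near "economic activity" spill into the "inflation" score, while a narrow window isolates the nearby "concern". That illustrates why window length is a modeling choice worth checking.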

Main quantitative findings: (1) Fit/systematic share: the original Romer-Romer OLS specification yields R-squared of 0.50 (so 50% of FFR variation is attributed to shocks), while the preferred nonlinear ridge with forecasts and sentiments yields R-squared of 0.94 — cutting the exogenous shock share from 50% to 6%, an almost ten-fold reduction. Lags 0–4 give R-squared of 0.75, 0.81, 0.90, 0.92, 0.94. (2) Information content: text-based sentiments predict Greenbook unemployment-rate forecast errors; a one-standard-deviation increase in the sentiment first principal component is associated with an almost 0.5 percentage-point negative 1-year-ahead forecast error (R-squared up to 0.25), supporting the view that staff forecasts are modal, not mean, predictions. (3) Comparison to high-frequency surprises: correlation with Swanson (2021) FFR surprises (1991–2008) is 0.49 (vs. 0.36 for Romer-Romer); 0.77 for the top-10 shocks (vs. 0.61) and 0.51 for the top-10 surprises (vs. 0.18). The estimated shocks have lower autocorrelation (0.066 vs. 0.204 for Romer-Romer). (4) IRFs (BVAR with shock as external instrument, IRF sample 1984:02–2016:12): a tightening produces a persistent yield rise (about 20 months), a fall in real output and rise in unemployment materializing after about a year, a sluggish decline in the price level (mild initial “price puzzle,” visibly negative after about 18 months, significantly negative after 30 months), a sharp rise in the excess bond premium, and a fall in stock prices — all consistent with theory. By contrast, Romer-Romer OLS residuals imply flat output/unemployment responses, an insignificant EBP response, and positive stock-price/rate comovement, at odds with theory.
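The mapping from fit to "shock share" in finding (1) is simply one minus R-squared. A short sketch using only the figures quoted above makes the lag-by-lag numbers explicit (the function and variable names are illustrative):

```python
# R-squared by number of sentiment lags, as reported in the summary above.
r2_by_lags = {0: 0.75, 1: 0.81, 2: 0.90, 3: 0.92, 4: 0.94}

def shock_share(r_squared):
    """Share of FFR-change variation left in the residual (the 'shock')."""
    return 1.0 - r_squared

# Residual share for each lag specification, rounded to two decimals.
shares = {lags: round(shock_share(r2), 2) for lags, r2 in r2_by_lags.items()}

# Original Romer-Romer OLS specification (R-squared of 0.50).
romer_romer_share = round(shock_share(0.50), 2)
```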

Implications: Text-based information is essential for clean identification: even the original Romer-Romer method needs it to recover responses correctly (especially that of unemployment). A Beigebook-only version extends the method to recent meetings, implying the 2022–2023 tightening (525 bp total) carried only about 21 bp of contractionary shock.

Layer 2: Deep Dive

What exactly is the identification strategy, and what are the main threats to it?

Monetary policy shocks are defined (equation 1) as the residual after orthogonalizing the FFR target change against the central bank’s information set. The authors proxy that information set with the full numerical-forecast set plus 296 text-derived sentiment indicators (with 4 lags and quadratic terms), and estimate the prediction via ridge regression with 10-fold cross-validation. The shock is the residual. Two key assumptions inherited from Romer-Romer are threats: (i) the included variables must be a good proxy for the true information set — the paper argues forecasts alone are insufficient because they are modal, not mean, predictions and assume a specific policy path (Faust-Wright 2008), which is why text is required; and (ii) the mapping from information to decisions must be well-specified — they relax linearity by adding quadratic terms. A residual concern is that even the large information set may not capture truly idiosyncratic considerations, but they argue this is exactly what should remain in the shock.
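In symbols, the strategy can be written schematically as below. The notation is an assumption introduced here for illustration and may not match the paper's equation 1: \Delta i_t is the change in the FFR target at meeting t, \Omega_t the Fed's true information set, X_t the observed proxy (forecasts, sentiments, their lags and quadratic terms), and \hat{f} the ridge prediction.

```latex
\Delta i_t = f(\Omega_t) + \varepsilon_t,
\qquad
\hat{\varepsilon}_t = \Delta i_t - \hat{f}(X_t)
```

Threat (i) is that X_t is a poor proxy for \Omega_t; threat (ii) is that \hat{f} misspecifies f.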

Why are text sentiments necessary beyond numerical forecasts — what is the Cochrane critique and how do they answer it?

Cochrane (2004) argued that to study the effect of policy on a given variable, it suffices to orthogonalize the FFR against the Fed’s forecast of that variable alone, since an efficient forecast incorporates all relevant information. This holds only if Greenbook forecasts equal the conditional mean. The authors show, via FOMC transcripts (Appendix D, spanning 1985–2016) and econometrics, that staff produce modal forecasts accompanied by verbal descriptions of asymmetric risks. Their sentiment indicators predict Greenbook unemployment forecast errors (Table 2): the first PC and even the single ‘economic activity’ sentiment are significant at multiple horizons (R-squared up to 0.25; a 1-sd PC increase implies an almost 0.5 pp negative 1-year error). After orthogonalizing forecast errors on sentiment, the error distribution becomes more symmetric and centered on zero (Figure 3). Hence at least some text information is required even for the original Romer-Romer method to recover the true unemployment response.
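A toy example shows why a modal forecast leaves predictable errors when risks are asymmetric. The distribution below is entirely hypothetical, and the sign convention (error = actual minus forecast) is an assumption made here that may differ from the paper's.

```python
# Hypothetical skewed distribution for the change in unemployment (pp):
# the most likely outcome is "no change", but upside risks are larger.
outcomes = {0.0: 0.6, 0.5: 0.3, 1.5: 0.1}  # value -> probability

# Modal forecast: the single most likely outcome.
mode_forecast = max(outcomes, key=outcomes.get)

# Mean forecast: the probability-weighted average outcome.
mean_forecast = sum(value * prob for value, prob in outcomes.items())

def expected_error(forecast):
    """Expected value of (actual - forecast) under the distribution."""
    return sum(prob * (value - forecast) for value, prob in outcomes.items())

bias_of_mode = expected_error(mode_forecast)   # nonzero: systematic error
bias_of_mean = expected_error(mean_forecast)   # approximately zero
```

Any variable that tracks the skew of the distribution, such as text describing asymmetric risks, would help predict the modal forecast's error. That is the role the sentiment indicators play in the argument above.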

Why ridge regression rather than LASSO or OLS?

OLS is infeasible (3,226 regressors vs. 210 observations). Ridge minimizes the residual sum of squares plus a penalty on squared coefficients (shrinkage toward zero), which is equivalent to a Bayesian linear regression with a normal prior centered at zero. Unlike LASSO (which produces sparse models), ridge keeps all regressors (a dense model), making it more akin to factor models/PCA. The authors prefer dense methods because economic data have many correlated regressors and few observations; Giannone, Lenza, and Primiceri (2022) (‘the illusion of sparsity’) find sparse methods become unstable under high collinearity, which is clearly present across forecasts and sentiments here. The penalty lambda is chosen by 10-fold cross-validation, so the high R-squared is not purely mechanical.
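The estimation logic (ridge with more regressors than observations, penalty picked by 10-fold cross-validation, residual taken as the shock) can be sketched in plain Python. Everything here is illustrative: the simulated data, the dimensions, the penalty grid, the interleaved fold assignment, and the omission of an intercept and of any regressor standardization are assumptions, not the authors' implementation.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def ridge_fit(X, y, lam):
    """Closed-form ridge: beta = (X'X + lam*I)^(-1) X'y (no intercept)."""
    p = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) + (lam if i == j else 0.0)
            for j in range(p)] for i in range(p)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve(XtX, Xty)

def predict(X, beta):
    """Linear prediction X @ beta for each row of X."""
    return [sum(b * v for b, v in zip(beta, row)) for row in X]

def cv_mse(X, y, lam, k=10):
    """Mean squared out-of-fold prediction error with k interleaved folds."""
    n = len(y)
    total = 0.0
    for fold in range(k):
        test_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        beta = ridge_fit([X[i] for i in train_idx],
                         [y[i] for i in train_idx], lam)
        preds = predict([X[i] for i in test_idx], beta)
        total += sum((y[i] - pred) ** 2 for i, pred in zip(test_idx, preds))
    return total / n

# Simulated stand-in data: 20 "meetings", 30 regressors (more than observations).
random.seed(0)
n_obs, n_reg = 20, 30
X = [[random.gauss(0, 1) for _ in range(n_reg)] for _ in range(n_obs)]
dense_beta = [0.1] * n_reg  # every regressor matters a little (dense model)
y = [sum(b * v for b, v in zip(dense_beta, row)) + random.gauss(0, 0.1)
     for row in X]

grid = [0.1, 1.0, 10.0, 100.0]                 # hypothetical penalty grid
best_lam = min(grid, key=lambda lam: cv_mse(X, y, lam))
beta_hat = ridge_fit(X, y, best_lam)
shocks = [yi - fi for yi, fi in zip(y, predict(X, beta_hat))]  # residuals
```

Because the penalty is chosen on held-out folds rather than in-sample fit, a very small lambda that merely interpolates the data is not automatically preferred, which is the sense in which the high R-squared is not purely mechanical.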

How do the authors interpret what the shocks capture, and what case studies support this?

They inspect FOMC discussions in meetings with the largest estimated shocks. November 7, 1984: the largest shock in absolute value. Staff forecasts and sentiments predict 53 bp of a 75 bp FFR decline, leaving a -22 bp easing shock, driven by FOMC participants finding the staff forecast too optimistic. November 15, 1994: a 75 bp hike of which 21 bp is a contractionary shock. Greenspan argued ‘a mild surprise would be of significant value’ for credibility, and the 25 bp gap between his 75 bp decision and the staff’s 50 bp option is close to the estimated 21 bp. The interpretation: shocks are FFR decisions that are ‘surprises’ to the Fed staff, that is, orthogonal to the staff’s information set. They note their interpretation is narrower than Romer-Romer’s (which included target-definition changes and political pressure, both pre-1982 phenomena per Drechsel 2023). Systematic credibility concerns would be absorbed into systematic policy; only nonsystematic ones become shocks.
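The two case studies reduce to the identity "shock = actual change minus predicted change". Written out with the figures quoted above (the notation is illustrative, and the 54 bp predicted component for 1994 is implied by the quoted numbers rather than stated directly):

```latex
\begin{align*}
\hat{\varepsilon}_{\text{Nov 1984}} &= \underbrace{-75}_{\text{actual}} - \underbrace{(-53)}_{\text{predicted}} = -22 \text{ bp} \\
\widehat{\Delta i}_{\text{Nov 1994}} &= \underbrace{75}_{\text{actual}} - \underbrace{21}_{\text{shock}} = 54 \text{ bp}
\end{align*}
```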

What are the three interpretations of why Romer-Romer IRFs go wrong, and how are they distinguished?

(1) Unemployment: because Greenbook unemployment forecasts are modal and text-sentiment predicts their errors, the Romer-Romer OLS cannot fully absorb asymmetric risk shifts, producing a spurious correlation (easing shocks estimated when unemployment rises) and thus a flat/incorrect unemployment IRF (Figure 6). (2) Stock prices: the Fed systematically reacts to equities (Cieslak and Vissing-Jorgensen 2020); failing to control for this leaves spurious positive rate/stock comovement. They test this by adding HF S&P500 surprises as a second instrument with Jarocinski-Karadi (2020) sign restrictions (negative rate/stock comovement for policy shocks): their measure already satisfies the restrictions (Panel a barely changes), whereas the Romer-Romer IRFs change drastically once imposed, ‘correcting’ activity/price/EBP responses (Figure 7). (3) Credit spreads: Romer-Romer residuals retain endogenous credit-spread variation; the authors’ sentiments include ‘spreads,’ ‘credit standards,’ ‘credit quality.’ Caldara and Herbst (2019) show that ignoring the Fed’s credit-spread reaction attenuates IRFs, supporting this channel.

What robustness checks are run?

(1) 5-word vs. 10-word sentiment windows give nearly identical R-squared (0.95 vs. 0.94 in the top spec). (2) Sentence-based sentiment construction is highly correlated with the window-based version (0.875 for employment, 0.959 for credit; Appendix C). (3) Lag structure: 0–4 lags raise R-squared 0.75→0.94 with diminishing gains past 4 lags. (4) FOMC composition controls (governor/bank-rep attendance, voting status, appointing president, female attendance) raise R-squared by less than 0.1% — personal dynamics do not drive FFR changes. (5) Alternative nonlinear forms: cubic residuals 99% correlated with quadratic; a ~40,000-variable full-interaction spec yields residuals 96% correlated with quadratic. (6) Forecast-error predictability holds for output and inflation too (Appendix E), and using first-release vs. final-vintage data gives similar results. (7) Local projections (Jorda 2005) confirm the BVAR results, with Romer-Romer again off-theory. (8) IRFs built from only the 10 largest shocks reproduce the main pattern. (9) The extended-forecast ridge (no sentiments) already corrects the IRFs, though the authors stress theory-consistent IRFs are necessary but not sufficient for a good shock measure.

How does the Beigebook-only extension work and what does it find?

Tealbooks/forecasts are released with a 5-year lag, but Beigebooks are public before each meeting. Over 1982–2008, building sentiments from Beigebooks alone gives indicators strongly correlated with the baseline (e.g., ‘economic activity’, Figure 8), an R-squared of 0.68 (vs. 0.94 with full documents), and shocks correlated 0.92 with the baseline shocks, with qualitatively similar IRFs. As a proof of concept over December 2015–October 2023 (excluding the March 2020–December 2021 ZLB period), the R-squared is 0.98. Inflation sentiment dropped more than 6 standard deviations in late 2021/early 2022 (driven by ‘concern’ near ‘inflation’). Of the 525 bp of total tightening in 2022–2023, the estimates attribute only about 21 bp to cumulative contractionary shocks, i.e., the tightening was mostly systematic. This extension is impossible for Romer-Romer because Beigebooks contain no numerical forecasts.
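As a back-of-the-envelope reading of the 2022–2023 figures quoted above (a share of basis points, not a variance decomposition):

```latex
\frac{21 \text{ bp}}{525 \text{ bp}} = 0.04,
\qquad
525 - 21 = 504 \text{ bp systematic} \;(96\%)
```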

How does this paper relate to and differ from closely related prior work?

It contributes to three literatures. (1) Monetary-shock identification: builds directly on Romer-Romer (2004) but adds NLP/ML and a much larger information set; contrasts with SVAR and high-frequency approaches (Gurkaynak et al. 2005, Gertler-Karadi 2015, Swanson 2021, Bauer-Swanson). (2) Text/ML on Fed documents: unlike Sharpe-Sinha-Hollrah (2020), who build a single sentiment index, the authors build aspect-based sentiments per concept; closest are Handlan (2020), who builds a ‘text shock’ separating forward guidance from current assessment since 2005, and Ochs (2021), who extracts surprises from the private agents’ viewpoint. The authors instead orthogonalize against the Fed’s internal information set, staying closer to Romer-Romer. (3) Greenbook-forecast literature (Romer-Romer 2000, Faust-Wright, Nakamura-Steinsson 2018): they emphasize the modal nature of forecasts and show sentiments explain forecast errors on average.

What are the policy/research implications and their scope conditions?

The method delivers a cleanly identified, ‘all-purpose’ shock series usable for any macro variable — including ones without Fed forecasts (e.g., credit spreads). It spans a longer period than HF measures (which begin in the early 1990s due to futures-data availability and the fact that the FOMC did not announce rate changes publicly before 1994). Scope conditions: the preferred (Tealbook-based) measure requires the 5-year document lag, so recent meetings need the lower-fidelity Beigebook-only version (R-squared 0.68 in-sample); the main estimation sample ends October 2008 to avoid the ZLB. The method relies on the structured, consistent wording of Fed-staff documents, making dictionary-based sentiment particularly applicable. The authors recommend using the baseline measure whenever feasible, even at the cost of dropping recent observations, and resorting to Beigebook-only only when that cost is high. They also suggest combining their measure with HF surprises as multiple external instruments.

Are there caveats about interpreting the model’s coefficients?

Yes. The ridge is built for prediction (y-hat), not coefficient interpretation (beta-hat). With 3,226 highly collinear regressors (a count that already includes the lags and quadratic terms), individual coefficients cannot be cleanly interpreted. The authors invoke Mullainathan-Spiess (2017), who argue that ML belongs in the y-hat toolbox, and offer a self-driving-car analogy. A potential downside of a large information set is low statistical power in the shock (since more variation becomes systematic), but they show via the BVAR IRFs that power is not a problem in practice.

How this summary was made. Bibliographic fields are pulled from Crossref and OpenAlex and are not model-generated. The summary was drafted from the open-access manuscript, checked by a claim-grounding and calibration review pass, and approved before publishing.