Macro Paper Warehouse Forthcoming macro & monetary research
Forthcoming [Journal of Political Economy] doi:10.1086/742421

Mis(sed) Diagnosis: Physician Decision Making and ADHD

Kelli Marquardt

What this paper finds — and why it matters

This paper develops and estimates a structural model of ADHD diagnosis to decompose the mechanisms driving the observed 2.3:1 male-to-female diagnostic difference in the United States. The research question is: to what extent does the large gender gap in ADHD diagnosis reflect true differences in symptom prevalence, versus patient-side utilization costs, versus physician decision-making under uncertainty? The setting is particularly well-suited to this question because DSM-V diagnostic guidelines for ADHD are explicitly gender-neutral, making any gender difference in physician thresholds a detectable deviation from uniform clinical rules.

The data come from de-identified electronic health records from a large Arizona healthcare system covering January 2014 through September 2017. The sample encompasses 36,193 unique encounters for approximately 11,070 pediatric patients. The raw male-to-female diagnostic ratio in the data is 2.32:1 (7.2% of males vs. 3.1% of females receive a clinical ADHD diagnosis). This gap persists after controlling for demographics, general healthcare utilization, and mental health utilization in reduced-form regressions, motivating the structural approach.

Because two key variables — whether a patient received a behavioral assessment (Qi) and the ADHD match signal observed by the physician (xi) — are not directly recorded in the EHR, the author constructs them from clinical doctor note text. A random forest machine learning classifier trained on labeled appointments predicts behavioral assessment take-up for unlabeled encounters; approximately 20.8% of children are predicted to have received a behavioral assessment (23.2% of males vs. 18.3% of females). The ADHD match signal is constructed via an adjusted Bag-of-Words cosine similarity measure comparing each patient’s aggregated note text to the DSM-V symptom list, rescaled to [0,1]. The average signal is 0.319 overall, with males averaging 0.326 and females 0.311.

The structural model has three stages. First, patients/caregivers decide whether to schedule a behavioral assessment, a function of underlying latent ADHD risk (vi) and mental healthcare utilization costs (ci). Second, conditional on assessment, the physician receives a noisy signal of vi and updates beliefs via Bayesian learning; signal quality ρ governs diagnostic uncertainty. Third, the physician diagnoses ADHD if posterior risk exceeds a gender-specific diagnostic threshold τ. Population mean ADHD risk (μ) is identified using regression-adjusted initial primary care provider referral rates as a quasi-exogenous cost-shifter — patients of high-referral-rate providers select into assessment less selectively, so their observed signals approach population mean risk. This extrapolation approach follows Arnold et al. (2022).
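The three stages can be sketched as a small Monte Carlo simulation. This is an illustrative sketch, not the paper’s estimator. The selection rule (schedule when v_i exceeds c_i), the normal cost distribution, the additive signal x_i = v_i + noise, the use of the population prior in the physician’s update, and the `sigma` and `cost_sd` values are all assumptions made here. Only the roles of μ, c, ρ, and τ come from the description above.

```python
import math
import random


def simulate_group(mu, sigma, mean_cost, cost_sd, rho, tau, n=20000, seed=0):
    """Simulate assessment and diagnosis rates for one gender group.

    Hypothetical simplifications: v ~ N(mu, sigma^2), c ~ N(mean_cost, cost_sd^2),
    a caregiver schedules an assessment when v > c, and the physician compares
    the posterior mean of v (population prior) with the threshold tau.
    """
    rng = random.Random(seed)
    # Noise s.d. chosen so that corr(x, v) equals rho when x = v + noise.
    noise_sd = sigma * math.sqrt(1.0 / rho ** 2 - 1.0)
    assessed = 0
    diagnosed = 0
    for _ in range(n):
        v = rng.gauss(mu, sigma)            # latent ADHD risk
        c = rng.gauss(mean_cost, cost_sd)   # utilization cost
        if v <= c:                          # stage 1: no assessment scheduled
            continue
        assessed += 1
        x = v + rng.gauss(0.0, noise_sd)    # stage 2: noisy signal
        posterior = mu + rho ** 2 * (x - mu)
        if posterior > tau:                 # stage 3: threshold rule
            diagnosed += 1
    return assessed / n, diagnosed / n


# Point estimates from the summary for males; sigma and cost_sd are made up.
assess_rate, diag_rate = simulate_group(
    mu=0.290, sigma=0.10, mean_cost=0.116, cost_sd=0.20, rho=0.479, tau=0.257
)
print(f"assessed: {assess_rate:.3f}, diagnosed: {diag_rate:.3f}")
```

Because the made-up `sigma` and `cost_sd` drive the output, the printed rates are not expected to match the rates reported in the paper. The sketch only shows how the parameters interact: holding the random draws fixed, raising `tau` can only lower the simulated diagnosis rate.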

The structural parameter estimates show that male and female children have similar mean ADHD risk, modestly higher for males (μm = 0.290 vs. μf = 0.262), and nearly identical mean utilization costs (cm = 0.116 vs. cf = 0.109). The most striking differences are in the physician parameters. Signal quality is lower for male patients (ρm = 0.479 vs. ρf = 0.552), indicating higher diagnostic uncertainty for boys. Diagnostic thresholds are also substantially lower for male patients (τm = 0.257 vs. τf = 0.312), meaning physicians are willing to diagnose ADHD in boys at a lower posterior risk.

Counterfactual decomposition simulations attribute approximately 20–25% of the 2.32:1 diagnostic gap to underlying differences in ADHD risk, approximately 20% to differences in selection into behavioral assessments, and the remaining majority — approximately 55–60% — to physician decision-making. Within physician decision-making, differences in diagnostic thresholds alone account for roughly two-thirds of the overall diagnostic gap.

The paper offers economic rationales for why gender-specific thresholds may be consistent with physician rationality despite uniform guidelines: higher diagnostic uncertainty for boys justifies lower thresholds under Bayesian updating; hyperactive/impulsive symptoms predominant in boys impose larger classroom externalities (Aizer, 2008); and female patients show higher rates of internalizing co-morbidities (anxiety, depression) that may reduce the marginal benefit of an additional ADHD diagnosis. A type-specific threshold extension finds that for male patients the threshold for hyperactive/impulsive symptoms is significantly lower than for inattentive symptoms, consistent with salience of externally disruptive behaviors. These rationalizations do not vindicate the gap as fully guideline-consistent, but suggest physicians may be responding to real heterogeneity in external costs and co-morbidity patterns.

Q: What is the main research question and why is ADHD a useful setting? A: The paper asks what mechanisms produce the 2.3:1 male-to-female ADHD diagnostic difference: true symptom prevalence, patient utilization costs, or physician decision-making. ADHD is well-suited because (1) clinical guidelines (DSM-V) are explicitly gender-neutral and require the same symptom count threshold regardless of sex; (2) diagnosis is based on subjective behavioral assessment rather than objective testing, creating substantial physician discretion; and (3) both missed and excess diagnosis carry meaningful costs — missed diagnosis limits educational accommodations; excess diagnosis exposes children to Schedule II controlled substances.

Q: What data does the paper use and what are the key descriptive facts? A: The data are de-identified electronic health records from a large Arizona healthcare system, 2014–2017, covering 36,193 encounters for 11,070 pediatric patients aged 5 and above. Overall ADHD diagnosis rate is 5.2%, with males at 7.2% and females at 3.1%, a 2.32:1 ratio that matches national levels. Approximately 49.5% of the sample is Hispanic, which the author notes contributes to a below-national-average overall diagnosis rate. The gender diagnostic gap persists even after controlling for demographics, general healthcare utilization, and mental health utilization in reduced-form regressions.

Q: How does the paper construct the behavioral assessment indicator (Q_i) and the ADHD match signal (x_i)? A: Q_i is constructed using a random forest classifier trained on doctor notes from appointments where assessment status is known with near-certainty (ADHD diagnosis or DSM-V comorbid diagnosis = positive; non-mental-health diagnosis code for patients with no mental health history = negative). The classifier uses 41 features including note length and top-20 word frequencies for each label class. x_i is constructed via an adjusted Bag-of-Words cosine similarity between each patient’s combined behavioral assessment notes and the DSM-V symptom list, separately for inattentive and hyperactive/impulsive sub-types, taking x_i = max{x_{i1}, x_{i2}}. The average x_i is 0.319 (males 0.326, females 0.311) in the behavioral assessment subsample.

Q: What is the identification strategy for recovering population mean ADHD risk (μ)? A: Because xi is observed only for endogenously selected patients, the observed sample mean overestimates population mean risk. The author uses regression-adjusted referral rates of each patient’s initial primary care provider (IPCP) as a quasi-exogenous cost-shifter satisfying (a) relevance — IPCP referral intensity lowers patient scheduling costs — and (b) independence from patient ADHD risk vi, since IPCPs are typically chosen before behavioral symptoms develop and only 28% of IPCPs in the sample ever diagnose ADHD themselves. Population mean risk is then recovered by extrapolating the relationship between IPCP referral propensity and average observed xi to propensity = 1, following Arnold et al. (2022). The maximum observed IPCP referral propensity is only about 0.75, so the estimate requires extrapolation beyond the observed support.
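The extrapolation step can be illustrated with a simple least-squares line. This is a sketch under stated assumptions: the linear functional form is chosen here for simplicity (the summary does not say which form the paper uses), and the provider-level data points are made up so that the arithmetic is easy to follow.

```python
def extrapolate_mean_risk(propensities, mean_signals, target=1.0):
    """Fit mean observed signal on referral propensity by OLS and
    evaluate the fitted line at `target` (propensity = 1 by default)."""
    n = len(propensities)
    p_bar = sum(propensities) / n
    x_bar = sum(mean_signals) / n
    sxx = sum((p - p_bar) ** 2 for p in propensities)
    sxy = sum((p - p_bar) * (x - x_bar)
              for p, x in zip(propensities, mean_signals))
    slope = sxy / sxx
    intercept = x_bar - slope * p_bar
    return intercept + slope * target


# Hypothetical IPCP-level data: higher referral propensity means less
# selective patients, so the average observed signal falls toward mu.
propensities = [0.15, 0.30, 0.45, 0.60, 0.75]   # observed support ends at 0.75
mean_signals = [0.382, 0.364, 0.346, 0.328, 0.310]

mu_hat = extrapolate_mean_risk(propensities, mean_signals)
print(round(mu_hat, 3))  # about 0.28 for this made-up, exactly linear data
```

The gap between the largest observed propensity (0.75) and the evaluation point (1.0) is where the functional-form assumption does all the work, which is why the reliance on extrapolation beyond the observed support matters.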

Q: What are the estimated structural parameters and what do they imply? A: Mean ADHD risk is μm = 0.290 vs. μf = 0.262 — males have modestly higher underlying risk. Mean utilization costs are cm = 0.116 vs. cf = 0.109 — nearly identical across genders. Signal quality (diagnostic certainty) is lower for males: ρm = 0.479 vs. ρf = 0.552, indicating physicians face more diagnostic uncertainty when assessing boys. Most importantly, diagnostic thresholds are lower for males: τm = 0.257 vs. τf = 0.312, meaning physicians diagnose ADHD in boys at a lower required posterior risk level, consistent with viewing missed diagnosis as relatively more costly for male patients.

Q: How much of the 2.32:1 diagnostic gap can be attributed to each mechanism? A: Counterfactual simulations decompose the gap as follows: differences in underlying ADHD risk distribution account for approximately 20–25% of the diagnostic difference; differences in selection into behavioral assessments (utilization costs operating through assessment rates) account for approximately 20%; and physician decision-making differences account for the remaining majority, approximately 55–60%. Within physician factors, differences in diagnostic thresholds (τm < τf) are the single largest contributor, explaining roughly two-thirds of the overall male/female diagnostic gap.
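A counterfactual decomposition of this kind can be sketched generically: move one parameter at a time from one group’s value to the other’s and record how much of the total change each move produces. The function below is a minimal sketch, not the paper’s procedure. The summary does not state the ordering used or whether shares are averaged across orderings, and `toy_rate` is a hypothetical stand-in for a simulated diagnosis rate.

```python
def sequential_decomposition(rate_fn, base, target, order):
    """Swap parameters from `base` to `target` one at a time, in `order`,
    and return each swap's share of the total change in rate_fn."""
    current = dict(base)
    previous = rate_fn(current)
    total = rate_fn(target) - previous
    if total == 0:
        raise ValueError("no gap to decompose")
    shares = {}
    for name in order:
        current[name] = target[name]
        updated = rate_fn(current)
        shares[name] = (updated - previous) / total
        previous = updated
    return shares


def toy_rate(params):
    # Hypothetical outcome with an interaction between two parameters.
    return params["a"] * params["b"]


base = {"a": 1.0, "b": 1.0}
target = {"a": 2.0, "b": 3.0}

print(sequential_decomposition(toy_rate, base, target, ["a", "b"]))
print(sequential_decomposition(toy_rate, base, target, ["b", "a"]))
```

In the toy example the two orderings assign different shares to the same parameter, because the parameters interact. Any decomposition of a nonlinear model has this property, so reported shares should be read alongside the sequence in which channels were equalized.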

Q: What do the type-specific threshold estimates reveal? A: When the baseline model is extended to allow separate diagnostic thresholds for inattentive vs. hyperactive/impulsive symptom sub-types, male patients show significantly lower thresholds for hyperactive/impulsive symptoms relative to inattentive symptoms (τ^HI_m < τ^Inatt_m). This is consistent with the hypothesis that more externally salient and disruptive symptoms carry larger classroom externalities, which physicians may implicitly factor into diagnosis decisions (following Aizer, 2008). For female patients, the threshold differences across symptom types are smaller and less statistically significant.

Q: What economic rationales does the paper offer for gender-specific diagnostic thresholds despite uniform guidelines? A: Three mechanisms are identified. First, higher diagnostic uncertainty for males (lower ρm) implies that under symmetric costs, Bayesian-rational physicians should set lower thresholds when the signal is noisier — this alone partially rationalizes the threshold gap. Second, hyperactive/impulsive symptoms predominant in boys impose greater externalities on classroom peers (Aizer, 2008), increasing the social benefit of diagnosis for boys on the margin. Third, females show substantially higher rates of co-morbid internalizing conditions (anxiety, depression) whose treatment may mitigate ADHD-related behaviors or whose interaction with stimulant medication makes the marginal ADHD diagnosis less beneficial for girls (Currie et al., 2014). These factors together suggest physicians may be responding to genuine heterogeneity in net diagnosis benefits, even if their behavior deviates from gender-neutral clinical guidelines.

Q: What share of the 2.3:1 national diagnostic gap is consistent with genuine symptom prevalence differences? A: Simulations indicate that only about 20–25% of the 2.32:1 male/female diagnostic difference can be explained by the underlying difference in ADHD risk distributions. The majority — roughly 75–80% — reflects factors beyond true prevalence: selection into care and, most substantially, physician decision-making differences including both signal quality and diagnostic thresholds.

Q: What are the policy implications? A: The findings suggest that targeted interventions in physician awareness and clinical training are likely more effective than generic awareness campaigns, since the dominant driver of the diagnostic gap is physician threshold-setting rather than symptom prevalence. Structured decision support tools or updated training that make physicians aware of gender-specific diagnostic patterns could reduce medically unwarranted diagnostic differences. Policies targeting patient-side access barriers (the ~20% explained by selection) remain relevant but secondary. The roughly 20–25% of the gap attributable to genuine symptom prevalence differences is, by construction, guideline-consistent and should not be targeted for elimination.

Q: What are the methodological contributions? A: The paper makes three methodological contributions. First, it develops a structural model of mental health diagnosis that explicitly incorporates endogenous patient selection, a feature absent from standard physician decision-making models that is shown to be empirically important. Second, it applies machine learning and NLP to the text of clinical doctor notes to construct key unobserved clinical variables (the behavioral assessment indicator and the ADHD match signal) that are unavailable as structured data in EHRs. Third, it identifies population mean health risk using quasi-exogenous variation (IPCP referral rates), analogous to the method of Arnold et al. (2022) for measuring racial discrimination in bail decisions, adapted here to a continuous health risk setting with endogenous selection.

Diagnostic threshold (τ_θ): The gender-specific posterior ADHD risk level above which a physician chooses to diagnose ADHD. Set ex-ante, it reflects the physician’s perceived tradeoff between the costs of over-diagnosis (misdiagnosis) and under-diagnosis (missed diagnosis). A lower threshold implies the physician views missed diagnosis as relatively more costly for that patient group. By construction, uniform clinical guidelines imply a single threshold independent of patient gender.

ADHD match signal (x_i): A physician-observed, noisy signal of a patient’s true latent ADHD risk (v_i), observed only conditional on the patient receiving a behavioral assessment. In estimation, it is proxied via a cosine similarity measure between the patient’s aggregated clinical doctor note text and the DSM-V symptom list, constructed separately for inattentive and hyperactive/impulsive sub-types.

Signal quality / diagnostic uncertainty (ρ_θ): The correlation between the physician’s observed ADHD match signal and the patient’s true ADHD risk. Higher ρ means the physician’s signal is more informative and diagnostic uncertainty is lower. In the Bayesian updating framework, higher ρ implies the physician places more weight on the observed signal relative to the prior.
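The weighting statement can be made concrete with the textbook normal-normal update. The additive-noise signal structure below is an assumption made for illustration. The summary does not spell out the paper’s exact signal equation, and σ_ε is a symbol introduced here.

```latex
% Assumed: v_i \sim N(\mu_\theta, \sigma_\theta^2), \quad x_i = v_i + \varepsilon_i,
% \varepsilon_i \sim N(0, \sigma_\varepsilon^2) independent of v_i.
\begin{align*}
\rho_\theta &= \operatorname{Corr}(x_i, v_i)
  = \frac{\sigma_\theta}{\sqrt{\sigma_\theta^2 + \sigma_\varepsilon^2}}, \\
\mathbb{E}[v_i \mid x_i] &= \mu_\theta
  + \frac{\operatorname{Cov}(v_i, x_i)}{\operatorname{Var}(x_i)}\,(x_i - \mu_\theta)
  = (1 - \rho_\theta^2)\,\mu_\theta + \rho_\theta^2\, x_i .
\end{align*}
```

Under this assumption the weight on the signal is ρ², which rises with ρ. As ρ approaches 0 the posterior mean collapses to the prior mean μ, and as ρ approaches 1 it tracks the signal exactly.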

Mental healthcare utilization cost (c_i): The composite of all patient/caregiver factors that affect the decision to schedule a behavioral assessment net of child symptom level. Includes non-monetary barriers such as time constraints, distance, stigma, and information from primary care providers during wellness visits; does not include monetary out-of-pocket costs since insurance typically covers behavioral assessments.

Initial Primary Care Provider (IPCP) referral rate: The regression-adjusted share of a given PCP’s patients who ultimately receive a behavioral assessment at some point in the sample. Used as a quasi-exogenous cost-shifter that influences patient scheduling costs without being correlated with patient ADHD risk, enabling identification of population mean ADHD risk via extrapolation.

Latent ADHD risk (v_i): An unobserved continuous measure of a child’s underlying ADHD-related behavioral symptoms, drawn from a gender-specific normal distribution N(μ_θ, σ²_θ). A child’s true ADHD status is S_i = 1(v_i > v̄), where v̄ is the DSM-V minimum symptom threshold, defined identically for boys and girls.

Adjusted Bag-of-Words (BOW) cosine similarity: The NLP method used to construct the ADHD match signal proxy. Patient notes are tokenized into uni-grams and bi-grams after preprocessing (spell check, abbreviation replacement, part-of-speech tagging, synonym replacement), and tf-idf weighted. The cosine similarity between the resulting document vector and the DSM-V symptom text vector is computed separately for each ADHD sub-type and rescaled to [0,1].
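The core of this measure can be sketched with the standard library alone. This is a simplified illustration, not the paper’s pipeline: it skips the preprocessing steps (spell check, abbreviation and synonym replacement, part-of-speech tagging) and the paper’s “adjusted” rescaling, uses a common smoothed idf formula chosen here, and the example strings are invented placeholders rather than DSM-V text or real clinical notes.

```python
import math
from collections import Counter


def ngrams(text):
    """Lower-cased uni-grams and bi-grams from whitespace-separated text."""
    words = text.lower().split()
    return words + [" ".join(pair) for pair in zip(words, words[1:])]


def tfidf_vectors(documents):
    """Return one sparse tf-idf vector (dict) per document."""
    counts = [Counter(ngrams(doc)) for doc in documents]
    n_docs = len(documents)
    doc_freq = Counter(term for c in counts for term in c)
    # Smoothed idf (an assumed variant): terms in every document keep weight 1.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0
           for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Invented placeholder texts, for illustration only.
symptoms = "often fails to pay close attention easily distracted"
note_a = "patient easily distracted in class"
note_b = "sore throat and fever"

sym_vec, a_vec, b_vec = tfidf_vectors([symptoms, note_a, note_b])
print(cosine(sym_vec, a_vec) > cosine(sym_vec, b_vec))
```

Because tf-idf weights are non-negative, the raw cosine already lies in [0, 1]. A note that shares no terms with the symptom text scores exactly zero, which is one reason synonym and abbreviation replacement matter before vectorizing.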

How this summary was made. Bibliographic fields are pulled from Crossref and OpenAlex and are not model-generated. The summary was drafted from the open-access manuscript, checked by a claim-grounding and calibration review pass, and approved before publishing. Corrections are welcome, especially from the authors.