M54 | Macro Paper Warehouse

Defying Distance? The Provision of Medical Services in the Digital Age

This paper asks whether digital platforms can improve healthcare outcomes by enabling needs-based matching between patients and physicians unconstrained by geography. Amanda Dahlstrand studies digital primary care in Sweden during 2016-2018, exploiting nationwide conditional random assignment between approximately 200,000 patients and 143 doctors employed by Europe’s largest digital primary care provider. Patients who selected the “first available doctor” option (82% of first visits) were effectively randomized to a doctor within each 3-hour shift-by-date stratum, generating quasi-experimental variation free of the patient-doctor sorting that confounds identification in physical primary care.

The paper defines three observable dimensions of primary care physician skill: (1) identifying risky patients and triaging them to higher levels of care, measured by whether patients subsequently have an avoidable hospitalization within 90 days; (2) providing guideline-consistent treatment, measured by counter-guideline antibiotic prescriptions; and (3) leaving patients sufficiently informed so they do not unnecessarily seek additional in-person care within the following week. Doctor skill in each dimension is estimated via a value-added framework in a hold-out sample (Sample 1, the first 600 randomized consultations per doctor), using empirical Bayes shrinkage to reduce noise. Complementarities between doctor skill and patient risk are then estimated in a disjoint main sample (Sample 2).

A central finding is that doctor skill is task-specific rather than governed by a single latent ability: skills across the three tasks are not positively correlated, meaning doctors within general practice have individual “specializations.” A patient ranked in the top 1% of avoidable hospitalization risk who is matched to a doctor ranked in the top 10% at reducing avoidable hospitalizations experiences a 90% reduction in that adverse outcome, relative to a patient with the same risk profile matched to the worst-performing doctor. Patients not estimated as risky show effects indistinguishable from zero when matched to the same high-skilled doctors, establishing a strong complementarity between doctor type and patient risk.

Using the Average Match Function framework of Graham, Imbens, and Ridder (2014, 2020), the paper evaluates counterfactual reallocation policies. Reallocating only 2% of patients — those in the top 1% of predicted avoidable hospitalization risk — to doctors in the top 10% of triage skill reduces aggregate avoidable hospitalizations by 20% relative to random assignment, without adversely affecting counter-guideline prescriptions or other measured outcomes. Doctor skills across outcomes are not positively correlated, so this reallocation does not generate meaningful trade-offs. The paper benchmarks this matching policy against a selective hiring/expansion policy in which doctors with above-median skill in three tasks expand their hours by up to 70% at the expense of below-median peers; that policy yields no significant reduction in avoidable hospitalizations and only a 4% reduction in counter-guideline prescriptions — smaller gains than matching and harder to implement.

The paper also documents that physical primary care quality is worse in lower-income and more deprived areas of Sweden (a negative relationship between deprivation index and patient-reported experience is statistically significant at the 1% level in a cross-section of roughly 120-150 primary care centers in Region Skane). Because the estimated risk of avoidable hospitalization and prior avoidable hospitalizations are concentrated in the lower end of the income distribution, needs-based digital matching reallocates triage skill toward lower-income patients, severing the correlation between local area income and service quality. Simulating positive assortative matching on patient income and doctor skill — approximating existing healthcare inequalities — leads to more avoidable hospitalizations than random assignment, because the most vulnerable patients tend to be the poorest. Scope conditions: findings derive from a single digital primary care provider in Sweden, 2016-2018, pre-pandemic, covering conditions amenable to video consultation and a patient pool younger and somewhat more urban than the average Swedish citizen.

Q: What is the key identification strategy, and why is it valid in this setting but not in physical primary care? A: Patients who selected the “drop in” (first available doctor) option — 82% of first visits — were assigned to whichever certified doctor was next in the roster within a 3-hour shift-by-date stratum, a by-product of the first-come-first-served queue. Neither patients nor doctors could intervene in this digital process. The author validates the assumption by regressing doctor characteristics on patient characteristics controlling for shift-by-date fixed effects and finds characteristics are balanced. In physical primary care, endemic patient-doctor sorting means doctors do not meet a common support of patient types, preventing causal identification of doctor effects.
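The balance check described in this answer can be illustrated with a minimal sketch. It assumes one patient characteristic and one doctor characteristic (patient age and doctor years of experience, both hypothetical choices) and removes stratum means, which in a one-regressor OLS is equivalent to including shift-by-date fixed effects. The data and the function name `within_stratum_slope` are invented for illustration and are not taken from the paper.

```python
from collections import defaultdict

def within_stratum_slope(rows):
    """OLS slope of a doctor trait on a patient trait after removing stratum means.

    rows: iterable of (stratum, patient_value, doctor_value).
    Demeaning within strata is equivalent to adding stratum fixed effects.
    """
    by_stratum = defaultdict(list)
    for stratum, patient_value, doctor_value in rows:
        by_stratum[stratum].append((patient_value, doctor_value))

    cross, square = 0.0, 0.0
    for pairs in by_stratum.values():
        mean_p = sum(p for p, _ in pairs) / len(pairs)
        mean_d = sum(d for _, d in pairs) / len(pairs)
        for p, d in pairs:
            cross += (p - mean_p) * (d - mean_d)
            square += (p - mean_p) ** 2
    return cross / square

# Invented data: (shift-by-date stratum, patient age, doctor years of experience).
# Within each stratum, every patient age meets every doctor type equally often.
balanced_rows = [
    ("mon-am", 30, 5), ("mon-am", 40, 5), ("mon-am", 30, 15), ("mon-am", 40, 15),
    ("tue-pm", 50, 2), ("tue-pm", 70, 2), ("tue-pm", 50, 8), ("tue-pm", 70, 8),
]
# Counter-example: older patients always see the more experienced doctor.
sorted_rows = [
    ("mon-am", 30, 5), ("mon-am", 40, 15), ("mon-am", 30, 5), ("mon-am", 40, 15),
]

balanced_slope = within_stratum_slope(balanced_rows)
sorted_slope = within_stratum_slope(sorted_rows)
print(balanced_slope, sorted_slope)
```

In the balanced example, patient age and doctor experience differ across strata but are unrelated within them, which is all that conditional random assignment requires. In the sorted example the within-stratum slope is clearly non-zero, the pattern a balance test is meant to detect.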

Q: How are doctor skill estimates constructed and why does the split-sample matter? A: Doctor skill in each task is estimated as an empirical Bayes-shrunk random effect from a value-added regression on Sample 1, each doctor’s first 600 randomized consultations (40% of the sample). Sample 2 (60%) is entirely disjoint and used to estimate complementarities between doctor skill and patient risk. The split-sample design prevents overfitting: doctor skill was estimated on different patients than those in Sample 2. The Durbin-Wu-Hausman test does not reject random effects (p = 0.16).
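As a rough sketch of the split-sample step, the snippet below sends each doctor's earliest consultations to a hold-out sample and the rest to the main sample. The tuple layout, the function name `split_samples`, and the toy data are assumptions for illustration; only the 600-consultation cutoff, the 143 doctors, and the 210,171 patients come from the summary.

```python
from collections import defaultdict

def split_samples(consultations, holdout_per_doctor=600):
    """Split consultations into a hold-out sample and a main sample.

    consultations: iterable of (doctor_id, timestamp, patient_id).
    Each doctor's earliest `holdout_per_doctor` visits go to sample 1.
    """
    by_doctor = defaultdict(list)
    for doctor_id, timestamp, patient_id in consultations:
        by_doctor[doctor_id].append((timestamp, patient_id))

    sample1, sample2 = [], []
    for doctor_id, visits in by_doctor.items():
        visits.sort()  # chronological order within doctor
        for rank, (_, patient_id) in enumerate(visits):
            target = sample1 if rank < holdout_per_doctor else sample2
            target.append((doctor_id, patient_id))
    return sample1, sample2

# Toy data with a cutoff of 2 instead of 600.
toy = [
    ("d1", 3, "p3"), ("d1", 1, "p1"), ("d1", 2, "p2"),
    ("d2", 1, "p4"), ("d2", 2, "p5"), ("d2", 3, "p6"),
]
sample1, sample2 = split_samples(toy, holdout_per_doctor=2)
print(sample1, sample2)

# Consistency check on the reported figures: 143 doctors x 600 visits out of 210,171.
holdout_share = 143 * 600 / 210_171
print(round(holdout_share, 3))
```

Because only first visits are used, each patient appears once, so the two samples contain different patients by construction. The final lines show that 143 doctors with 600 hold-out consultations each account for roughly 41% of the 210,171 patients, in line with the reported 40%.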

Q: What is the main quantitative result on avoidable hospitalization matching? A: Matching a patient ranked in the top 1% of predicted avoidable hospitalization risk to a doctor ranked in the top 10% at reducing avoidable hospitalizations could reduce that patient’s avoidable hospitalizations by 90%, relative to a match with the worst-performing doctor in that skill. At the aggregate level, reallocating only 2% of patients (those in the top 1% risk group) to high-triage-skill doctors reduces avoidable hospitalizations across the full patient population by 20% compared to random assignment.

Q: Does the avoidable hospitalization reallocation harm other outcomes? A: No. The paper explicitly evaluates the Average Reallocation Effect on counter-guideline prescriptions and additional in-person care seeking when optimizing for avoidable hospitalizations, and finds no significant adverse effects on these other outcomes. The author attributes this to the fact that doctor skills across tasks are not positively correlated, so reallocating triage-skilled doctors does not systematically remove skill from other dimensions.

Q: How does matching compare to selective hiring and hour expansion as a policy? A: Even expanding the working hours of doctors with above-median skill across three tasks by as much as 70% yields no significant reduction in avoidable hospitalizations and only a 4% reduction in counter-guideline prescriptions — both smaller gains than the matching policy. Matching outperforms hiring expansion because patients have heterogeneous needs that can be identified from prior healthcare records, and doctors have differentiated skill sets relevant to some patients but not others.

Q: What is the evidence that doctor skills are task-specific rather than reflecting a single latent ability? A: The estimated doctor effects across the three tasks — triaging to avoid hospitalizations, guideline-consistent antibiotic prescribing, and minimizing unnecessary follow-up care — are not positively correlated with one another. This means a doctor who is effective at one task is not systematically effective at others, indicating individual specializations within general practice that are not accounted for in standard primary care organization.

Q: How is patient risk for avoidable hospitalizations measured? A: A propensity score is estimated from pre-digital physical healthcare data (2013-2015) by regressing the number of past avoidable hospitalizations on demographic and healthcare utilization variables, including age, a disease index of chronic diagnoses, and previous hospitalizations. All of these variables are already available in patient medical records. Patients whose predicted risk scores fall in the top 1% are classified as “risky.” Patients in the risky group had on average 0.35 avoidable hospitalizations in the prior 3 years, versus 0.01 for non-risky patients.
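The scoring-and-flagging logic can be sketched as follows. This is a deliberately simplified stand-in: it fits a one-covariate least-squares line (a hypothetical disease index) rather than the multi-covariate regression the summary describes, and all data are synthetic. The helper names `fit_line` and `flag_top_percent` are illustrative.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for a single regressor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x

def flag_top_percent(scores, top_percent=1):
    """Indices of the patients whose score falls in the top `top_percent` percent."""
    k = max(1, len(scores) * top_percent // 100)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])

# Synthetic patients: a disease index and a count of past avoidable hospitalizations.
disease_index = [i / 10 for i in range(200)]
past_hospitalizations = [1 if i >= 190 else 0 for i in range(200)]

slope, intercept = fit_line(disease_index, past_hospitalizations)
risk_scores = [intercept + slope * x for x in disease_index]
risky = flag_top_percent(risk_scores, top_percent=1)
print(sorted(risky))
```

With 200 synthetic patients, the top 1% is two patients. In practice the score would be built from several record-based covariates, but the ranking and cutoff step is the same.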

Q: What is the distributional (equity) implication of needs-based matching versus income-assortative matching? A: Estimated risk of avoidable hospitalization and the count of prior avoidable hospitalizations are concentrated in the lower end of the income distribution. Needs-based matching therefore reallocates triage skill toward lower-income patients. Simulating positive assortative matching on patient income and doctor skill — approximating observed inequalities in physical care — produces more avoidable hospitalizations than random assignment, because the most vulnerable patients are often the poorest. Needs-based digital matching can sever the link between local area income and service quality.

Q: How does digital care usage sort by income and demographics in the data? A: At the extensive margin, the deprivation index (Care Need Index) is similar among digital users and non-users in Region Skane. However, at the intensive margin, individuals with a higher deprivation index who use the digital service have more appointments in it; similarly, lower-income users use the service more intensively. Digital care users are younger than non-users and are more likely to live in cities than the average Swedish citizen.

Q: What are avoidable hospitalizations and why are they the primary outcome? A: Avoidable hospitalizations (also called hospitalizations for ambulatory care sensitive conditions) are hospital admissions defined in the medical literature as preventable by adequate and timely primary care. They are coded using ICD-10 diagnosis codes listed in Page et al. (2007). The most common diagnoses in the 90-day post-consultation window are respiratory and genitourinary, conditions commonly treated in digital care. The outcome is rare (0.2% of patients in the sample), but high-stakes: an estimated 1.1 potential life years are lost per avoidable hospitalization, and in Sweden they cost an estimated SEK 7.1 billion (~$820 million) annually (7% of inpatient curative and rehabilitative care costs).

Q: What is the scope of the counter-guideline antibiotic prescription outcome? A: Non-adherence is coded against 16 guidelines from Sweden’s strategic programme against antibiotic resistance (Strama 2017, 2019), all designed to limit or narrow antibiotic use. The measured rate of non-adherence is described as quite low by international standards; the CDC estimates 28% of US antibiotic prescriptions are unnecessary, while the author’s sample rate is 2%. The guidelines require doctors to sometimes refuse patients who request antibiotics, introducing a behavioral compliance dimension to this skill.

Q: What are the costs and feasibility considerations for implementing needs-based digital matching? A: The paper characterizes matching as a “resource-neutral” policy because it reallocates existing doctors without hiring or training. The primary costs are a small increase in waiting time for some patients and the costs of importing data and developing the matching algorithm. Because the algorithm handles patient-doctor allocation while doctors retain all clinical decision-making, the policy functions as a complement to human skill rather than a substitute, which the author argues makes it less subject to “algorithm aversion.”

Q: Why does the paper restrict to each patient’s first digital consultation only? A: The first visit is the one subject to conditional random assignment; subsequent visits could reflect endogenous selection by patients who preferred a particular doctor or outcome. Using only first visits eliminates this concern. The restriction reduces the sample from approximately 378,000 to 210,171 patients (56% of the original), paired with 143 doctors who each had at least 600 randomized consultations.

Conditional random assignment: The allocation mechanism by which patients selecting the “first available doctor” option in digital primary care were assigned to whichever certified doctor was next in the shift roster, conditional on 3-hour shift-by-date strata — a by-product of the first-come-first-served queue rather than an intended experimental design.

Average Match Function (AMF): The conditional mean of a patient outcome given observable doctor type and patient type under random assignment, β(x,w) = E[Y|X=x, W=w], which serves as the building block for evaluating counterfactual reallocation policies.

Average Reallocation Effect (ARE): The difference in expected patient outcomes between a counterfactual doctor-patient assignment and the status quo random assignment, taking into account the externality on the patient from whom a high-skilled doctor is moved.
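To make the AMF and ARE definitions concrete, here is a minimal sketch with two patient types and two doctor types. The AMF values are invented and are not the paper's estimates, so the resulting percentage change is not meant to reproduce the reported 20%. The function names are illustrative. The counterfactual keeps each doctor type's workload fixed, so moving risky patients to high-skill doctors displaces some non-risky patients, which is the externality the ARE accounts for.

```python
def amf_from_records(records):
    """Average Match Function as cell means of the outcome by (patient, doctor) type."""
    totals, counts = {}, {}
    for patient_type, doctor_type, outcome in records:
        cell = (patient_type, doctor_type)
        totals[cell] = totals.get(cell, 0.0) + outcome
        counts[cell] = counts.get(cell, 0) + 1
    return {cell: totals[cell] / counts[cell] for cell in totals}

def random_assignment(patient_shares, doctor_shares):
    """Joint match shares when patients and doctors are matched independently."""
    return {(p, d): ps * ds
            for p, ps in patient_shares.items()
            for d, ds in doctor_shares.items()}

def expected_outcome(amf, match_shares):
    """Population mean outcome implied by an AMF and a set of match shares."""
    return sum(share * amf[cell] for cell, share in match_shares.items())

# Invented AMF: avoidable-hospitalization rate per (patient type, doctor type) cell.
amf = {
    ("risky", "high"): 0.02, ("risky", "other"): 0.12,
    ("nonrisky", "high"): 0.001, ("nonrisky", "other"): 0.001,
}
patient_shares = {"risky": 0.01, "nonrisky": 0.99}
doctor_shares = {"high": 0.10, "other": 0.90}

status_quo = random_assignment(patient_shares, doctor_shares)
# Counterfactual: all risky patients see high-skill doctors; capacity is unchanged.
counterfactual = {
    ("risky", "high"): 0.01, ("risky", "other"): 0.0,
    ("nonrisky", "high"): 0.09, ("nonrisky", "other"): 0.90,
}

are = expected_outcome(amf, counterfactual) - expected_outcome(amf, status_quo)
print(are)

toy_amf = amf_from_records([("risky", "high", 0), ("risky", "high", 1),
                            ("nonrisky", "other", 0)])
print(toy_amf)
```

A negative ARE here means fewer avoidable hospitalizations under the counterfactual. Because the invented AMF is flat for non-risky patients, displacing them costs nothing, which mirrors the complementarity the paper reports.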

Task-specific doctor skill: The paper’s finding that primary care physician effectiveness is not governed by a single latent ability but varies across distinct tasks — triage/risk prediction, guideline-consistent prescribing, and minimizing unnecessary follow-up care — with skills across tasks not positively correlated.

Avoidable hospitalization: A hospital admission coded to a diagnosis (per Page et al. 2007 ICD-10 classification) defined in the medical literature as preventable by adequate and timely primary care, used as the primary high-stakes outcome measure (0.2% incidence in the sample within 90 days of a digital consultation).

Counter-guideline prescription: A prescription of an antibiotic in violation of one of 16 guidelines from Sweden’s Strama antibiotic resistance programme, all of which are designed to limit use or require narrower-spectrum first-line antibiotics; used as the primary guideline-adherence outcome (2% incidence in the sample).

Empirical Bayes shrinkage: A procedure applied to raw doctor value-added estimates in which the noisy estimate of doctor quality is multiplied by the ratio of signal variance to total (signal plus noise) variance, yielding a best linear predictor of the underlying doctor random effect and reducing noise from small-sample estimation.
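The shrinkage rule in this definition can be written out in a few lines. The sketch below assumes the signal variance is estimated by a simple method of moments (variance of the raw estimates minus the average noise variance), which is a common choice but is not confirmed by the summary as the paper's exact procedure. All numbers and function names are illustrative.

```python
import statistics

def shrink(raw_estimate, signal_var, noise_var):
    """Multiply a noisy estimate by signal variance / (signal + noise variance)."""
    reliability = signal_var / (signal_var + noise_var)
    return reliability * raw_estimate

def estimate_signal_var(raw_estimates, noise_vars):
    """Method-of-moments guess: total variance minus average noise, floored at zero."""
    total_var = statistics.pvariance(raw_estimates)
    return max(0.0, total_var - statistics.fmean(noise_vars))

# Invented raw doctor effects and their sampling (noise) variances.
raw_effects = [0.3, -0.1, 0.1, -0.3]
noise_vars = [0.01, 0.01, 0.04, 0.04]

signal_var = estimate_signal_var(raw_effects, noise_vars)
shrunk = [shrink(r, signal_var, v) for r, v in zip(raw_effects, noise_vars)]
print(signal_var, shrunk)
```

The first and last doctors have raw estimates of equal size, but the last is measured with more noise and is therefore pulled further toward zero.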

How Do You Identify a Good Manager?

This paper develops a novel experimental method to identify the causal contribution of managers to team performance, and uses it to evaluate which characteristics predict managerial effectiveness and how manager selection mechanisms affect organizational outcomes.

The core identification challenge is that managers are not randomly assigned to teams in the field, and field managers are a highly non-random sample, making it difficult to infer which traits genuinely predict managerial performance. The authors address this by repeatedly randomly assigning managers to multiple teams in a controlled laboratory experiment, then estimating each manager’s average causal contribution to group output after conditioning on group members’ individual productive skills. The intuition is that a good manager is someone who consistently causes their team to produce more than the sum of their parts.

The experiment was conducted at the University of Essex lab with 555 participants (46% female, mean age 25, ethnically diverse) forming 728 groups of three across four rounds. Each group consisted of one manager and two workers who performed a Collaborative Production Task requiring coordination across three problem-solving modules (numerical, spatial, and analytical reasoning). The team score was the minimum module score — a weakest-link structure making coordination essential. Prior to group testing, all participants completed individual assessments of task-specific skill, fluid intelligence (CFIT), emotional perceptiveness (Reading the Mind in the Eyes Test, RMET), economic decision-making skill (the Assignment Game, which measures resource allocation under comparative advantage), Big 5 personality, and demographic characteristics. Manager selection was randomly varied at the session level: in 20 sessions, the participant with the strongest preference for leadership became manager (self-promotion); in 19 sessions, managers were assigned by lottery.

The main quantitative findings are as follows. First, there are large, stable, and statistically significant manager effects: a manager one standard deviation above average improves team performance by approximately 0.23 standard deviations (p = 0.04). This estimate is roughly 90% of the combined productive skill coefficient for the two workers (approximately 0.26 sd), indicating that a good manager is nearly twice as valuable as a good individual worker. Manager contributions predict out-of-sample group performance in a leave-one-out procedure (p < 0.01).

Second, among randomly assigned managers, only two predictors significantly explain managerial performance: fluid intelligence (CFIT) and economic decision-making skill (Assignment Game scores), both significant at below the 1% level. Gender, age, and ethnicity do not predict managerial performance.

Third, self-promoted managers perform substantially worse than lottery-assigned managers, by approximately 0.10 standard deviations — roughly equivalent to being assigned a manager with fluid intelligence one full standard deviation below average. The mechanism is overconfidence: people who strongly prefer management roles are significantly more overconfident (d = 0.41 sd, p < 0.01) and exhibit a strong negative correlation between self-reported social skills and actual emotional perceptiveness on the RMET (r = -0.37, p < 0.001). Among self-promoted managers, self-reported extraversion and political skill are negatively correlated with managerial performance (rho = -0.24 and -0.26, p < 0.05); no such negative relationship appears among lottery managers.

Fourth, selecting managers on economic decision-making skill rather than self-promotion improves average manager quality by 0.6 standard deviations — equivalent to replacing an average worker in every group with a worker at the 99th percentile of individual productivity.

The three mechanisms through which good managers improve performance are: (1) monitoring — good managers (1 sd above average) cut monitoring errors from 16% to 8%; (2) optimal task allocation according to comparative advantage — groups with optimally assigned workers score 0.52 sd higher (p < 0.01); (3) worker motivation in late-stage effort — teams led by a 1-sd-above-average manager solve 0.6 more problems in the final two minutes versus only 0.3 more in the first two minutes.

The experiment was conducted in a university lab in the UK, and the sample skews toward graduate students with limited work experience. Generalizability to field settings is supported by prior evidence that peer productivity spillover experiments yield similar magnitudes in lab versus field settings, and that the estimated manager effects are similar to Lazear et al. (2015) estimates from a large employer dataset.

Q: What is the core methodological innovation of this paper? A: The method relies on repeated random assignment of managers to multiple teams, combined with controls for individual productive skill measured prior to group work. This allows identification of each manager’s average causal contribution to group output, rather than confounding management quality with team composition or individual worker ability. The key estimand is the standard deviation of individual manager effects (sigma_alpha), interpreted as the impact of having a manager one standard deviation above average.

Q: How large is the estimated manager effect, and how does it compare to worker effects? A: A manager one standard deviation above average improves team performance by approximately 0.23 standard deviations (p = 0.04 by randomization inference). This is roughly 90% of the combined productive skill effect of both workers together (approximately 0.26 sd), implying a good manager is nearly twice as valuable as a good individual worker. Without conditioning on production skills, the manager effect rises to 0.29 sd.

Q: What characteristics predict managerial performance among randomly assigned managers? A: Only two measures predict managerial performance in the lottery arm: fluid intelligence (CFIT) and economic decision-making skill (scores on the Assignment Game), both significant at below the 1% level. These predictors are robust to controls for demographics, education, work experience, emotional perceptiveness, and personality traits. Gender, age, and ethnicity do not predict managerial performance.

Q: What is the “Assignment Game” and why is it a strong predictor? A: The Assignment Game (Caplin et al., 2024) places participants in a simulated managerial role where they must assign fictional workers to tasks. Performing well requires understanding comparative advantage intuitively, managing an attentionally demanding numerical environment, and avoiding biases such as anchoring. The paper argues its strong predictive power reflects that good managers excel at allocating workers according to comparative advantage — which the experiment directly identifies as a key mechanism.

Q: How do self-promoted managers perform relative to lottery-assigned managers? A: Self-promoted managers perform approximately 0.10 standard deviations below lottery managers, and this gap is robust across model specifications. The performance deficit is roughly equivalent to being assigned a manager whose fluid intelligence is one full standard deviation below average. This finding implies that the common organizational practice of selecting managers partly via self-nomination actively reduces team productivity.

Q: Why do self-promoted managers underperform? A: The paper attributes underperformance primarily to overconfidence. People strongly preferring management roles are significantly more overconfident than those without strong preferences (d = 0.41 sd, p < 0.01). Self-promoted managers specifically overestimate their social skills: among them, self-reported people skills are strongly negatively correlated with actual emotional perceptiveness on the RMET (r = -0.37, p < 0.001), and self-reported extraversion and political skill are negatively correlated with managerial performance (rho = -0.24 and -0.26, p < 0.05). None of these negative relationships appear among lottery managers.

Q: Who wants to be a manager, and does it differ by gender? A: The three variables most strongly correlated with wanting to be in charge are extraversion, risk appetite, and being male. The relationship between high extraversion and preference for management is driven largely by men. Women are much less likely to nominate themselves for leadership roles despite being equally or more effective on average — a finding consistent with broader experimental evidence on gender and leadership self-selection.

Q: How large are the potential gains from skill-based manager selection? A: Compared to self-promotion, selecting managers based on economic decision-making skill yields managers who are 0.6 standard deviations better in terms of estimated manager effects. In terms of group performance, this is equivalent to replacing an average worker in every group with a worker at the 99th percentile of individual productivity. Selecting on both economic decision-making and fluid intelligence outperforms random assignment, selection on social skills, or selection on worker task performance (the Peter Principle).

Q: What are the three mechanisms through which good managers improve team performance? A: First, monitoring: good managers (1 sd above average) reduce monitoring errors — defined as having a worker on a module substantially above the minimum score at task end — from 16% to 8% (bivariate correlation with manager performance = -0.40, p < 0.001). Second, optimal task allocation: the probability of finding the optimal comparative-advantage-based assignment is positively associated with manager performance (rho = 0.19, p < 0.01), and groups with always-optimal starting assignments score 0.52 sd higher than those with never-optimal assignments (p < 0.01). Third, worker motivation: team performance in the final two-minute period is about 50% more influential for overall outcomes than the first two minutes (p = 0.038), and 1-sd-above-average managers generate 0.6 more problems solved in the final period versus 0.3 in the first, consistent with differential motivational effects emerging over time.

Q: What is the Peter Principle, and how does this paper relate to it? A: The Peter Principle refers to the practice of promoting employees based on their performance as line workers rather than their suitability for management — promoting individuals to their level of incompetence. Benson et al. (2019) document this selection pattern empirically. This paper shows that selecting managers on worker task skill is inferior to selecting on economic decision-making skill or fluid intelligence, confirming that task skill is not the right criterion for manager selection even if it predicts individual worker output.

Q: How does the paper validate that manager effects are real and not noise? A: The paper uses randomization inference with 5,000 simulated allocations to compute p-values, obtaining p = 0.04 for the main manager effect. Robustness checks include controlling for pre-existing social relationships, manager risk appetite, variance of individual scores, and granular skill measures — all yielding estimates near 0.22 sd. A leave-one-out out-of-sample prediction test confirms manager contributions significantly predict held-out group performance (p < 0.01), while the analogous worker out-of-sample estimate is less than half the magnitude and not statistically significant.
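The logic of randomization inference can be sketched as follows. This is a simplified stand-in, not the paper's estimator: the test statistic here is the spread of manager-level mean outcomes rather than sigma_alpha from the full model, and manager labels are reshuffled freely across groups. The data, function names, and seed are illustrative; only the count of 5,000 simulated allocations comes from the summary.

```python
import random
import statistics

def manager_spread(manager_ids, outcomes):
    """Standard deviation of manager-level mean outcomes (a simple test statistic)."""
    by_manager = {}
    for manager, outcome in zip(manager_ids, outcomes):
        by_manager.setdefault(manager, []).append(outcome)
    means = [statistics.fmean(values) for values in by_manager.values()]
    return statistics.pstdev(means)

def randomization_p_value(manager_ids, outcomes, draws=5000, seed=0):
    """Share of re-drawn allocations whose statistic is at least the observed one."""
    rng = random.Random(seed)
    observed = manager_spread(manager_ids, outcomes)
    shuffled = list(manager_ids)
    at_least_as_large = 0
    for _ in range(draws):
        rng.shuffle(shuffled)  # simulate an alternative random allocation
        if manager_spread(shuffled, outcomes) >= observed:
            at_least_as_large += 1
    # Add-one correction counts the observed allocation itself.
    return (1 + at_least_as_large) / (1 + draws)

# Invented data: four managers, four groups each.
managers = ["A"] * 4 + ["B"] * 4 + ["C"] * 4 + ["D"] * 4
clustered_scores = [10] * 4 + [20] * 4 + [30] * 4 + [40] * 4  # strong manager effects
flat_scores = [5] * 16                                        # no manager effects

p_strong = randomization_p_value(managers, clustered_scores)
p_null = randomization_p_value(managers, flat_scores)
print(p_strong, p_null)
```

When outcomes cluster tightly by manager, almost no reshuffled allocation produces as much spread, so the p-value is small. When outcomes are identical, every allocation ties with the observed one and the p-value is 1.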

Q: What are the scope conditions on the experimental results? A: The experiment is conducted in a university lab in the UK with graduate students averaging 25 years of age and two years of work experience, limiting direct generalizability to experienced workers or senior management. The task lasts approximately 15 minutes, which may not capture longer-run managerial dynamics. Compensation equalized average earnings between managers and workers, which differs from most real-world settings. The authors note their effect-size estimates closely match Lazear et al. (2015) from a large employer, and that Herbst and Mas (2015) find lab peer-productivity experiments generalize to the field.

Manager Effect (sigma_alpha): The standard deviation of individual managers’ average causal contributions to group performance, estimated via repeated random assignment and conditioning on individual productive skill. Represents the impact of having a manager one standard deviation above average, estimated at approximately 0.23 standard deviations of group output.

Collaborative Production Task: A novel lab group task in which a manager and two workers solve problems across three modules (numerical, spatial, analytical reasoning), with team score defined as the minimum module score (weakest-link structure). Managers are responsible for worker assignment, monitoring, and motivation; workers face no financial performance incentives.

Economic Decision-Making Skill: Defined by Caplin et al. (2024) as the ability to make good resource allocation decisions, assessed via the Assignment Game in which participants must optimally assign workers to tasks under comparative advantage. The single strongest predictor of managerial performance in the lottery arm.

Monitoring Failure: Defined in the paper as having any group member working on a module at task end whose score is substantially greater (e.g., 10 points higher) than the minimum module score — meaning the worker’s effort is not contributing to the group score. Occurs in 16% of groups overall; managers one sd above average reduce this to 8%.
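The weakest-link score and the monitoring-failure flag are simple enough to express directly. In the sketch below, the 10-point gap follows the example threshold given in the definition. The module scores, the function names, and the representation of "who is working where at task end" as a list of active modules are assumptions for illustration.

```python
def team_score(module_scores):
    """Weakest-link rule: the team score is the minimum module score."""
    return min(module_scores.values())

def monitoring_failure(module_scores, active_modules, gap=10):
    """True if anyone is still working on a module already `gap` points above the minimum."""
    floor = team_score(module_scores)
    return any(module_scores[module] - floor >= gap for module in active_modules)

# Invented end-of-task state for one group.
scores = {"numerical": 40, "spatial": 25, "analytical": 38}

print(team_score(scores))
print(monitoring_failure(scores, active_modules=["numerical", "spatial"]))
print(monitoring_failure(scores, active_modules=["spatial"]))
```

In the first call a worker is still adding to the numerical module, which is 15 points above the binding spatial score, so that effort cannot raise the team score. In the second call all remaining effort goes to the weakest module, so no failure is flagged.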

Self-Promotion (as selection mechanism): A treatment condition in which the participant with the strongest stated preference for being manager (on a 1-10 scale) is assigned the managerial role. Contrasted with lottery assignment; self-promoted managers perform approximately 0.10 sd worse than lottery managers.

Overconfidence (in managerial context): The gap between self-assessed skill (particularly social/interpersonal skill) and objectively measured skill (e.g., RMET score). Self-promoters are significantly more overconfident (d = 0.41 sd), and overconfidence is strongly negatively correlated with actual emotional perceptiveness (r = -0.33, p < 0.001).

Comparative Advantage Allocation: The practice of assigning each worker to the module in which they have the highest relative (not absolute) performance advantage. Captured via whether a manager selects the optimal one-to-one assignment given pre-measured individual module scores; groups with always-optimal allocation score 0.52 sd higher.
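A small worked example shows why allocation by comparative advantage differs from allocation by absolute skill. The sketch assumes three people assigned one-to-one to three modules and takes the optimal plan to be the one maximizing the sum of pre-measured scores. Both the setup and the objective are assumptions, since the summary does not spell out the paper's exact criterion. Scores and the function name are invented.

```python
from itertools import permutations

def best_assignment(scores):
    """Brute-force the one-to-one person-to-module plan with the highest total score."""
    people = list(scores)
    modules = list(next(iter(scores.values())))
    best_total, best_plan = None, None
    for ordering in permutations(modules):
        total = sum(scores[person][module] for person, module in zip(people, ordering))
        if best_total is None or total > best_total:
            best_total, best_plan = total, dict(zip(people, ordering))
    return best_plan, best_total

# Invented pre-measured module scores. Person A is best at every module.
scores = {
    "A": {"numerical": 9, "spatial": 8, "analytical": 7},
    "B": {"numerical": 8, "spatial": 4, "analytical": 3},
    "C": {"numerical": 5, "spatial": 3, "analytical": 6},
}

plan, total = best_assignment(scores)
print(plan, total)
```

Person A has the highest numerical score but is assigned to the spatial module, because B loses far more than A does when moved off numerical. With three people there are only six possible plans, so brute force is sufficient.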

The Power of Proximity to Coworkers

This paper studies how physical proximity to coworkers affects on-the-job training and productivity, using software engineers at a Fortune 500 online retailer observed from 2019 to 2024. The authors exploit two quasi-experimental shocks to proximity: the office closures of 2020, which eliminated proximity differentials that previously existed across team types, and the firm’s subsequent return-to-office (RTO) mandates in 2022 and 2023, which restored proximity for co-located teams while leaving geographically distributed teams apart. The core identification strategy is a difference-in-differences design comparing engineers whose teams were co-located in a single headquarters building to those whose teams were split across two buildings a ten-minute walk apart, a distinction that became immaterial once offices closed.

The central finding is that sitting near teammates substantially increases the digital feedback engineers receive on their code. Before the office closures, engineers on co-located teams received 1.92 more comments per program (23.9%) in code review than engineers on multi-building teams. Once offices closed, this advantage narrowed by 1.47 comments per program (18.3%, p-value = 0.0026). Both percentages imply the same baseline of roughly 8 comments per program (1.92/0.239 ≈ 1.47/0.183 ≈ 8.03), so the closures erased about three-quarters of the pre-closure gap rather than 18.3% of it. The lost comments were disproportionately those predicted by a machine-learning classifier to be helpful, actionable, well-reasoned, and impactful: high-quality comments declined by 21–23%, more than the overall volume decline. Face-to-face and digital communication are complements, not substitutes: proximate engineers drew on a wider pool of reviewers and asked 48.4% more follow-up questions, a differential that vanished once offices closed.

Proximity’s effects are highly heterogeneous. Gains in feedback are concentrated among less-tenured, younger, and female engineers — those with the most to learn. Junior engineers on co-located teams lost 2.03 more comments per program upon office closure than junior engineers already on distributed teams (p-value = 0.001); young engineers lost 2.47 more comments (p-value = 0.0001). Female engineers lost 38.9% more comments than their distributed female counterparts (p-value < 0.0001), partly because women stop asking as many people for feedback when they cannot do so in person.

Proximity improves code quality for inexperienced engineers. Around the second RTO (three days per week), engineers on co-located teams became 2.2 percentage points less likely to add files subsequently deleted — a measure of churn — and 1.4 pp less likely to introduce bugs, relative to distributed teams (p-values of 0.041 and 0.022 respectively). These gains were roughly twice as large for less-tenured and younger engineers. The benefits persist: engineers who spent more pre-closure time on co-located teams continued to write higher-quality code during the fully remote period.

However, mentorship is costly for those who provide it. Senior engineers on co-located teams wrote 0.76 fewer programs per month in the main codebase before closures (p-value = 0.0005), a gap that closed when offices did and widened again during the second RTO. The firm faces a fundamental tradeoff: proximity accelerates junior engineers’ human capital development while reducing experienced engineers’ immediate coding output.

These dynamics shape hiring. The firm shifted toward hiring older, more experienced engineers during closures, buying talent it could no longer build in-house, and shifted back toward younger hires once offices reopened. Nationally, young college graduates in remotable occupations (classified per Dingel and Neiman, 2020) experienced a 0.88 pp increase in unemployment between 2017–2019 and 2022–2024, while older graduates saw a marginal decline of 0.11 pp. A triple-difference estimate finds a 0.65 pp greater increase in young workers’ unemployment in remotable versus non-remotable occupations (p-value = 0.029), a pattern that predates generative AI diffusion and is robust to controlling for AI exposure. By a back-of-the-envelope calculation, remote work accounts for an estimated 64% of the total unemployment increase among young college graduates over this period.

The paper also documents that proximity is fragile: a ten-minute walk between two buildings reduces feedback as much as being multiple states away, and even a single distant teammate imposes negative externalities on those who remain co-located, reducing their feedback by 1.71 comments per program (p-value = 0.095) via a “one Zoom, all Zoom” norm.

Q: What is the main identification strategy for the office-closure analysis, and what is the key parallel-trends evidence?

A: The authors compare engineers on co-located teams (all members in one headquarters building) to those on multi-building teams (split across two buildings a ten-minute walk apart), before and after the March 2020 office closures. Co-located teams lost more proximity when offices closed, while multi-building teams experienced a smaller shock, enabling a difference-in-differences design. Pre-closure trends in feedback are parallel across the two team types (Figure I), supporting the identifying assumption. Standard errors are clustered by team, the unit of treatment assignment.
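
The comparison reduces to a two-by-two difference in group means. The sketch below is a minimal illustration under stated assumptions: the function name did_estimate and all four cell means are hypothetical placeholders, not figures from the paper, and the authors' regression also clusters standard errors by team, which this arithmetic omits.

```python
def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Two-by-two difference-in-differences on group-by-period means."""
    treated_change = treated_post - treated_pre
    control_change = control_post - control_pre
    return treated_change - control_change


# Hypothetical mean comments per program; NOT values reported in the paper.
cells = {
    "colocated_pre": 10.0,       # co-located teams, offices open
    "colocated_post": 8.0,       # co-located teams, offices closed
    "multibuilding_pre": 8.0,    # multi-building teams, offices open
    "multibuilding_post": 7.5,   # multi-building teams, offices closed
}

effect = did_estimate(
    cells["colocated_pre"],
    cells["colocated_post"],
    cells["multibuilding_pre"],
    cells["multibuilding_post"],
)
print(f"DiD estimate: {effect:+.2f} comments per program")
```

With these placeholder values the co-located group falls by 2.0 comments and the multi-building group by 0.5, so the estimate is -1.5 comments per program.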

Q: How large is the effect of proximity on total code review feedback, and how is it broken down by feedback source?

A: Before closure, co-located engineers received 23.9% (1.92 comments per program) more feedback than multi-building engineers. The DiD estimate indicates that losing proximity reduced feedback by 18.3% (1.47 comments per program, p-value = 0.0026, Column 3 of Table II). This decline stems entirely from reduced feedback from teammates; there is no detectable effect on feedback from engineers on other teams — a placebo check that supports the identification strategy and rules out explanations based on differential project complexity.
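
Because each effect is reported both in levels and in percent, a reader can back out the comparison mean the two figures imply. The summary does not report this baseline directly, so the number below is an inference, and the helper name implied_baseline is hypothetical.

```python
def implied_baseline(level_effect, percent_effect):
    """Comparison mean implied by an effect stated in levels and in percent."""
    return level_effect / (percent_effect / 100.0)


# Pre-closure gap: 1.92 comments per program, stated as 23.9%.
pre_gap_base = implied_baseline(1.92, 23.9)

# DiD estimate: 1.47 comments per program, stated as 18.3%.
did_base = implied_baseline(1.47, 18.3)

print(f"baseline implied by pre-closure gap: {pre_gap_base:.2f}")
print(f"baseline implied by DiD estimate:    {did_base:.2f}")
```

Both pairs imply a mean of roughly 8 comments per program, which suggests the two percentages are expressed against the same base.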

Q: How does proximity affect the quality — not just the quantity — of code review comments?

A: Using a gradient-boosted decision tree trained on 5,377 human-labeled comments, the authors predict comment quality across all 174,014 comments. Losing proximity reduced comments predicted to be helpful, well-reasoned, actionable, and likely to change the code by 21–23%, exceeding the 18.3% overall volume decline. The comments that remained were also lower quality on average: 2.9 pp fewer were helpful (p-value = 0.039), 1.7 pp fewer explained their reasoning (p-value = 0.094), and 1.9 pp fewer were likely to change the code (p-value = 0.072).

Q: What mechanisms drive the complementarity between face-to-face interaction and digital feedback?

A: Proximity increases feedback on both the extensive and intensive margins. On the extensive margin, co-located engineers draw on a wider pool of reviewers, returning less frequently to the same commenter. On the intensive margin, losing proximity reduces follow-up questions by 48.4% (0.12 questions per program, p-value = 0.0083), accounting for roughly half of the total feedback decline. The other half comes from reduced initial reviewer feedback. References to other communication channels (e.g., Slack) within code reviews also decline when proximity is lost, confirming that face-to-face and digital communication are complements.

Q: How small a physical barrier is sufficient to reduce feedback substantially?

A: A ten-minute walk between two buildings on the same headquarters campus reduces feedback by as much as being multiple states away — both groups receive significantly less feedback than engineers whose entire team sits in the same building (Figure Ib). This finding aligns with research on academics showing that different floors or buildings reduce coauthorship, and extends it to daily teammates sharing projects.

Q: What are the externality effects of a single distant teammate?

A: Through the firm’s implicit “one Zoom, all Zoom” norm, even one teammate in a different location shifts all team meetings to video calls. Engineers in the same building exchange 14.5% less feedback when even one teammate is in another building versus when all teammates are co-located (p-value = 0.037). When a new hire transforms a co-located team into a multi-building one, feedback between the original co-located teammates drops by 1.71 comments per program (p-value = 0.095); adding a new co-located hire produces no such decline.

Q: How does the effect of proximity on feedback differ by engineer tenure, age, and gender?

A: Less-tenured engineers on co-located teams lost 2.03 more comments per program upon closure than less-tenured engineers on distributed teams (p-value = 0.001). Young engineers (under 29) on co-located teams lost 2.47 more comments per program than young distributed engineers (p-value = 0.0001). Female engineers on co-located teams lost 38.9% (3.71) more comments than female engineers on distributed teams (p-value < 0.0001), partly because women draw feedback from 14.7% fewer people when proximity is lost (p-value = 0.0078), compared to a negligible 2.6% decline for men. The extra feedback women receive in person is high-quality feedback, not rude or condescending commentary.

Q: How is the effect of proximity on code quality identified using the RTO design, and what are the magnitudes?

A: The RTO design compares engineers on co-located (same-city) teams to geographically-distributed teams across three periods: full closure, first RTO (two days per week), and second RTO (three days per week). The authors predict γ_closed ≈ 0 (office assignment irrelevant when closed) and γ_2nd_RTO > γ_1st_RTO (more in-office days means more proximity). Both predictions are confirmed. During the second RTO, co-located engineers were 2.2 pp less likely to add files later deleted (p-value = 0.041) and 1.4 pp less likely to introduce bugs (p-value = 0.022), with effects roughly twice as large for less-tenured and younger engineers.
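
One way to write the comparison behind these predictions is an interaction of co-location with period indicators. This is a sketch, not the authors' exact specification: the outcome y, the co-location indicator, the period effects lambda, and the error term are assumed notation, and any controls are omitted. Only the gamma coefficients and the two predictions are taken from the text above.

```latex
y_{it} = \sum_{p \in \{\text{closed},\ \text{1st RTO},\ \text{2nd RTO}\}}
         \gamma_{p}\, \text{CoLocated}_{i} \times \mathbf{1}[t \in p]
         + \lambda_{p(t)} + \varepsilon_{it},
\qquad
\gamma_{\text{closed}} \approx 0,
\quad
\gamma_{\text{2nd RTO}} > \gamma_{\text{1st RTO}}
```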

Q: Does the benefit of co-location on code quality persist after remote work resumes?

A: Yes. After all engineers returned to remote work, those who had been on co-located teams pre-closure were 2.37 pp less likely to write disposable code (p-value = 0.013) and 3.09 pp less likely to introduce bugs (p-value = 0.0012). Code quality improves monotonically with the number of pre-closure months spent on co-located teams (Figure A.5). These gaps persist when including current team fixed effects, meaning within the same post-closure team, the previously co-located engineer writes higher-quality code.

Q: What is the cost of mentorship for senior engineers, and how does it manifest in coding output?

A: Senior engineers on co-located teams wrote 0.76 fewer programs per month in the main codebase when offices were open (p-value = 0.0005). Once offices closed, this gap disappeared, and senior engineers who lost proximity to their teammates saw a relative increase in output of 0.58 programs per month (p-value = 0.0014). During the second RTO, engineers with more than sixteen months of tenure on co-located teams wrote fewer programs, while no significant difference emerged for less-tenured engineers. Overall, the DiD estimate indicates losing proximity to teammates increases immediate output by 0.48 programs per month (p-value = 0.0002).

Q: How does the firm’s hiring age distribution respond to changes in proximity?

A: When offices were closed, the firm shifted toward hiring older engineers: the share of hires under age 29 fell from over half pre-closure to less than a third during the closure. After the RTOs, the firm shifted back toward younger hires. Geographic variation reinforces this: headquarters-campus hires were 7–10 years younger than those hired into distributed roles when offices were open; this gap narrowed substantially during closures when everyone was far from teammates.

Q: Does proximity affect which engineers are poached by other firms?

A: Yes. During the office closures, 1.2% of co-located engineers were poached per month, compared to 0.9% of multi-building engineers of similar tenure, age, and engineering group (p-value = 0.044). By the end of the closure period, nearly a quarter of co-located engineers had been poached versus a sixth of multi-building engineers. There is a dose-response pattern: more pre-closure time on co-located teams predicts higher poaching rates. The effect is concentrated among younger and female engineers, consistent with their feedback building more transferable general human capital. Tenure does not moderate the poaching effect, consistent with less-tenured engineers’ feedback being more firm-specific.

Q: What does national unemployment data show about the scarring effects of remote work on young workers?

A: Between 2017–2019 and 2022–2024, young college graduates (under 29) in remotable occupations experienced a 0.88 pp increase in unemployment (p-value < 0.00001), while older graduates in the same occupations saw a marginal decline of 0.11 pp (p-value = 0.053). A triple-difference regression finds a 0.65 pp greater increase in young workers’ unemployment in remotable versus non-remotable occupations (p-value = 0.029). In a back-of-the-envelope calculation, scaling this estimate by the 61% share of young graduates in remotable jobs predicts a 0.4 pp increase in young college graduates’ overall unemployment, equal to 64% of the realized 0.63 pp increase.
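
The back-of-the-envelope step is simple enough to reproduce. The sketch below uses only the rounded figures quoted above; the variable names are illustrative. With rounded inputs the share comes out near 63%, slightly below the reported 64%, presumably because the paper works from unrounded estimates (an assumption, since the unrounded values are not given here).

```python
# Rounded inputs quoted in the summary.
triple_diff_pp = 0.65          # extra unemployment increase, remotable vs non-remotable
remotable_share = 0.61         # share of young graduates in remotable jobs
realized_increase_pp = 0.63    # realized increase in young graduates' unemployment

# Scale the occupation-level estimate to the whole population of young graduates.
predicted_pp = triple_diff_pp * remotable_share

# Fraction of the realized increase attributable to the remotable-occupation effect.
share_explained = predicted_pp / realized_increase_pp

print(f"predicted increase: {predicted_pp:.2f} pp")
print(f"share of realized increase: {share_explained:.0%}")
```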

Q: Is the unemployment increase among young workers in remotable jobs driven by generative AI rather than remote work?

A: The authors argue against AI as the primary driver on two grounds. First, the uptick in young workers’ unemployment in remotable occupations predates the rapid diffusion of generative AI. Second, the differential increase is not concentrated among occupations with the highest AI task exposure. The triple-difference estimate is robust to controlling for occupational AI exposure using the Eisfeldt, Schubert and Zhang (2023) index. The authors acknowledge that AI may become more important as it diffuses further.

Q: How do young workers’ own office attendance decisions reflect the value of proximity?

A: At the partner firm, engineers under 29 were 8.8 pp (37.6%) more likely to come into the office during the RTOs than older engineers when on co-located teams (solid line in Figure VIIa). This difference was roughly halved on geographically-distributed teams (p-value of difference = 0.0085), indicating that the draw is specifically proximity to teammates. Co-located managers raised attendance by 2.6 pp, while co-located teammates raised it by 5.1 pp. Nationally, Stack Overflow survey data show nearly half of engineers under 25 are in the office each day, versus a quarter of older engineers (p-value < 0.00001).

Q: What does the paper imply about why remote work was rare before the pandemic despite workers’ stated preferences for it?

A: The paper offers a resolution: firms may have recognized that the value of the office lies in training for tomorrow and improving the quality — not the quantity — of work today. Remote work boosts immediate output, especially for experienced workers, but it reduces mentorship and long-run skill development. The tradeoff between current and future productivity, and between individual and collective returns to human capital, explains why firms historically resisted remote work even when workers preferred it and short-run output was unaffected.

Q: What are the implications for gender equity in remote work?

A: The findings suggest remote work has ambiguous gender effects. While remote work may help working mothers remain in the workforce, it appears costly for young women’s professional development, which is especially sensitive to physical proximity. Women receive substantially more high-quality feedback when co-located, draw feedback from a wider network in person, and lose disproportionately more feedback when proximity is lost. Young female engineers on co-located teams were also disproportionately poached — suggesting their human capital gains from co-location are more general and transferable.

Code review feedback: The digital comments engineers exchange when reviewing each other’s code before it is merged into the live codebase; the paper’s primary measure of on-the-job training and mentorship investment, distinct from mere volume because the authors also classify comments by helpfulness, reasoning, actionability, and expected impact using supervised machine learning.

Co-located team: A team in which all members are assigned to the same office building; the treatment group in the difference-in-differences designs, distinguished from multi-building teams (split across two headquarters buildings, a ten-minute walk apart) and geographically-distributed teams (members in different cities or permanently remote).

One Zoom, all Zoom norm: The implicit team practice of holding all meetings virtually if any single teammate cannot be physically present; the mechanism by which one distant colleague generates negative externalities for the remaining co-located teammates, reducing their in-person interaction and feedback.

Proximity fragility: The finding that even small physical barriers — a ten-minute walk between buildings — reduce feedback as much as being multiple states away, implying that the relationship between physical distance and mentorship is highly nonlinear near zero.

Churn (disposable code): Files that are added by an engineer but deleted within the subsequent six months, either because the code was poorly structured or because it introduced a feature later abandoned; used as one of two code quality proxies in the RTO analysis (occurring in 15% of programs).

Bugs (immediate reversions): Programs that are immediately and fully reverted after being merged, typically indicating the engineer’s changes precipitated an emergency requiring rollback to an earlier version; used as the more serious of the two code quality proxies (occurring in 3.5% of programs).

Scarring effects: The persistent adverse impact on young workers’ human capital and labor market outcomes from reduced mentorship during the remote work period; manifested both as lower code quality at the individual level and higher unemployment rates nationally among young college graduates in remotable occupations.

Remotable occupation: An occupation classified by Dingel and Neiman (2020) as feasibly performed from home; used to construct the national triple-difference analysis comparing age gaps in unemployment across remotable and non-remotable jobs before and after the pandemic.