Scientific productivity as a random walk

Sam Zhang; Nicholas LaBerge; Samuel F. Way; Daniel B. Larremore; Aaron Clauset

TL;DR

The paper addresses why the canonical early-career rise and late-career decline in scientific productivity appear in averages despite wide heterogeneity in individual trajectories. It shows that modeling productivity as a discrete-time random walk with career-stage-dependent variance, where early careers have higher variance than later ones, reproduces the canonical trajectory and captures much of the observed variability. A simplified two-stage model demonstrates the mechanism ($\alpha_1>\alpha_2$) by which the aggregate pattern emerges, while a full model with four inferred career stages fits the empirical distributions of annual productivity $q_t$ and its year-to-year changes $\delta_t$. The full model closely matches aggregate and several individual-level features, with some residual gaps that are likely due to non-Markovian factors. These findings highlight the role of contingent factors and variance dynamics in shaping scientific productivity, and they suggest policy avenues for steering incentives and opportunities while acknowledging substantial randomness in research trajectories.

Abstract

The expectation that scientific productivity follows regular patterns over a career underpins many scholarly evaluations, including hiring, promotion and tenure, awards, and grant funding. However, recent studies of individual productivity patterns reveal a puzzle: on the one hand, the average number of papers published per year robustly follows the "canonical trajectory" of a rapid rise to an early peak followed by a gradual decline, but on the other hand, only about 20% of individual productivity trajectories follow this pattern. We resolve this puzzle by modeling scientific productivity as a parameterized random walk, showing that the canonical pattern can be explained as a decrease in the variance in changes to productivity in the early-to-mid career. By empirically characterizing the variable structure of 2,085 productivity trajectories of computer science faculty at 205 PhD-granting institutions, spanning 29,119 publications over 1980--2016, we (i) discover remarkably simple patterns in both early-career and year-to-year changes to productivity, and (ii) show that a random walk model of productivity both reproduces the canonical trajectory in the average productivity and captures much of the diversity of individual-level trajectories. These results highlight the fundamental role of a panoply of contingent factors in shaping individual scientific productivity, opening up new avenues for characterizing how systemic incentives and opportunities can be directed for aggregate effect.
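The mechanism described above can be illustrated with a minimal simulation sketch. It is not the authors' code. It assumes exponentially distributed first-year productivity, Laplace-distributed yearly changes with drift `mu`, a larger scale `alpha_1` before a hypothetical `switch_year` and a smaller scale `alpha_2` afterwards, and productivity clamped at zero. The function name, the parameter values, and the clamping rule are all assumptions chosen for illustration.

```python
import numpy as np

def simulate_two_stage(n_careers=1000, n_years=20, switch_year=5,
                       mu=-1.0, alpha_1=4.0, alpha_2=1.0,
                       initial_mean=2.0, seed=0):
    """Simulate productivity trajectories as a two-stage random walk.

    All parameter values are illustrative assumptions, not fitted estimates.
    Returns an array of shape (n_careers, n_years).
    """
    rng = np.random.default_rng(seed)
    q = np.empty((n_careers, n_years))
    # Assumed: first-year productivity is exponentially distributed.
    q[:, 0] = rng.exponential(scale=initial_mean, size=n_careers)
    for t in range(1, n_years):
        # Early-career steps use the larger scale alpha_1, later steps alpha_2.
        scale = alpha_1 if t < switch_year else alpha_2
        delta = rng.laplace(loc=mu, scale=scale, size=n_careers)
        # Assumed: productivity cannot fall below zero.
        q[:, t] = np.maximum(q[:, t - 1] + delta, 0.0)
    return q

if __name__ == "__main__":
    trajectories = simulate_two_stage()
    mean_by_year = trajectories.mean(axis=0)
    print("career year with highest average productivity:",
          int(mean_by_year.argmax()))
```

Under these assumptions, the floor at zero combined with large early-career variance pushes the average upward even though the drift is negative. Once the variance shrinks, the negative drift dominates and the average declines, while individual trajectories remain varied.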

Paper Structure (9 sections, 1 equation, 4 figures)

Introduction
Data
Results
Distribution of productivity changes
Modeling the canonical trajectory
Modeling empirical productivity trajectories
Discussion
Acknowledgments
Author contributions

Figures (4)

Figure 1: Empirical productivity data. (A) An exponential distribution (dashed black line) accurately fits the empirical first-year productivity (pink histogram). The inset displays the estimated rate parameter against the density of estimated rates in 1,000 bootstrap replicas. (B-D) The empirical distributions of productivity changes (pink histograms), shown on semi-log axes for ranges of career age, along with fitted Laplace distributions (dashed black lines). (E) The average productivity for the same set of researchers, showing the "canonical trajectory" of a rapid rise followed by a gradual decline or leveling off, depicted as means of time-adjusted productivity for each career age with 95% bootstrap confidence intervals. Brackets indicate the range of career ages that were grouped together for the density plots: (A) productivity in year zero, and then changes of productivity in (B) years 1--4, (C) years 5--7, and (D) years 8--20.
Figure 2: Reproducing canonical trajectories with a simplified model. (A) Simulating $N=400$ trajectories for each pair of $\alpha_1$ and $\alpha_2$ with $\mu=-1$ fixed, we display the fraction of those trajectories that are canonical. Some regions of the parameter space generate non-canonical trajectories (B, D, E), while others generate more canonical trajectories on average (C, F). Shaded intervals denote pointwise 95% confidence intervals for $N=1000$ simulations at those parameters.
Figure 3: Fitting the empirical data. (A) Average productivity by career year for real and simulated trajectories, where shaded ribbons denote 95% confidence intervals. Dashed gray lines denote estimated career change points (at years 4, 7, and 13). Above, the bootstrap distribution of change points across $1000$ bootstrap iterations, where the bootstrap is conducted at the individual level. (B) Distribution of the years with greatest productivity among the full empirical and simulated trajectories. Distributions are similar across the entire career ($\text{KS}=0.04$; $p=0.44$).
Figure 4: Comparing the random walk model to empirical data. (A) Distributions of within-career standard deviations of productivity for full empirical and simulated trajectories, showing that empirical productivity variation tends to be slightly smaller ($\text{KS}=0.14$; $p < 0.001$), even if we omit zeros. (B) The distribution of annual productivities (full trajectories), showing a close match between empirical and simulated careers for all values except zero. Black bars indicate the binomial 95% Wald confidence intervals for the probability of zero publications. (C) Distributions of productivity of empirical and simulated trajectories at career year 5. Inside the violin plot, the white circle indicates the median, the thick bar indicates the interquartile range, and the thin bar indicates the centered 95% containment interval. By career year 5, the simulated trajectories tend to have fewer publications than the empirical trajectories on average ($t=0.16$; $p < 0.001$), and the difference is especially pronounced in the tail of the most productive individuals. (D) Distributions of career years with zero publications within full empirical and simulated trajectories. The proportions of simulated and empirical trajectories with exactly one zero are similar, but more empirical trajectories than simulated trajectories exhibit more than one zero.
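
The Figure 1 caption mentions an exponential fit to first-year productivity, Laplace fits to productivity changes, and bootstrap replicas of the rate estimate. One way such fits could be computed is sketched below using standard maximum-likelihood estimators. This is an illustration on synthetic data under assumed parameter values, not the authors' fitting procedure, and the function names are hypothetical.

```python
import numpy as np

def fit_exponential_rate(x):
    """Maximum-likelihood rate of an exponential distribution: 1 / sample mean."""
    x = np.asarray(x, dtype=float)
    return 1.0 / x.mean()

def fit_laplace(d):
    """Maximum-likelihood Laplace fit.

    Location is the sample median; scale is the mean absolute deviation
    from that median.
    """
    d = np.asarray(d, dtype=float)
    loc = np.median(d)
    scale = np.abs(d - loc).mean()
    return loc, scale

def bootstrap_rates(x, n_boot=1000, seed=0):
    """Exponential rate estimates from resampling the data with replacement."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    idx = rng.integers(0, len(x), size=(n_boot, len(x)))
    return 1.0 / x[idx].mean(axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(42)
    # Synthetic stand-ins for the empirical data (assumed values).
    first_year = rng.exponential(scale=2.0, size=2000)
    changes = rng.laplace(loc=0.1, scale=1.5, size=2000)
    print("exponential rate:", fit_exponential_rate(first_year))
    print("Laplace (loc, scale):", fit_laplace(changes))
    print("bootstrap rate 95% interval:",
          np.percentile(bootstrap_rates(first_year), [2.5, 97.5]))
```

Fitting `fit_laplace` separately to the changes within each range of career ages would give one scale estimate per stage, which is the quantity the random walk model varies across a career.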