The disruption index is biased by citation inflation

Alexander M. Petersen; Felber Arroyave; Fabio Pammolli

TL;DR

The observed decline in disruption over time in citation networks may reflect citation inflation rather than a true loss of disruptive impact. The authors combine a deductive analysis of the disruption metric $CD_p$, empirical evaluation on the MAG dataset, and a computational Monte Carlo model incorporating citation-inflation and triadic-closure dynamics, plus an openly available synthetic network ensemble. They show that as reference-list length $r(t)$ grows and extraneous citations accumulate ($R_k$), the denominator of $CD_p$ inflates and drives $CD_p$ toward 0, a bias that persists even for $CD_p^{nok}$; turning off citation inflation or capping references can restore time-stationary behavior, and the $CD_5$ distribution aligns with an Extreme Value law. The work provides an openly available resource to test alternative disruption indices, discusses normalization strategies for time-invariant comparisons, and offers policy considerations such as limiting reference list lengths to temper citation inflation.

Abstract

A recent analysis of scientific publication and patent citation networks by Park et al. (Nature, 2023) suggests that publications and patents are becoming less disruptive over time. Here we show that the reported decrease in disruptiveness is an artifact of systematic shifts in the structure of citation networks unrelated to innovation system capacity. Instead, the decline is attributable to 'citation inflation', an unavoidable characteristic of real citation networks that manifests as a systematic time-dependent bias and renders cross-temporal analysis challenging. One driver of citation inflation is the ever-increasing length of reference lists over time, which in turn increases the density of links in citation networks and causes the disruption index to converge to 0. A second driver is attributable to shifts in the construction of reference lists, which are increasingly impacted by self-citations that increase the rate of triadic closure in citation networks, and thus confound efforts to measure disruption, which is itself a measure of triadic closure. Combined, these two systematic shifts render the disruption index temporally biased and unsuitable for cross-temporal analysis. This systematic bias further stymies efforts to correlate disruption with other measures that are also time-dependent, such as team size and citation counts. To demonstrate this fundamental measurement problem, we present three complementary lines of critique (deductive, empirical and computational modeling), and also make available an ensemble of synthetic citation networks that can be used to test alternative citation-based indices for systematic bias.
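
The convergence to 0 claimed in the abstract can be illustrated with a short worked identity. This is a sketch, not a reproduction of the paper's equations: it uses the standard definition of the index in terms of the counts $N_{i}$, $N_{j}$ and $N_{k}$ described in the Figure 1(a) caption, and it assumes that the extraneous citation rate is the ratio $R_{k} = N_{k}/(N_{i}+N_{j})$, which is consistent with the Figure 2(g) statement that $N_{i}+N_{j}$ is the denominator of $CD^{\text{nok}}$.

$$
CD_{p} = \frac{N_{i} - N_{j}}{N_{i} + N_{j} + N_{k}}, \qquad
CD^{\text{nok}}_{p} = \frac{N_{i} - N_{j}}{N_{i} + N_{j}}, \qquad
R_{k} = \frac{N_{k}}{N_{i} + N_{j}}
$$

$$
CD_{p} = \frac{N_{i} - N_{j}}{(N_{i} + N_{j})\,(1 + R_{k})} = \frac{CD^{\text{nok}}_{p}}{1 + R_{k}},
\qquad \text{so} \qquad
|CD_{p}| \leq \frac{1}{1 + R_{k}} \rightarrow 0 \ \text{ as } \ R_{k} \rightarrow \infty .
$$

Under these assumptions, any growth in $R_{k}$ shrinks $|CD_{p}|$ regardless of how $N_{i}$ and $N_{j}$ compare, because $|CD^{\text{nok}}_{p}| \leq 1$ by construction.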

Paper Structure (7 sections, 4 equations, 5 figures)

Quantitative definition of $CD$ and a deductive critique
Empirical critique
Computational critique
Generative network model featuring citation inflation and redirection
Computational simulation results
Discussion
Appendix: Reproduction of statistical regularities in a real-world citation network -- the Web of Science

Figures (5)

Figure 1: Empirical analysis of the disruption index. (a) Schematic of the disruption index calculation based upon the sub-network revolving around the source publication/patent $p$. The disruption index $CD_{p}$ can be calculated by identifying three non-overlapping subsets of $\{c\}_{p} = \{c\}_{i} \cup \{c\}_{j} \cup \{c\}_{k}$, of sizes $N_{i}$, $N_{j}$ and $N_{k}$, respectively. The subset $i$ refers to members of $\{c\}_{p}$ that cite the focal $p$ but do not cite any elements of $\{r\}_{p}$, and thus measures the degree to which $p$ disrupts the flow of attribution to foundational members of $\{r\}_{p}$. The subset $j$ refers to members of $\{c\}_{p}$ that cite both $p$ and $\{r\}_{p}$, measuring the degree of consolidation that manifests as triadic closure in the subnetwork (i.e., network triangles formed between $p$, $\{r\}_{p}$, $\{c\}_{j}$). The subset $k$ refers to members of $\{c\}_{p}$ that cite $\{r\}_{p}$ but do not cite $p$. (b) Average disruption index, $CD_{5}(t)$ calculated using a 5-year citation window based upon $29.5\times 10^{6}$ articles from the MAG dataset from 1945-2012. (c) Average number of references per paper per year, $r(t)$, which increased by a factor of 4 over the 6-year period shown. (d) Average extraneous citation rate, $R_{k}(t) \gg 1$ that is central to the critique of $CD$, and derives from the increasing citation count of highly-cited papers belonging to the reference list $\{r\}_{p}$ which systematically inflates the size of the extraneous set $\{c\}_{k}$. (inset) $R_{k}(t)$ grows roughly proportional to $r(t)$. (e) Results of linear regression model implemented in STATA 13 for dependent variable $CD_{p,5}$, controlling for $r_{p}$ and secular growth by way of yearly fixed-effects. Publication years are within the 20-year range 1990-2009; covariates are included following a logarithmic transform. 
(d) Marginal effects calculated with all other covariates held at their mean values, showing that $CD_{5}$ is negatively correlated with the log of the number of references, $\ln r_{p}$. (e) $CD_{5}$ is positively correlated with the log of the number of coauthors, $\ln k_{p}$.
Figure 2: Numerical simulation of growing citation networks elucidates roles of citation inflation and strategic citation practice. (a) Model system evolved over $T=150$ periods (representing years), using growth parameters estimated for the entire Clarivate Analytics Web of Science citation network [pan2016memory]. (b) Schematic of the citation model comprising two citing mechanisms: (i) direct citations, and (ii) redirected citations made via the reference list $\{r\}_{b}$ of an intermediate item $b$. Type (ii) references give rise to triadic closure corresponding to the $N_{j}$ factor in $CD_{p}$. (c) The rate of type (ii) references is controlled by the parameter $\beta(t)$, which quantifies the fraction of links in the citation network directly following this 'consolidation' mechanism [funk2017dynamic, park2023papers], which yields more negative $CD_{p}$ values. To disentangle the roles of citation inflation (owing to $g_{r}>0$) from shifts in scholarly citation practice (owing to $\partial_{t}\beta(t)>0$), we compare four scenarios: scenarios (1,2) (gray and black curves) feature no citation inflation $(g_{r}=0)$; (2,3) compare $\beta(t)=0$ and $\beta(t) = t/400$; and (3,4) (cyan and blue curves) compare the effects of different citation windows (CW). (d) Each curve is the average $CD_{CW}(t)$ calculated for a single synthetic network. (e) Average $CD^{\text{nok}}_{5}(t)$. (f) $R_{k}(t)$ is the average rate of extraneous citations, which increases as either $r(t)$ or CW increases. (inset) High linear correlation between $r(t)$ and $R_{k}(t)$ shows that the decreasing trend in $CD(t)$ is largely attributable to citation inflation. (g) The average value of $N_{ij}(t) = N_{i} +N_{j}$ (which defines the denominator of $CD^{\text{nok}}$) also systematically increases, and so neglecting the term $N_{k}$ does not solve the fundamental issue of citation inflation (CI).
Figure 3: Hypothetical publishing policy intervention reveals effect of capped reference list lengths on $CD$. (a) Evolution of network size in scenarios (5,6) where the number of references per paper is capped at $r(t\geq T^{*}) = 25$ after $T^{*}=92$, such that the growth in the total citations produced per year depends solely on the growth of $n(t)$. (b) Average $CD_{CW}(t)$ for scenarios (3)-(6). Immediately after $T^{*}=92$ the $CD(t)$ trends for intervention scenarios (5,6) reverse from decreasing to increasing. (c) The divergence in $CD(t)$ trends is attributable to the taming of citation inflation, which stabilizes $R_{k}(t)$. (d) The frequency distribution $P_{t}(CD_{5})$ aggregated over 10-period intervals indicated by the color gradient; vertical dashed lines indicate distribution mean. Comparing with Fig. A2(a), it is clear that the $P_{t}(CD_{5})$ distribution for Scenario (5) becomes significantly more stable for $t \geq T^{*}$, with variation due to the residual citation inflation associated with publication volume growth ($g_{n}$). (e) The stability of the $P_{t}(CD_{5})$ distribution after $T^{*}$ suggests that quantitative properties of the Extreme Value (Fisher-Tippett) distribution could be used to develop time-invariant disruption measures; orange curves represent the best-fit Fisher-Tippett distribution model.
Figure A1: Citation redirection model -- reproduction of empirical statistical regularities that characterize real citation patterns. Figure and caption reproduced with permission from Pan et al. [pan2016memory]. Shown are various properties of the synthetic citation network that can be compared with empirical trends. We evolved the simulation using the parameters: $T\equiv$ 200 MC periods ($\sim$ years), $n(0)\equiv$ 10 initial publications, $r(0)\equiv$ 1 initial references, exponential growth rates $g_{n} \equiv 0.033$ and $g_{r} \equiv 0.018$, secondary redirection parameter $\beta \equiv 1/5$ (corresponding to $\lambda=1/4$), citation offset $C_{\times}\equiv6$, and life-cycle decay factor $\alpha \equiv 5$. At the final period $t=T$, the final cohort has size $n(T)=7112$ new publications, $r(T)=35$ references per publication, and final citation network size $N(200)$ = 218,698 publications (nodes) and $R(T)$ = 5,025,106 total references/citations (links). (a) The size of the system in each MC period $t$. (b) Growth of the mean reference distance $\langle \Delta_{r} \rangle$. (c) The fraction $f_{c\leq C}(t \vert \tau=5)$ of publications which have $C$ or fewer citations at cohort age $\tau=5$. (d) The citation life cycle, measured here by the mean number of new citations $\tau$ periods after entry (publication). The different curves correspond to the publication cohort entry period $t$. For sufficiently large $t$ the life cycle decays exponentially. (e) Growth of the logarithmic mean (location) value $\mu_{LN,t}$ and the relative stability of the logarithmic standard deviation (scale) value $\sigma_{LN,t}$. $\mu_{LN,t}=\langle \log (c_{p,t}) \rangle$ and $\sigma_{LN,t}= \sigma[ \log (c_{p,t})]$ are the logarithmic mean and standard deviation calculated across all $p$ within each age cohort $t$. (f) The distribution $P(z_{p,t})$ of the normalized citation impact $z_{p,t}$. For visual comparison we plot the Normal distribution $N(\mu=0,\sigma=1)$. 
(g) The increasing citation share $f_{\sum c}$ -- the fraction of the total citations received by all publications from cohort $t$ -- of the top $1\%$ of publications from cohort $t$ (ranked at cohort age $\tau=10$). (h) The decreasing citation share $f_{\sum c}$ of the bottom $75\%$ of publications. (i) The cumulative citation count $c_{p}(t)$ of the top 200 publications $(p)$ from the interval $t=[170,179]$, ranked according to $c_{p}(t=180)$. The dashed line represents the average citations for $p$ from the same cohort over the same period.
Figure A2: The distribution of $CD_{5}$ derived from synthetic citation networks follows the Extreme Value (Fisher-Tippett) distribution but is not stable over time. (a) The probability density function $P_{t}(CD_{5})$ is calculated using values aggregated over 10-period intervals indicated by $t$, with color gradient indicating each 10-period interval. Vertical dashed lines indicate distribution mean. (b) Each 10-period $P_{t}(CD_{5})$ is shown with the best-fit Extreme Value (Fisher-Tippett) distribution (orange curve), estimated using the Mathematica 13.1 function FindDistribution. The Extreme Value distribution is a better fit as $t$ increases, pointing to a strategy for normalizing $CD_{5}$ that supports cross-temporal analysis in the same way that the properties of the log-normal distribution can be used to normalize citation counts collected over different periods [petersen_reputation_2014, petersen_citationinflation_2018, HBP_2020]. (c) Distribution of an alternative disruption index, $P_{t}(CD^{\text{nok}}_{5} )$, calculated using the same temporal periods as in (a), shows that the vast majority of publications according to this measure are highly disruptive.
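
The subset definitions in the Figure 1(a) caption can be made concrete with a small sketch. The Python below is illustrative only and is not the authors' code: the function names, the toy network, and the choice to ignore publication dates and citation windows are all assumptions made for brevity. It counts $N_{i}$, $N_{j}$ and $N_{k}$ for a focal paper and then shows how adding papers that cite only the focal paper's references lowers $CD_{p}$ while leaving $CD^{\text{nok}}_{p}$ unchanged.

```python
from typing import Dict, Set, Tuple


def disruption_counts(refs: Dict[str, Set[str]], focal: str) -> Tuple[int, int, int]:
    """Return (N_i, N_j, N_k) for `focal`.

    `refs` maps each paper id to the set of paper ids it cites.
    Simplification (assumption): no publication dates or citation window;
    the focal paper and its own references are excluded from the count.
    """
    focal_refs = refs[focal]
    n_i = n_j = n_k = 0
    for paper, cited in refs.items():
        if paper == focal or paper in focal_refs:
            continue
        cites_focal = focal in cited
        cites_refs = bool(cited & focal_refs)
        if cites_focal and cites_refs:
            n_j += 1  # cites the focal paper and its references (triadic closure)
        elif cites_focal:
            n_i += 1  # cites the focal paper only
        elif cites_refs:
            n_k += 1  # cites the references only (extraneous set)
    return n_i, n_j, n_k


def cd_index(n_i: int, n_j: int, n_k: int) -> float:
    """Disruption index with the extraneous term N_k in the denominator."""
    denom = n_i + n_j + n_k
    return (n_i - n_j) / denom if denom else float("nan")


def cd_nok(n_i: int, n_j: int) -> float:
    """Variant that drops N_k from the denominator."""
    denom = n_i + n_j
    return (n_i - n_j) / denom if denom else float("nan")


# Hypothetical toy network: focal paper "p" cites "r1" and "r2".
toy: Dict[str, Set[str]] = {
    "p": {"r1", "r2"},
    "r1": set(),
    "r2": {"r1"},
    "a": {"p"},        # subset i
    "b": {"p", "r1"},  # subset j
    "c": {"r2"},       # subset k
    "d": {"r1"},       # subset k
    "e": {"p"},        # subset i
}

base_counts = disruption_counts(toy, "p")

# Mimic citation inflation: five more papers cite a reference of p but not p.
inflated = dict(toy)
for m in range(5):
    inflated[f"x{m}"] = {"r1"}
inflated_counts = disruption_counts(inflated, "p")

print(base_counts, cd_index(*base_counts))
print(inflated_counts, cd_index(*inflated_counts))
print(cd_nok(base_counts[0], base_counts[1]))
```

In this toy case the counts move from (2, 1, 2) to (2, 1, 7), so $CD_{p}$ halves from 0.2 to 0.1 even though nothing about how the focal paper itself is cited has changed.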
