Dataset Artefacts are the Hidden Drivers of the Declining Disruptiveness in Science

Vincent Holst; Andres Algaba; Floriano Tori; Sylvia Wenmackers; Vincent Ginis

TL;DR

This reanalysis shows that the reported decline in disruptiveness can be attributed to a relative decline in database entries with zero references, and that, when the Monte Carlo simulations are properly evaluated, even random citation behaviour replicates the observed decline in disruptiveness.

Abstract

Park et al. [1] reported a decline in the disruptiveness of scientific and technological knowledge over time. Their main finding is based on the computation of CD indices, a measure of disruption in citation networks [2], across almost 45 million papers and 3.9 million patents. Due to a factual plotting mistake, database entries with zero references were omitted in the CD index distributions, hiding a large number of outliers with a maximum CD index of one, while keeping them in the analysis [1]. Our reanalysis shows that the reported decline in disruptiveness can be attributed to a relative decline of these database entries with zero references. Notably, this was not caught by the robustness checks included in the manuscript. The regression adjustment fails to control for the hidden outliers as they correspond to a discontinuity in the CD index. Proper evaluation of the Monte-Carlo simulations reveals that, because of the preservation of the hidden outliers, even random citation behaviour replicates the observed decline in disruptiveness. Finally, while these papers and patents with supposedly zero references are the hidden drivers of the reported decline, their source documents predominantly do make references, exposing them as pure dataset artefacts.
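To make the role of zero-reference entries concrete, the following is a minimal sketch of a CD index computation, assuming the standard three-way count of later works (citing only the focal work, citing both the focal work and its references, citing only its references). The function name `cd_index` and the dictionary-based input are illustrative assumptions, and the five-year forward citation window behind $\mathrm{CD}_5$ is not modelled here.

```python
from typing import Dict, Optional, Set


def cd_index(focal: str,
             references: Set[str],
             later_works: Dict[str, Set[str]]) -> Optional[float]:
    """Sketch of a CD index for one focal work (hypothetical helper).

    later_works maps each later work to the set of works it cites.
    Returns None when no later work cites the focal work or its references.
    """
    n_focal_only = 0   # cites the focal work but none of its references
    n_both = 0         # cites the focal work and at least one reference
    n_refs_only = 0    # cites a reference but not the focal work
    for cited in later_works.values():
        cites_focal = focal in cited
        cites_refs = bool(cited & references)
        if cites_focal and cites_refs:
            n_both += 1
        elif cites_focal:
            n_focal_only += 1
        elif cites_refs:
            n_refs_only += 1
    total = n_focal_only + n_both + n_refs_only
    if total == 0:
        return None  # undefined: nothing to average over
    return (n_focal_only - n_both) / total


# Toy citation data (synthetic, for illustration only).
later = {"X": {"P"}, "Y": {"P", "A"}, "Z": {"B"}}

with_refs = cd_index("P", {"A", "B"}, later)  # one work in each category
zero_refs = cd_index("P", set(), later)       # reference list missing
no_citers = cd_index("P", set(), {})          # never cited
```

With an empty reference list, only the first category can be non-empty, so the index is either exactly one or undefined. This is the discontinuity at zero references that the abstract describes.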

Paper Structure (8 sections, 5 equations, 13 figures, 2 tables)

Extended Data
The CD index
Regression adjustment
Monte Carlo simulations
DBLP citation network
Different forward citation windows
Normalized $\mathrm{CD}_5$ indices
Random paper and patent samples

Figures (13)

Figure 1: Distribution of the $\mathrm{CD}_5$ index with vs without the hidden outliers and its impact on the apparent decline of disruptive science and technology. This figure shows that $\mathrm{CD}_5=1$ papers and patents are driving the reported decline in the disruptiveness of scientific and technological knowledge over time for the Web of Science data source (with $22,479,429$ papers) and the PatentsView data source (with $2,926,923$ patents). For PatentsView, we also have access to sufficient metadata to exclude patents that make zero references, similarly impacting the decline. a, The distribution of the $\mathrm{CD}_5$ index for papers in Web of Science as presented in Park et al. [1], created using the binwidth parameter in seaborn 0.11.2. This version of the library contains a bug that silently drops the largest data points ($1$ in this case) when the binwidth parameter is specified \cite{seaborn2023pullrequest}. b, The correct histogram for papers when using the bins parameter in seaborn 0.11.2. A peak at $\mathrm{CD}_5=1$ is revealed with $972,161$ additional papers. c, The time evolution of the average $\mathrm{CD}_5$ index for papers. When dropping the hidden outliers with $\mathrm{CD}_5=1$, the decline in disruptiveness almost completely disappears. The shaded bands correspond to $95\%$ confidence intervals. Finally, note that the curve without $\mathrm{CD}_5=1$ papers corresponds to (a), the histogram presented in Park et al. [1]. d--f, The equivalent plots for PatentsView revealing $142,362$ additional patents with $\mathrm{CD}_5=1$. When dropping the outliers with $\mathrm{CD}_5=1$, the decline in disruptiveness reduces substantially. Unlike Web of Science, the PatentsView data source provided sufficient metadata to exclude patents with zero references, similarly impacting the data as removing outliers with $\mathrm{CD}_5=1$ (Fig. 2 and Extended Data Fig. 2). Finally, note again that the curve without $\mathrm{CD}_5=1$ patents corresponds to (d), the histogram presented in Park et al. [1].
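The effect of a histogram that stops short of the maximum can be checked with a simple count comparison. The sketch below uses NumPy only and synthetic values; it does not reproduce seaborn's internal bin computation, and the helper name `count_dropped` is an assumption made for illustration. It relies on the fact that `numpy.histogram` ignores values lying outside the supplied bin edges.

```python
import numpy as np


def count_dropped(values: np.ndarray, edges: np.ndarray) -> int:
    """Number of values that fall outside the given histogram bin edges."""
    counts, _ = np.histogram(values, bins=edges)
    return len(values) - int(counts.sum())


# Toy CD values (synthetic, for illustration only).
toy = np.array([-0.5, 0.0, 0.0, 0.25, 1.0, 1.0, 1.0])

# Edges that end below the maximum versus edges that include it.
short_edges = np.arange(-1.0, 1.0, 0.25)  # last edge is 0.75
full_edges = np.linspace(-1.0, 1.0, 9)    # last edge is 1.0

dropped_short = count_dropped(toy, short_edges)
dropped_full = count_dropped(toy, full_edges)

# The plotted subset and the analysed data can have very different means.
mean_all = toy.mean()
mean_visible = toy[toy < 1.0].mean()
```

Comparing the sum of the bin counts with the number of data points is a cheap sanity check that would flag this kind of silent omission before a figure is published.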
Figure 2: The reason why the robustness checks in Park et al. [1] failed to detect the consequences of the hidden outliers. This figure displays how the Park et al. [1] regression adjustment (models $4$ and $8$ in Supplementary Table $1$ in [1]) fails to control for the discontinuous effect of zero references and that randomly rewired citation networks exhibit a similar temporal decline of $\mathrm{CD}_5$. Results are shown for PatentsView (a, c, e; $n=2,926,923$ patents) using the original Park et al. [1] data and SciSciNet \cite{lin2023sciscinet} (b, d, f; $n=39,888,199$ papers), replicating their Web of Science analysis. Shaded bands correspond to $95\%$ confidence intervals. a, The distribution of the $\mathrm{CD}_5$ per number of references is shown via letter-value plots which first identify the median, then extend boxes outward, each covering half of the remaining data \cite{hofmann2017value}. Notably, in the case of zero references, the CD index is either one or remains undefined, causing a discontinuity. The marginal effect of references on $\mathrm{CD}_5$ shows that the regression adjustment of Park et al. [1] fails to account for this discontinuity. c, The root mean squared errors (RMSE) show a pattern between the Park et al. [1] regression residuals and the number of references, showing that the model does not properly control for the discontinuous effect of zero references. Adding a dummy variable for zero references substantially improves the model fit as depicted by the adjusted $\mathrm{R}^2$, while a similar effect is not found for other reference dummy variables. e, The average $\mathrm{CD}_5$ of the rewired patent networks (mean over ten runs) mirrors the decline of the observed network over time. This close similarity is the result of the one-to-one correspondence between zero reference patents within the observed and simulated networks, as evidenced by the peak at one in the histogram of the rewired $\mathrm{CD}_5$ shown in the inset plot. Finally, note that the gap between the observed $\mathrm{CD}_5$ values and those from the simulated networks is becoming smaller over time, which implies that the decline in the $z$ score found by Park et al. [1] and shown in the inset is the result of a decreasing standard deviation. b, d, f, The analogous, replicated plots for SciSciNet.
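The one-to-one correspondence of zero-reference entries under rewiring can be illustrated with a deliberately simplified sketch. It shuffles the cited endpoints of all citation edges, which keeps every node's number of references and number of received citations fixed. This is an assumption-laden stand-in, not the rewiring procedure of Park et al. [1]: it ignores publication years and may produce self-citations or duplicate edges. The names `rewire` and `out_degree` are hypothetical.

```python
import random
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Edge = Tuple[str, str]  # (citing work, cited work)


def rewire(edges: List[Edge], seed: int = 0) -> List[Edge]:
    """Shuffle cited endpoints; in- and out-degrees are preserved."""
    rng = random.Random(seed)
    sources = [citing for citing, _ in edges]
    targets = [cited for _, cited in edges]
    rng.shuffle(targets)
    return list(zip(sources, targets))


def out_degree(edges: Iterable[Edge], nodes: Iterable[str]) -> Dict[str, int]:
    """Number of references made by each node (zero if it cites nothing)."""
    counts = Counter(citing for citing, _ in edges)
    return {node: counts.get(node, 0) for node in nodes}


# Toy network (synthetic): A and B make zero references.
nodes = ["A", "B", "C", "D", "E"]
observed = [("C", "A"), ("C", "B"), ("D", "A"), ("E", "C"), ("E", "D")]
simulated = rewire(observed, seed=1)

zero_ref_observed = {n for n, k in out_degree(observed, nodes).items() if k == 0}
zero_ref_simulated = {n for n, k in out_degree(simulated, nodes).items() if k == 0}
```

Because the set of zero-reference nodes is identical before and after shuffling, any such node that is cited keeps a CD index of exactly one in every simulated network, so the peak at one carries over unchanged.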
Figure 3: Distribution of the $\mathrm{CD}_5$ index with vs without the hidden outliers and its impact on the disruptiveness for the SciSciNet data source. This figure replicates the observation that papers with $\mathrm{CD}_5 = 1$ are driving the decline in disruptive science for the SciSciNet data source \cite{lin2023sciscinet} (with $39,888,199$ papers between $1944$ and $2011$), which originated from the Microsoft Academic Graph. a, The distribution of the $\mathrm{CD}_5$ index for SciSciNet, created using the binwidth parameter in seaborn 0.11.2. Here again, the largest data points are hidden. b, The correct histogram of the underlying dataset. A peak at $\mathrm{CD}_5=1$ is revealed, corresponding to $8,861,343$ additional papers. c, The time evolution of the average $\mathrm{CD}_5$ index. When dropping the outliers with $\mathrm{CD}_5=1$, the decline in disruptiveness is negated. Excluding papers with zero references impacts the data similarly (Fig. 2 and Extended Data Fig. 2). The shaded bands correspond to $95\%$ confidence intervals. Moreover, the curve with papers with $\mathrm{CD}_5 = 1$ omitted is the curve corresponding to the histogram (a).
Figure 4: Papers and patents with $\mathrm{CD}_5=1$ predominantly make zero references. This figure displays that most papers in the SciSciNet data source \cite{lin2023sciscinet} ($n=39,888,199$) and most patents in the PatentsView data source ($n=2,926,923$) with $\mathrm{CD}_5=1$ have zero references. a, Our analysis shows that PatentsView contains $142,362$ patents with $\mathrm{CD}_5=1$ between $1980$ and $2010$, of which $78\%$ appear in the database with zero references. b, Within the category of patents with $\mathrm{CD}_5 = 1$, the relative frequency of patents with zero references is stable between $1980$ and $2010$. c, The relative frequency of patents with $\mathrm{CD}_5$ index exactly equal to one and zero references is decreasing over time. Therefore, a substantial part of the reported decline in the disruptiveness of technological knowledge over time can be attributed to a relatively increasing metadata quality over time. It is also intriguing to note how well the shape of this curve resembles the shape of the top curve shown in Fig. 1f. d, SciSciNet \cite{lin2023sciscinet} shows a similar behaviour with $8,861,343$ papers having $\mathrm{CD}_5=1$ between $1944$ and $2011$, of which $97\%$ appear in the database with zero references. e, Within the category of papers with $\mathrm{CD}_5=1$, the relative frequency of papers with zero references is stable between $1944$ and $2011$. f, The relative frequency of papers with $\mathrm{CD}_5$ index exactly equal to one and zero references is decreasing over time. Therefore, a substantial part of the observed decline in the disruptiveness of scientific knowledge over time can be attributed to a relatively increasing metadata quality over time. It is also intriguing to note how well the shape of this curve resembles the shape of the top curve shown in Extended Data Fig. 1c.
Figure 5: Across various data sources and within different categories, papers and patents with $\mathrm{CD}_5=1$ are driving the decline in the disruptiveness in scientific and technological knowledge over time. This figure displays the average $\mathrm{CD}_5$ index over time for six data sources and five different patent categories. The data sources are JSTOR ($1,588,088$ papers), the American Physical Society corpus ($461,359$ papers), Microsoft Academic Graph (random sample of $1,000,000$ papers), and PubMed ($1,563,211$ papers). For reference, the Web of Science ($22,479,429$ papers) and PatentsView ($2,926,923$ patents) data sources are also included. The patent categories are Chemical ($517,964$ patents), Computers and communications ($748,849$ patents), Drugs and medical ($321,449$ patents), Electrical and electronic ($734,769$ patents) and Mechanical ($603,892$ patents). Shaded bands correspond to $95\%$ confidence intervals. a, The temporal evolution of the average $\mathrm{CD}_5$ index for different data sources as presented in Park et al. [1] (Extended Data Fig. 6 in [1]). b, The time evolution of the average $\mathrm{CD}_5$ index for different data sources after removing the outliers with $\mathrm{CD}_5 = 1$ from the data sources. For all mentioned data sources that encompass papers, the decline in the disruptiveness almost completely disappears. For the PatentsView data source, the decline in the disruptiveness also reduces notably. c, The time evolution of the average $\mathrm{CD}_5$ index for different patent categories as presented in Park et al. [1] (Fig. 2b in [1]). d, The time evolution of the average $\mathrm{CD}_5$ index for different patent categories after removing the outliers with $\mathrm{CD}_5 = 1$ from the categories. We see that the decline in disruptiveness reduces similarly across all five categories.
...and 8 more figures
