Cited But Not Archived: Analyzing the Status of Code References in Scholarly Articles

Emily Escamilla, Martin Klein, Talya Cooper, Vicky Rampin, Michele C. Weigle, Michael L. Nelson

TL;DR

This study addresses the risk of reference rot in scholarly code references hosted on Git Hosting Platforms (GHPs) by quantifying their preservation across the live Web, Software Heritage, and Web archives. Using 253,590 GHP URIs from 2.6 million arXiv and PMC articles, the authors test live availability, Software Heritage capture, and Web-archive mementos; categorize URIs as vulnerable, replicated, or unrecoverable; and analyze how long first captures lag behind publication. Key findings: 93.98% of URIs remain live, 68.39% are archived by Software Heritage, 81.43% have Web-archive copies, and 57.21% are preserved by both; 12.99% are not archived, and 32.36% of rotten URIs are unrecoverable. Coverage varies across GHPs, and there are notable delays between publication and capture. The work highlights gaps in archival coverage, shows that Web archives currently offer broader preservation than Software Heritage for scholarly code references, and calls for proactive code submission to archives to improve reproducibility and long-term access.

Abstract

One in five arXiv articles published in 2021 contained a URI to a Git Hosting Platform (GHP), which demonstrates the growing prevalence of GHP URIs in scholarly publications. However, GHP URIs are vulnerable to the same reference rot that plagues the Web at large. The disappearance of software hosting platforms, like Gitorious and Google Code, and the source code they contain threatens research reproducibility. Archiving the source code and development history available in GHPs enables the long-term reproducibility of research. Software Heritage and Web archives contain archives of GHP URI resources. However, are the GHP URIs referenced by scholarly publications contained within the Software Heritage and Web archive collections? We analyzed a dataset of GHP URIs extracted from scholarly publications to answer three questions: (1) is the URI still publicly available on the live Web, (2) has the URI been archived by Software Heritage, and (3) has the URI been archived by Web archives? Of all GHP URIs, we found that 93.98% were still publicly available on the live Web, 68.39% had been archived by Software Heritage, and 81.43% had been archived by Web archives.
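
The three checks described in the abstract can be pictured as one small record per URI. The following is a minimal sketch, not the authors' pipeline: the `UriStatus` class, the `describe` labels, and the `lookup_urls` helper are hypothetical names introduced here, and the two endpoint templates (the Software Heritage origin lookup and a MemGator TimeMap service) are assumptions about where such checks could be sent, since the summary does not say which services were queried. No network request is made.

```python
from dataclasses import dataclass

# Assumed lookup endpoints; the summary above does not name the services used.
SWH_ORIGIN_API = "https://archive.softwareheritage.org/api/1/origin/{uri}/get/"
TIMEMAP_API = "https://memgator.cs.odu.edu/timemap/json/{uri}"


@dataclass(frozen=True)
class UriStatus:
    """Outcome of the three tests for a single GHP URI (hypothetical record)."""
    live: bool                  # (1) still available on the live Web?
    in_software_heritage: bool  # (2) archived by Software Heritage?
    in_web_archives: bool       # (3) archived by Web archives?

    @property
    def archived(self) -> bool:
        # Archived anywhere: Software Heritage, Web archives, or both.
        return self.in_software_heritage or self.in_web_archives


def describe(status: UriStatus) -> str:
    """Plain-language label; these labels are illustrative, not the paper's terms."""
    if status.live:
        return "live and archived" if status.archived else "live but not archived"
    return "rotten but archived" if status.archived else "rotten and not archived"


def lookup_urls(uri: str) -> dict:
    """Build the (assumed) archive lookup URLs for a GHP URI without fetching them."""
    return {
        "software_heritage": SWH_ORIGIN_API.format(uri=uri),
        "web_archives": TIMEMAP_API.format(uri=uri),
    }
```

For example, `describe(UriStatus(live=False, in_software_heritage=False, in_web_archives=True))` labels a URI that has rotted on the live Web but can still be recovered from a Web archive.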

Paper Structure (6 sections, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Results of running the three tests: (1) is the URI active?, (2) has the URI been archived by Software Heritage?, and (3) has the URI been archived by Web archives?
  • Figure 2: Relationships between URIs that have been archived by Software Heritage (SWH) and Web archives, only Software Heritage, only Web archives, and neither Software Heritage nor Web archives
  • Figure 3: Relationships between rotten URIs that have been archived by Software Heritage (SWH) and Web archives, only Software Heritage, only Web archives, and neither Software Heritage nor Web archives
  • Figure 4: Months between a publication referencing a URI and the URI being captured by Software Heritage over time. Only includes URIs not captured by Software Heritage before the publication date of the referencing article.
  • Figure 5: Number of months between a publication referencing a URI and the URI being captured by the Web archives over time. Only includes URIs not captured by the Web archives before the publication date of the referencing article.
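
The delay plotted in Figures 4 and 5 can be sketched as a calendar-month difference between the publication date and the first capture. This is an illustrative assumption: the summary does not state how the authors counted months, and the function names below are hypothetical. URIs already captured before publication are excluded, matching the figure captions.

```python
from datetime import date
from typing import Optional


def months_between(published: date, first_capture: date) -> int:
    """Calendar-month difference (assumed definition), ignoring the day of month."""
    return (first_capture.year - published.year) * 12 + (
        first_capture.month - published.month
    )


def capture_delay_months(
    published: date, first_capture: Optional[date]
) -> Optional[int]:
    """Months from publication to first capture.

    Returns None when the URI was never captured, or when it was captured
    before the referencing article's publication date (excluded from the plot).
    """
    if first_capture is None or first_capture < published:
        return None
    return months_between(published, first_capture)
```

Under this definition, an article published on 2020-01-15 whose URI was first captured on 2020-04-02 has a delay of 3 months, while a URI first captured in 2019 is left out of the analysis.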