GitHub Repository Complexity Leads to Diminished Web Archive Availability
David Calano, Michele C. Weigle, Michael L. Nelson
TL;DR
This paper evaluates how faithfully Web-hosted Git repositories are preserved in the Internet Archive's Wayback Machine, covering both presentational pages and archived source trees. By analyzing more than 12,000 repository home pages and their source trees with Memento Damage, the authors reveal substantial preservation gaps: more than 31% of the archived home pages examined show minor damage and 1.6% show major damage, while others lack mementos, and archived source trees are markedly incomplete, with fewer than 5% of source files archived on average. The study highlights the challenge posed by JavaScript-heavy loading and the structural depth of repositories, both of which hinder crawlers and make artifacts harder to reconstruct. The findings underscore the need for improved archival tooling, temporal coherence analyses, and potentially AI-assisted code synthesis to better preserve software for reproducibility, and they argue for diversifying archival sources to avoid single points of failure.
Abstract
Software is often developed using version control software, such as Git, and hosted on centralized Web hosts, such as GitHub and GitLab. These Web-hosted software repositories are made available to users in the form of traditional HTML Web pages for each source file and directory, as well as a presentational home page and various descriptive pages. We examined more than 12,000 Web-hosted Git repository project home pages, primarily from GitHub, to measure how well their presentational components are preserved in the Internet Archive, and we examined the source trees of the collected GitHub repositories to assess the extent to which their source code has been preserved. We found that more than 31% of the archived repository home pages examined exhibited some form of minor page damage and 1.6% exhibited major page damage. We also found that, of the source trees analyzed, less than 5% of their source files were archived, on average, with the majority of repositories not having any source files saved in the Internet Archive. The highest concentration of archived source files was among those linked directly from repositories' home pages, at a rate of 14.89% across all available repositories, with the rate dropping off sharply at deeper levels of a repository's directory tree.
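The per-depth archive rate described above can be illustrated with a minimal sketch. This is not the authors' tooling: the function name `archive_rate_by_depth`, the input shape (pairs of repository-relative path and an archived flag), the depth convention (depth 1 meaning a file linked directly from the home page), and the sample data are all assumptions made for illustration.

```python
from collections import defaultdict


def archive_rate_by_depth(files):
    """Return {depth: fraction of files archived} for one source tree.

    `files` is an iterable of (path, is_archived) pairs. Depth is an
    assumed convention: a top-level file such as "README.md" has depth 1,
    "src/main.py" has depth 2, and so on.
    """
    totals = defaultdict(int)
    archived = defaultdict(int)
    for path, is_archived in files:
        # Count path separators to derive how deep the file sits in the tree.
        depth = path.strip("/").count("/") + 1
        totals[depth] += 1
        if is_archived:
            archived[depth] += 1
    return {depth: archived[depth] / totals[depth] for depth in sorted(totals)}


# Hypothetical sample data, not taken from the study.
sample = [
    ("README.md", True),
    ("setup.py", False),
    ("src/main.py", False),
    ("src/util/io.py", False),
]
rates = archive_rate_by_depth(sample)
print(rates)
```

In practice, the archived flag for each file would come from checking a Web archive for a memento of that file's page; that lookup is deliberately left out of the sketch.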
