GitHub Repository Complexity Leads to Diminished Web Archive Availability

David Calano, Michele C. Weigle, Michael L. Nelson

TL;DR

This paper evaluates how faithfully Web-hosted Git repositories are preserved in the Internet Archive's Wayback Machine, focusing on both presentational pages and archived source trees. By analyzing over 12,000 repository home pages and their source trees with Memento Damage, the authors reveal substantial preservation gaps: more than 31% of archived home pages show minor damage and 1.6% show major damage, some home pages lack mementos altogether, and archived source trees are markedly incomplete, with an average of less than 5% of source files archived. The study highlights the challenge posed by JavaScript-heavy loading and the structural depth of repositories, which hinder crawlers and make archived artifacts harder to reconstruct. The findings underscore the need for improved archival tooling, temporal coherence analyses, and potentially AI-assisted code synthesis to better preserve software for reproducibility, while advocating diversification of archival sources to avoid single points of failure.

Abstract

Software is often developed using version control software, such as Git, and hosted on centralized Web hosts, such as GitHub and GitLab. These Web-hosted software repositories are made available to users in the form of traditional HTML Web pages for each source file and directory, as well as a presentational home page and various descriptive pages. We examined more than 12,000 Web-hosted Git repository project home pages, primarily from GitHub, to measure how well their presentational components are preserved in the Internet Archive, as well as the source trees of the collected GitHub repositories to assess the extent to which their source code has been preserved. We found that more than 31% of the archived repository home pages examined exhibited some form of minor page damage and 1.6% exhibited major page damage. We also found that, of the source trees analyzed, less than 5% of their source files were archived, on average, with the majority of repositories not having source files saved in the Internet Archive at all. The highest concentration of archived source files was among those linked directly from repositories' home pages, at a rate of 14.89% across all available repositories, with availability dropping off sharply at deeper levels of a repository's directory tree.
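The per-depth availability figures described in the abstract can be illustrated with a small sketch. This is not the authors' tooling: the function names, the example URLs, and the assumed GitHub URL layout (`/<owner>/<repo>/blob/<branch>/<path>`, with a branch name that contains no slash) are all hypothetical, and the set of archived URLs is assumed to have been obtained separately from a Web archive.

```python
from collections import defaultdict
from urllib.parse import urlparse


def tree_depth(url: str) -> int:
    """Depth of a source file below the repository root (root-level file = 1).

    Assumes the layout /<owner>/<repo>/blob/<branch>/<path...> and a branch
    name without slashes; both are illustrative assumptions.
    """
    parts = [p for p in urlparse(url).path.split("/") if p]
    return len(parts) - 4  # drop owner, repo, "blob", and branch


def archived_rate_by_depth(file_urls, archived_urls):
    """Return {depth: percentage of files at that depth found in the archive}."""
    archived = set(archived_urls)
    totals = defaultdict(int)
    hits = defaultdict(int)
    for url in file_urls:
        depth = tree_depth(url)
        totals[depth] += 1
        if url in archived:
            hits[depth] += 1
    return {d: 100.0 * hits[d] / totals[d] for d in sorted(totals)}


# Hypothetical repository with four source files, one of which is archived.
base = "https://github.com/example/project/blob/main/"
files = [base + p for p in ("README.md", "setup.py", "src/app.py", "src/util/io.py")]
archived = {base + "README.md"}

rates = archived_rate_by_depth(files, archived)
print(rates)  # {1: 50.0, 2: 0.0, 3: 0.0}
```

Grouping by depth rather than reporting a single repository-wide percentage makes the drop-off at deeper directory levels visible directly in the output.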

Paper Structure

This paper contains 8 sections, 10 figures, 1 table.

Figures (10)

  • Figure 1: Breakdown of the differences between the version control software Git and the Git hosting platform GitHub.
  • Figure 2: Screenshot of the GitHub home page for the PhantomJS repository showing the embedded representation of the project's top-level source code in blue and the presentational elements of the project generated by GitHub in green.
  • Figure 3: Cropped screenshots of directory and source file Web pages for a GitHub repository.
  • Figure 4: Highlights of missing image elements on two archived repository pages.
  • Figure 5: Archived GitLab repository page for WhisperFish project. Only placeholder UI elements were captured instead of the true source table content that is loaded via JavaScript.
  • ...and 5 more figures