Towards Semantic Versioning of Open Pre-trained Language Model Releases on Hugging Face

Adekunle Ajibode, Abdul Ali Bangash, Filipe Roseiro Cogo, Bram Adams, Ahmed E. Hassan

TL;DR

The paper investigates how open PTLM releases on Hugging Face are named, versioned, and documented, revealing pervasive inconsistencies and a lack of semantic versioning. Using a mixed-methods approach, it analyzes $52{,}227$ PTLMs, identifies $148$ naming conventions across $12$ segment-types, and shows that only a minority of releases explicitly signal variant-type or training-dataset provenance. It demonstrates widespread implicit versioning in model binaries and extensive gaps in model cards and dataset metadata, arguing for a multidimensional, provenance-rich semantic versioning framework for PTLMs. The work outlines concrete recommendations for standardizing naming, enhancing metadata, and providing tools (e.g., version calculators, SBOM-like practices) to improve reproducibility, trust, and interoperability in the model registry ecosystem.

Abstract

The proliferation of open Pre-trained Language Models (PTLMs) on model registry platforms like Hugging Face (HF) presents both opportunities and challenges for companies building products around them. Similar to traditional software dependencies, PTLMs continue to evolve after a release. However, the current state of PTLM release practices on model registry platforms is plagued by a variety of inconsistencies, such as ambiguous naming conventions and inaccessible model training documentation. Given the knowledge gap on current PTLM release practices, our empirical study uses a mixed-methods approach to analyze the releases of 52,227 PTLMs on the most well-known model registry, HF. Our results reveal 148 different naming practices for PTLM releases, with 40.87% of changes to model weight files not represented in the adopted name-based versioning practice or their documentation. In addition, we identified that the 52,227 PTLMs are derived from only 299 different base models (the original models that were modified to create the 52,227 PTLMs), with Fine-tuning and Quantization being the most prevalent modification methods applied to these base models. Significant gaps in release transparency, in terms of training dataset specifications and model card availability, still exist, highlighting the need for standardized documentation. While we identified a model naming practice explicitly differentiating between major and minor PTLM releases, we did not find any significant difference in the types of changes that went into either type of release, suggesting that major/minor version numbers for PTLMs are often chosen arbitrarily. Our findings provide valuable insights to improve PTLM release practices, nudging the field towards more formal semantic versioning practices.

Paper Structure (39 sections, 14 figures, 7 tables)

This paper contains 39 sections, 14 figures, 7 tables.

Figures (14)

  • Figure 1: Four different examples of how the model modification methods are specified in the model names on the HF repository.
  • Figure 2: Five different examples of model naming practices on HF. Some models have 5 segments, while some have fewer than 2 segments. Each of these examples indicates different information in the names, such as version and size.
  • Figure 3: Data collection procedure
  • Figure 4: Visualization of labeled segments in 384 manually analyzed model names on HF. The model names were broken down into 928 segments, and each segment was labeled, resulting in 12 segment types. The label type (termed the "element" in this context) is plotted on the y-axis and its frequency of occurrence on the x-axis.
  • Figure 5: Visualization of labeled segments from 384 manually analyzed model names on HF. Each model name is composed of various segment types, as illustrated by the naming convention {base-model}{variant-type}{dataset}. The segment types found in the 148 naming conventions are plotted on the y-axis, and the number of times each segment type appears in these conventions is plotted on the x-axis.
  • ...and 9 more figures