Anatomy of a Machine Learning Ecosystem: 2 Million Models on Hugging Face

Benjamin Laufer, Hamidah Oderinwale, Jon Kleinberg

TL;DR

The paper analyzes 1.86 million Hugging Face models as a large open ecosystem, modeling derivative relationships as family trees to study how traits such as licenses, languages, and tasks mutate and propagate. It employs an ecological/genetic framework, using metadata and model cards as semantic DNA and applying TF-IDF/BoW and Levenshtein measures to quantify similarity across related models. Key findings include fast, directed mutations, a surprising pattern where siblings are more similar than parent–child pairs, and systematic drifts toward permissive licenses and English-language support, alongside leaner, more automated documentation. The work provides an empirical baseline for understanding model fine-tuning dynamics, highlights environmental pressures shaping the ecosystem, and proposes that ecological methods can yield novel insights into AI diffusion and governance.

Abstract

Many have observed that the development and deployment of generative machine learning (ML) and artificial intelligence (AI) models follow a distinctive pattern in which pre-trained models are adapted and fine-tuned for specific downstream tasks. However, there is limited empirical work that examines the structure of these interactions. This paper analyzes 1.86 million models on Hugging Face, a leading peer production platform for model development. Our study of model family trees -- networks that connect fine-tuned models to their base or parent -- reveals sprawling fine-tuning lineages that vary widely in size and structure. Using an evolutionary biology lens to study ML models, we use model metadata and model cards to measure the genetic similarity and mutation of traits over model families. We find that models tend to exhibit a family resemblance, meaning their genetic markers and traits exhibit more overlap when they belong to the same model family. However, these similarities depart in certain ways from standard models of asexual reproduction, because mutations are fast and directed, such that two 'sibling' models tend to exhibit more similarity than parent/child pairs. Further analysis of the directional drifts of these mutations reveals qualitative insights about the open machine learning ecosystem: licenses counter-intuitively drift from restrictive, commercial licenses towards permissive or copyleft licenses, often in violation of upstream licenses' terms; models evolve from multilingual compatibility towards English-only compatibility; and model cards reduce in length and standardize by turning, more often, to templates and automatically generated text. Overall, this work takes a step toward an empirically grounded understanding of model fine-tuning and suggests that ecological models and methods can yield novel scientific insights.

Paper Structure

This paper contains 22 sections, 19 figures, 1 table.

Figures (19)

  • Figure 1: Family trees from the ecosystem dataset. Edges represent different forms of derivation: models documented as finetunes, quantizations, adapters, or merges of existing models. Diffusion patterns reveal large broadcasts and numerous generations of derivatives. Graphs without merges are trees, meaning no model has more than one parent (upper left, upper right, and lower left). All graphs are directed and acyclic.
  • Figure 2: Top ten most frequent licenses, tasks, languages, and libraries (top row). Top ten models ranked by number of children, datasets, arXiv categories of linked papers, and downloaded models (bottom row).
  • Figure 4: The diff between two sequences of model metadata. We measure the overall mutation rate and genetic similarity by tracking rates of overlap and departure between these sequences. The metadata sequence depicted on top is that of Qwen/Qwen1.5-72B, the base model depicted in the growth-over-time figure (label fig:growth-over-time); the bottom sequence is one of its finetunes. Additions are shown in green, deletions in red, and substitutions in yellow. This figure depicts character-level mutations corresponding most closely to the Levenshtein distance. We additionally measure and report similarity on term-level representations (using bag-of-words and TF-IDF), which we believe better captures categorical shifts in metadata.
  • Figure 5: Cosine similarity between TF-IDF embedding vectors, trained on terms appearing in the model metadata for all models in our dataset. Here, we sample finetunes meeting specific family structures. We enumerate all possible sub-trees of size 2 (B), 3 (C), and 4 (D), and enumerate all possible pairs of nodes within these sub-trees. When we compare these genetic similarities to the baseline of the similarity between any two nodes in the graph (A), we find that all observed family ties strongly predict attribute similarity. Similarities between pairs of models suggest that models are more related when they reside at similar depths and when they are topologically close in distance.
  • Figure 6: We observe that siblings exhibit greater similarity in traits than parent-child pairs. This implies not only that there is a high rate of mutation, but that mutations are sufficiently directed.
  • ...and 14 more figures
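
The sibling versus parent-child comparison described in Figures 5 and 6 can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `parent_of` mapping, the model names, and the tag sets are hypothetical toy data, and Jaccard overlap on tag sets is swapped in as a simple stand-in for the TF-IDF cosine similarity used in the paper.

```python
from itertools import combinations
from statistics import mean


def family_pairs(parent_of):
    """Return (parent_child_pairs, sibling_pairs) from a child -> parent mapping.

    Assumes every model has at most one parent (a tree, i.e. no merges).
    """
    children = {}
    for child, parent in parent_of.items():
        children.setdefault(parent, []).append(child)
    parent_child = [(p, c) for p, kids in children.items() for c in kids]
    siblings = [
        pair
        for kids in children.values()
        for pair in combinations(sorted(kids), 2)
    ]
    return parent_child, siblings


def jaccard(a, b):
    """Jaccard overlap of two tag collections (stand-in similarity measure)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_similarity(pairs, tags, similarity=jaccard):
    """Average similarity over model pairs; raises StatisticsError if pairs is empty."""
    return mean(similarity(tags[x], tags[y]) for x, y in pairs)


# Hypothetical toy family: one base model, two finetunes, one second-generation model.
parent_of = {"ft-a": "base", "ft-b": "base", "ft-a-q": "ft-a"}
tags = {
    "base": {"license:other", "lang:en", "lang:zh"},
    "ft-a": {"license:apache-2.0", "lang:en"},
    "ft-b": {"license:apache-2.0", "lang:en"},
    "ft-a-q": {"license:apache-2.0", "lang:en", "gguf"},
}

parent_child, siblings = family_pairs(parent_of)
print("parent-child:", mean_similarity(parent_child, tags))
print("siblings:    ", mean_similarity(siblings, tags))
```

Passing a different `similarity` callable lets the same pair enumeration be reused with a term-level or character-level measure.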

Theorems & Definitions (4)

  • Definition 6.1: Cosine similarity in term frequency
  • Definition 6.2: Cosine similarity in term frequency-inverse document frequency
  • Definition 6.3: Normalized Levenshtein Similarity
  • Definition 7.1: Mutation rate over traits T
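
The definitions themselves are not reproduced in this outline, so the following is only a sketch of common textbook forms of the four quantities named above. The whitespace tokenizer, the smoothed IDF weighting, the normalization by the longer string, and the reading of mutation rate as the fraction of traits in T whose value differs between parent and child are all assumptions and may differ from the paper's exact definitions.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption, not the paper's tokenizer)."""
    return text.lower().split()


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def tf_cosine(a, b):
    """Cosine similarity on raw term-frequency (bag-of-words) vectors."""
    return cosine(Counter(tokenize(a)), Counter(tokenize(b)))


def tfidf_vectors(docs):
    """TF-IDF vectors for a corpus, using a smoothed IDF (one common variant)."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(t for toks in tokenized for t in set(toks))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [
        {t: tf * idf[t] for t, tf in Counter(toks).items()}
        for toks in tokenized
    ]


def levenshtein(a, b):
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalized_levenshtein_similarity(a, b):
    """1 - distance / max length, so identical strings score 1.0."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def mutation_rate(parent, child, traits):
    """Fraction of the given traits whose value differs between parent and child."""
    if not traits:
        return 0.0
    changed = sum(1 for t in traits if parent.get(t) != child.get(t))
    return changed / len(traits)
```

Term-level measures (`tf_cosine`, `tfidf_vectors` with `cosine`) treat metadata as a set of categorical terms, while the Levenshtein variant is sensitive to character-level edits; this matches the distinction drawn in the Figure 4 caption.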