Emergent evaluation hubs in a decentralizing large language model ecosystem

Manuel Cebrian, Tomomi Kito, Raul Castro Fernandez

TL;DR

The paper investigates how the expanding foundation-model ecosystem co-evolves with evaluative benchmarks, using the Stanford Foundation-Model Ecosystem Graph and Evidently AI benchmark registry from 2019–2025. It employs network analysis and an agent-based model to reveal a decentered, multi-origin model production landscape alongside a highly centralized benchmark authority, with the top actors providing shared reference points for evaluation. A key finding is that increasing the rate of novel benchmark entry reduces concentration, while penalties for re-using benchmarks have limited impact, highlighting a trade-off between coordination and path dependence. These insights inform governance and transparency efforts, suggesting that a broader, auditable suite of benchmarks can improve coverage without sacrificing the standardization benefits that centralized benchmarks provide.

Abstract

Large language models are proliferating, and so are the benchmarks that serve as their common yardsticks. We ask how the agglomeration patterns of these two layers compare: do they evolve in tandem or diverge? Drawing on two curated proxies for the ecosystem, the Stanford Foundation-Model Ecosystem Graph and the Evidently AI benchmark registry, we find complementary but contrasting dynamics. Model creation has broadened across countries and organizations and diversified in modality, licensing, and access. Benchmark influence, by contrast, displays centralizing patterns: in the inferred benchmark-author-institution network, the top 15% of nodes account for over 80% of high-betweenness paths, three countries produce 83% of benchmark outputs, and the global Gini for inferred benchmark authority reaches 0.89. An agent-based simulation highlights three mechanisms: higher entry of new benchmarks reduces concentration; rapid inflows can temporarily complicate coordination in evaluation; and stronger penalties against over-fitting have limited effect. Taken together, these results suggest that concentrated benchmark influence functions as coordination infrastructure that supports standardization, comparability, and reproducibility amid rising heterogeneity in model production, while also introducing trade-offs such as path dependence, selective visibility, and diminishing discriminative power as leaderboards saturate.
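The concentration figures quoted in the abstract (a Gini of 0.89, a small share of nodes carrying most of the influence) can be made concrete with a short sketch. The code below is only an illustration under stated assumptions: the function names `gini` and `top_share` and the toy score lists are hypothetical, and the paper's actual authority measure and Gini estimator are not specified in this summary. It uses the standard sorted-rank formula for the Gini coefficient of non-negative values.

```python
import math


def gini(values):
    """Gini coefficient of non-negative values (0 = equal, near 1 = concentrated).

    Uses the sorted-rank formula
        G = 2 * sum_i(i * x_i) / (n * sum_i(x_i)) - (n + 1) / n,
    with ranks i = 1..n over values sorted in ascending order.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0  # convention chosen here for empty or all-zero input
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


def top_share(values, fraction):
    """Share of the total held by the top `fraction` of items (rounded up)."""
    xs = sorted(values, reverse=True)
    total = sum(xs)
    if not xs or total == 0:
        return 0.0
    k = max(1, math.ceil(fraction * len(xs)))
    return sum(xs[:k]) / total


# Hypothetical authority scores, for illustration only (not data from the paper).
equal_scores = [1, 1, 1, 1]
skewed_scores = [0, 0, 0, 10]

print(gini(equal_scores))              # perfectly even distribution
print(gini(skewed_scores))             # one node holds everything
print(top_share(skewed_scores, 0.25))  # share held by the top quarter of nodes
```

With a finite sample of n items, this estimator tops out at (n - 1)/n rather than 1, so values computed on small networks are not directly comparable with those from large ones.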

Paper Structure

This paper contains 9 sections, 5 equations, 11 figures, 3 tables.

Figures (11)

  • Figure 1: Growth of the foundation-model ecosystem. (a) Annual and cumulative model releases, 2019–early 2025 (2025 is partial-year). (b) Reported parameter counts (log scale), 2019–2025. (c) New and cumulative manufacturers per year; over 160 organizations by early 2025.
  • Figure 2: Documentation and access trends, 2019–2025. (a) Fraction of new models disclosing training emissions (blue), training time (red), training hardware (green), structured model cards (purple), and explicit parameter counts (orange). All metrics peak in 2020 then decline; size reporting remains at $\approx 60\%$ by 2024. (b) Access conditions for foundation models, 2019–2025. Top panel: License status of newly released models, binned as permissive open source (green), partially open or “community” licenses such as LLaMA 2 (blue), fully closed licenses (red), and cases where the license is not disclosed (gray). Bottom panel: Availability of pre-trained weights, recorded as openly downloadable (green), gated or paywalled (red), or unspecified (gray). The share of fully open licenses and weights plummets after 2019, bottoms out in 2021, and then recovers only partially—never exceeding 45–50% of annual releases. Closed or ambiguous terms remain common, indicating that rapid ecosystem growth has not been matched by equivalent gains in access transparency.
  • Figure 3: Corporate footprint and concentration patterns, 2019–2025. (a) Shifting corporate footprint: stacked bars (left axis) show, by release year, the number of producing organisations classified as start-up (green), medium-sized (blue), large (red), or unknown (gray); superimposed lines (right axis) plot private (purple) and publicly traded (orange) entrants. Pre-2021 activity is negligible and driven by large public firms. The ecosystem broadens in 2022 and peaks in 2023 with over 180 distinct companies ($\approx 30$ start-ups). In 2024 total producers dip modestly while private entrants keep rising and public-company entries fall sharply. (2025 is partial-year.) (b) Concentration and long tail: treemap area (and color shading) is proportional to the number of distinct models per aggregated organisation. “Others” groups 203 models across more than 100 smaller actors.
  • Figure 4: We applied PCA to eight yearly $z$-scored metrics; the top two components explain about 81% of the variance. PC1 reflects overall growth—more models, larger size, and more manufacturers and countries. PC2 reflects openness, with higher values for more modalities and lower ones for poor documentation and closed weights.
  • Figure 5: Benchmark-ecosystem growth, 2016–2025. (a) Annual releases and cumulative stock surpass 100 benchmark suites by 2024. (b) Benchmark categories widen from one in 2016 to fourteen by 2024, with five new types in 2023 alone. (c) Citations top 75,000 in 2024, spiking around landmark suites in 2018, 2021, and 2023. (d) Author participation accelerates after 2020—both total and unique contributors—pushing the cumulative author pool sharply upward.
  • ...and 6 more figures