Founder effects shape the evolutionary dynamics of multimodality in open LLM families

Manuel Cebrian

Abstract

Large language model (LLM) families are improving rapidly, yet it remains unclear how quickly multimodal capabilities emerge and propagate within open families. Using the ModelBiome AI Ecosystem dataset of Hugging Face model metadata and recorded lineage fields (>1.8x10^6 model entries), we quantify multimodality over time and along recorded parent-to-child relations. Cross-modal tasks are widespread in the broader ecosystem well before they become common within major open LLM families: within these families, multimodality remains rare through 2023 and most of 2024, then increases sharply in 2024-2025 and is dominated by image-text vision-language tasks. Across major families, the first vision-language model (VLM) variants typically appear months after the first text-generation releases, with lags ranging from ~1 month (Gemma) to more than a year for several families and ~26 months for GLM. Lineage-conditioned transition rates show weak cross-type transfer: among fine-tuning edges from text-generation parents, only 0.218% yield VLM descendants. Instead, multimodality expands primarily within existing VLM lineages: 94.5% of VLM-child fine-tuning edges originate from VLM parents, versus 4.7% from text-generation parents. At the model level, most VLM releases appear as new roots without recorded parents (~60%), while the remainder are predominantly VLM-derived; founder concentration analyses indicate rapid within-lineage amplification followed by diversification. Together, these results show that multimodality enters open LLM families through rare founder events and then expands rapidly within their descendant lineages, producing punctuated adoption dynamics that likely induce distinct, transfer-limited scaling behavior for multimodal capabilities.

Paper Structure (7 sections, 1 equation, 4 figures, 2 tables)

Figures (4)

  • Figure 1: Multimodality appears earlier in the broader ecosystem than within major open LLM families. Main panel: for each month, the share of newly created checkpoints in major open LLM families tagged with any cross-modal task (text paired with image/audio/video; blue) and the share tagged specifically with image-text vision-language tasks (orange). Inset: corresponding ecosystem-wide reference series for all task-tagged Hugging Face models (dashed) and for transformers models excluding diffusion-oriented pipelines (dotted). LLM families are identified by name-based model_id patterns within transformers (excluding diffusers); months with low volume are omitted (e.g., $n<300$). Shaded bands show 95% Wilson score confidence intervals.
  • Figure 2: Task transitions via fine-tuning edges. Heatmap of parent$\rightarrow$child task-tag transitions along recorded finetune_parent relations (log scale; $\log_{10}(n_{\mathrm{edges}}+1)$). Rows correspond to the parent model’s pipeline_tag, and columns to the child model’s pipeline_tag. The pronounced diagonal structure indicates that fine-tuning is predominantly task-preserving, with especially strong text-generation$\rightarrow$text-generation continuity. Off-diagonal entries are comparatively sparse, revealing that cross-task transitions are rare. Short axis labels denote Hugging Face pipeline_tag abbreviations: txt-gen (text-generation), txt-cls (text-classification), tok-cls (token-classification), ASR (automatic-speech-recognition), QA (question-answering), summ (summarization), trans (translation), feat (feature-extraction), mask (fill-mask), img$\rightarrow$txt (image-to-text), img+txt$\rightarrow$txt (image-text-to-text), and VQA (visual-question-answering).
  • Figure 3: Asymmetric dynamics under fine-tuning: rare text$\rightarrow$VLM emergence but high VLM$\rightarrow$VLM retention. Left: Monthly estimates of $P(\mathrm{child\ is\ VLM}\mid \mathrm{parent=text\hbox{-}generation},\,\mathrm{relation=finetune})$, computed over recorded fine-tuning edges and binned by the child model’s createdAt month. Right: Monthly estimates of $P(\mathrm{child\ is\ VLM}\mid \mathrm{parent=VLM},\,\mathrm{relation=finetune})$ (VLM retention), binned by child createdAt month. Shaded bands denote 95% Wilson score confidence intervals for binomial proportions. Text-to-VLM transitions remain near zero with only transient increases, whereas fine-tuning from VLM parents typically preserves VLM status, indicating strong path dependence in modality along lineage edges.
  • Figure 4: Founder-driven expansion within VLM lineages. (A) Backbone lag to first VLM within major open LLM families, measured as months between the first text-generation release in the family and the first VLM-tagged release. (B) Model-level lineage channels for VLM releases: "root" models have no recorded parent; remaining VLMs are grouped by whether a recorded parent is VLM-, text-, or other-task-tagged (unresolved-parent cases shown separately). (C) Concentration of VLM$\rightarrow$VLM fine-tuning descent: for each child createdAt month, the share of VLM$\rightarrow$VLM fine-tune edges attributable to the single most prolific parent (top-1) and the three most prolific parents (top-3). (D) Founder diversity over time, measured as the effective number of parent checkpoints $N_{\mathrm{eff}}=1/\mathrm{HHI}$ computed from the monthly distribution of VLM$\rightarrow$VLM fine-tune parent IDs. All panels use the ModelBiome AI Ecosystem dataset (July 2025 snapshot).
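
The summary statistics named in the captions above (Wilson score intervals in Figures 1 and 3, top-k parent share and $N_{\mathrm{eff}}=1/\mathrm{HHI}$ in Figure 4) can be illustrated with a minimal Python sketch. This is not the paper's code: the function names, the z = 1.96 approximation for a 95% interval, and the toy parent IDs are all assumptions made for illustration, and HHI is taken here in its standard form as the sum of squared parent shares.

```python
import math
from collections import Counter


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    z=1.96 approximates a 95% interval (assumed; the source only says "95%").
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1.0 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, centre - half), min(1.0, centre + half)


def top_k_share(parent_ids, k):
    """Share of edges attributable to the k most prolific parents."""
    counts = Counter(parent_ids)
    total = sum(counts.values())
    return sum(c for _, c in counts.most_common(k)) / total


def effective_number(parent_ids):
    """N_eff = 1 / HHI, with HHI the sum of squared parent shares."""
    counts = Counter(parent_ids)
    total = sum(counts.values())
    hhi = sum((c / total) ** 2 for c in counts.values())
    return 1.0 / hhi


if __name__ == "__main__":
    # Hypothetical month of VLM->VLM fine-tune edges: one parent ID per edge.
    parents = ["A"] * 6 + ["B"] * 3 + ["C"] * 1
    print("top-1 share:", top_k_share(parents, 1))
    print("top-3 share:", top_k_share(parents, 3))
    print("N_eff:", effective_number(parents))
    # Hypothetical month with 2 VLM children among 1000 text-parent edges.
    print("Wilson 95% CI:", wilson_interval(2, 1000))
```

With ten edges split 6/3/1 across three parents, the effective number of parents is about 2.2 rather than 3, which is the sense in which $N_{\mathrm{eff}}$ discounts rarely used parents. The Wilson interval stays inside [0, 1] and is well defined even when the observed proportion is zero, which suits near-zero rates such as the text-to-VLM transition share.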