LLM DNA: Tracing Model Evolution via Functional Representations

Zhaomin Wu, Haodong Zhao, Ziyang Wang, Jizhou Guo, Qian Wang, Bingsheng He

TL;DR

The paper formalizes LLM DNA as a low-dimensional, intrinsic representation of an LLM's functional behavior, achieved via a bi-Lipschitz embedding from the LLM function space to a DNA space, with existence guaranteed by the Johnson-Lindenstrauss lemma. It provides RepTrace, a training-free pipeline that uses semantic embeddings and random projections to extract DNA from diverse LLMs, enabling scalable lineage analysis. Empirical results across 305 models show DNAs can detect known and undocumented relationships, enable training-free model routing, and yield phylogenetic trees that reflect architectural shifts and temporal evolution while exposing varying evolutionary speeds. The work offers a data-driven tool for model governance, provenance, and risk mitigation, along with clear directions for robustness, bias, and security considerations.

Abstract

The explosive growth of large language models (LLMs) has created a vast but opaque landscape: millions of models exist, yet their evolutionary relationships through fine-tuning, distillation, or adaptation are often undocumented or unclear, complicating LLM management. Existing methods are limited by task specificity, fixed model sets, or strict assumptions about tokenizers or architectures. Inspired by biological DNA, we address these limitations by mathematically defining LLM DNA as a low-dimensional, bi-Lipschitz representation of functional behavior. We prove that LLM DNA satisfies inheritance and genetic determinism properties and establish the existence of DNA. Building on this theory, we derive a general, scalable, training-free pipeline for DNA extraction. In experiments across 305 LLMs, DNA aligns with prior studies on limited subsets and achieves superior or competitive performance on specific tasks. Beyond these tasks, DNA comparisons uncover previously undocumented relationships among LLMs. We further construct the evolutionary tree of LLMs using phylogenetic algorithms, which align with shifts from encoder-decoder to decoder-only architectures, reflect temporal progression, and reveal distinct evolutionary speeds across LLM families.
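The extraction idea described above (embed each model's responses to a shared prompt set, then reduce with a random projection) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `toy_embed` is a hash-based stand-in for a real semantic embedding model and carries no semantics, and the dimensions, seed, and sample responses are arbitrary. It is not the paper's implementation.

```python
import hashlib
import math
import random


def toy_embed(text, dim=16):
    """Hypothetical stand-in for a semantic embedding model (hash-based, not semantic)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()  # 32 bytes
    # Map the first `dim` bytes to floats in [-1, 1].
    return [(b / 127.5) - 1.0 for b in digest[:dim]]


def make_projection(in_dim, out_dim, seed=0):
    """Gaussian random projection matrix with entries scaled by 1/sqrt(out_dim)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(out_dim)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(in_dim)]
            for _ in range(out_dim)]


def extract_dna(responses, projection, dim=16):
    """Concatenate the response embeddings, then project them into the DNA space."""
    features = [x for r in responses for x in toy_embed(r, dim)]
    return [sum(w * x for w, x in zip(row, features)) for row in projection]


def dna_distance(a, b):
    """Euclidean distance between two DNA vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy responses of three hypothetical models to the same two prompts.
responses = {
    "base": ["Paris is the capital of France.", "2 + 2 = 4"],
    "finetuned": ["Paris is the capital of France.", "The answer is 4."],
    "unrelated": ["I cannot answer that.", "Four, probably."],
}

# All models share one projection so their DNAs live in the same space.
projection = make_projection(in_dim=2 * 16, out_dim=8, seed=0)
dna = {name: extract_dna(r, projection) for name, r in responses.items()}

print(dna_distance(dna["base"], dna["finetuned"]))
print(dna_distance(dna["base"], dna["unrelated"]))
```

The one design point worth noting is that every model is projected with the same matrix and queried with the same prompts; otherwise distances between DNA vectors would not be comparable.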

Paper Structure

This paper contains 47 sections, 9 theorems, 19 equations, 10 figures, 11 tables, 1 algorithm.

Key Result

Theorem 3.3

For any finite set of $K$ LLMs $\mathcal{F}_K=\{f_1, \dots, f_K\} \subset \mathcal{F}$, a DNA representation (Definition 3.2) with DNA space $\mathcal{D} = \mathbb{R}^L$ exists, satisfying for all $f_i, f_j\in\mathcal{F}_K$, $c_1\cdot d_H(f_i, f_j) \le d_{\tau}(\tau_{f_i}, \tau_{f_j})$ …
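For reference, the TL;DR attributes this existence result to the Johnson-Lindenstrauss lemma. Its standard Euclidean form is sketched below; the symbols $N$, $\varepsilon$, $C$, and $A$ are introduced here for illustration and are not taken from the paper, and how the paper's constants relate to $\varepsilon$ is not stated in this summary. For any $0 < \varepsilon < 1$ and any set $X$ of $K$ points in $\mathbb{R}^N$, there is an absolute constant $C$ such that whenever $L \ge C\,\varepsilon^{-2}\log K$ there exists a linear map $A:\mathbb{R}^N \to \mathbb{R}^L$ with

$$
(1-\varepsilon)\,\lVert u - v\rVert_2^2 \;\le\; \lVert Au - Av\rVert_2^2 \;\le\; (1+\varepsilon)\,\lVert u - v\rVert_2^2 \qquad \text{for all } u, v \in X .
$$

Taking square roots gives a two-sided bound of the form $\sqrt{1-\varepsilon}\,\lVert u - v\rVert_2 \le \lVert Au - Av\rVert_2 \le \sqrt{1+\varepsilon}\,\lVert u - v\rVert_2$, which is a bi-Lipschitz condition on the finite set $X$.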

Figures (10)

  • Figure 1: Visualization of RepTrace: LLM DNA extraction workflow
  • Figure 2: DNA distribution of LLMs evaluated by [zhuindependence]. "Independent" and "Correlated" relative to Llama-2-7B-hf are based on public documents. The boundary is computed by an SVM with RBF kernel, indicating that the DNAs of "Independent" and "Correlated" models are clearly separated.
  • Figure 3: Mantel test between DNA extracted from two disjoint datasets. Each point represents a single pair of models, plotted by their distance in the first dataset versus their distance in the second, showing a strong correlation (Pearson $r = 0.7797$) and high statistical significance ($p < 0.0001$).
  • Figure 4: Visualization of DNAs by t-SNE. Colors denote organizations releasing LLMs. Organizations with fewer than five LLMs are collapsed into "Others". Background regions are obtained by localized DBSCAN started where each organization forms a group of more than three models.
  • Figure 5: Shift of DNA values when (full-parameter) fine-tuning Llama3-8B-Instruct with OpenMathInstruct-2 subsets of size $n$
  • ...and 5 more figures
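
The Mantel test named in the Figure 3 caption compares two distance matrices over the same set of items. A minimal sketch of the usual permutation form follows, assuming Pearson correlation over the upper triangles and a one-sided p-value; the permutation count, the seed, and the toy matrices are illustrative choices, not the paper's settings.

```python
import random
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def upper(dist, order):
    """Upper-triangle entries of `dist` after relabelling items by `order`."""
    return [dist[order[i]][order[j]]
            for i, j in combinations(range(len(order)), 2)]


def mantel(d1, d2, permutations=999, seed=0):
    """Return (observed r, one-sided permutation p-value) for two distance matrices."""
    rng = random.Random(seed)
    identity = list(range(len(d1)))
    base = upper(d1, identity)
    observed = pearson(base, upper(d2, identity))
    hits = 0
    for _ in range(permutations):
        perm = identity[:]
        rng.shuffle(perm)  # permute rows and columns of d2 together
        if pearson(base, upper(d2, perm)) >= observed:
            hits += 1
    # Count the observed statistic itself among the permutations.
    return observed, (hits + 1) / (permutations + 1)


# Toy data: six hypothetical models placed on a line, measured twice.
positions = [0, 1, 3, 6, 10, 15]
d1 = [[abs(a - b) for b in positions] for a in positions]
d2 = [[2.0 * v for v in row] for row in d1]  # second "dataset": same geometry, rescaled

r, p = mantel(d1, d2)
print(r, p)
```

Permuting row and column labels jointly, rather than shuffling individual entries, is what keeps each permuted matrix a valid distance matrix.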

Theorems & Definitions (19)

  • Definition 3.1: Large Language Model
  • Definition 3.2: DNA of an LLM
  • Theorem 3.3: Existence of LLM DNA
  • Corollary 3.3: Construct LLM DNA via Random Projection
  • Definition 4.1: Stochastic Functional Distance
  • Lemma 4.1: Concentration of Empirical Functional Distance
  • Definition A.1: Evolution
  • Theorem A.2: Inheritance
  • proof
  • Theorem A.3: Genetic Determinism
  • ...and 9 more