LLM DNA: Tracing Model Evolution via Functional Representations
Zhaomin Wu, Haodong Zhao, Ziyang Wang, Jizhou Guo, Qian Wang, Bingsheng He
TL;DR
The paper formalizes LLM DNA as a low-dimensional, intrinsic representation of an LLM's functional behavior, obtained through a bi-Lipschitz embedding from the LLM function space into a DNA space; the Johnson-Lindenstrauss lemma guarantees that such an embedding exists. It introduces RepTrace, a training-free pipeline that uses semantic embeddings and random projections to extract DNA from diverse LLMs, enabling scalable lineage analysis. Empirical results across 305 models show that DNA can detect known and undocumented relationships, enable training-free model routing, and yield phylogenetic trees that reflect architectural shifts and temporal evolution while exposing differing evolutionary speeds. The work offers a data-driven tool for model governance, provenance, and risk mitigation, and identifies directions for future work on robustness, bias, and security.
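To make the extraction idea concrete, the following is a minimal sketch of a random-projection step of the kind described above, not the authors' RepTrace implementation. It assumes each model has already been reduced to a matrix of semantic embeddings of its responses to a shared prompt set; here those matrices are synthetic stand-ins, and the names `extract_dna`, `dna_distance`, and all dimensions are hypothetical.

```python
import numpy as np


def extract_dna(response_embeddings, dna_dim, seed=0):
    """Project one model's response embeddings to a low-dimensional vector.

    response_embeddings: (n_prompts, embed_dim) array (assumed input format).
    The same seed must be used for every model so that all models share
    one projection matrix and their DNA vectors are comparable.
    """
    flat = np.asarray(response_embeddings, dtype=np.float64).ravel()
    rng = np.random.default_rng(seed)
    # Gaussian random projection, scaled so squared norms are preserved
    # in expectation (the construction behind the Johnson-Lindenstrauss lemma).
    projection = rng.standard_normal((dna_dim, flat.size)) / np.sqrt(dna_dim)
    return projection @ flat


def dna_distance(dna_a, dna_b):
    """Euclidean distance between two DNA vectors."""
    return float(np.linalg.norm(dna_a - dna_b))


# Synthetic stand-ins for real response embeddings (assumption for illustration).
data_rng = np.random.default_rng(42)
n_prompts, embed_dim, dna_dim = 20, 64, 128
base = data_rng.standard_normal((n_prompts, embed_dim))
# A "fine-tuned" model: small perturbation of the base model's behavior.
fine_tuned = base + 0.1 * data_rng.standard_normal((n_prompts, embed_dim))
# An unrelated model: independent behavior.
unrelated = data_rng.standard_normal((n_prompts, embed_dim))

dna_base = extract_dna(base, dna_dim)
dna_fine_tuned = extract_dna(fine_tuned, dna_dim)
dna_unrelated = extract_dna(unrelated, dna_dim)

d_close = dna_distance(dna_base, dna_fine_tuned)
d_far = dna_distance(dna_base, dna_unrelated)
# Ratio of projected distance to original distance; close to 1 when
# the projection approximately preserves distances.
ratio = d_far / float(np.linalg.norm(base - unrelated))
print(d_close < d_far, ratio)
```

In this toy setup the perturbed model stays much closer to its parent in DNA space than the independent model does, which is the property a lineage analysis would rely on. How real response embeddings are produced and aggregated is left open here.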
Abstract
The explosive growth of large language models (LLMs) has created a vast but opaque landscape: millions of models exist, yet their evolutionary relationships through fine-tuning, distillation, or adaptation are often undocumented or unclear, complicating LLM management. Existing methods are limited by task specificity, fixed model sets, or strict assumptions about tokenizers or architectures. Inspired by biological DNA, we address these limitations by mathematically defining LLM DNA as a low-dimensional, bi-Lipschitz representation of functional behavior. We prove that LLM DNA satisfies inheritance and genetic determinism properties and establish the existence of DNA. Building on this theory, we derive a general, scalable, training-free pipeline for DNA extraction. In experiments across 305 LLMs, DNA aligns with prior studies on limited subsets and achieves superior or competitive performance on specific tasks. Beyond these tasks, DNA comparisons uncover previously undocumented relationships among LLMs. We further construct the evolutionary tree of LLMs using phylogenetic algorithms; the resulting tree aligns with the shift from encoder-decoder to decoder-only architectures, reflects temporal progression, and reveals distinct evolutionary speeds across LLM families.
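For readers unfamiliar with the term, the standard bi-Lipschitz condition and the Johnson-Lindenstrauss guarantee can be written as follows. The notation is an assumption for illustration and may differ from the paper's: $\mathcal{F}$ denotes the space of LLM functions with metric $d_{\mathcal{F}}$, $\phi$ the DNA map, and $c_1, c_2$ the Lipschitz constants.

```latex
% Bi-Lipschitz embedding (standard definition; notation assumed):
% there exist constants 0 < c_1 \le c_2 such that for all f, g \in \mathcal{F},
\[
  c_1 \, d_{\mathcal{F}}(f, g)
  \;\le\; \lVert \phi(f) - \phi(g) \rVert_2
  \;\le\; c_2 \, d_{\mathcal{F}}(f, g).
\]

% Johnson-Lindenstrauss lemma (standard statement):
% for any 0 < \varepsilon < 1 and any n points in \mathbb{R}^D,
% there is a linear map A : \mathbb{R}^D \to \mathbb{R}^k
% with k = O(\varepsilon^{-2} \log n) such that for all pairs u, v of these points,
\[
  (1 - \varepsilon) \, \lVert u - v \rVert_2^2
  \;\le\; \lVert A u - A v \rVert_2^2
  \;\le\; (1 + \varepsilon) \, \lVert u - v \rVert_2^2 .
\]
```

The lower bound means functionally distinct models cannot collapse to the same DNA, and the upper bound means functionally similar models, such as a base model and a lightly fine-tuned descendant, stay close in DNA space.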
