PhyloLM: Inferring the Phylogeny of Large Language Models and Predicting their Performances in Benchmarks
Nicolas Yax, Pierre-Yves Oudeyer, Stefano Palminteri
TL;DR
PhyloLM applies phylogenetic methods to Large Language Models by treating prompts as genes and tokens as alleles, which yields distance-based dendrograms relating 111 open and 45 closed models. The approach uses Nei-like distances and Neighbour Joining to infer model genealogies, and shows that the resulting phylogenetic distance can predict benchmark performance with substantial accuracy, offering a cost-efficient alternative to exhaustive benchmarking. The dendrograms cluster clearly by model family and are sensitive to prompting modality (completion vs. chat). The authors discuss limitations related to common ancestors and tokenization, and outline future directions such as additional gene sets and broader evaluation. Overall, PhyloLM provides a scalable framework for studying LLM evolution and functional capabilities when access to training details is limited.
Abstract
This paper introduces PhyloLM, a method that adapts phylogenetic algorithms to Large Language Models (LLMs) in order to explore whether and how they relate to each other and to predict their performance characteristics. Our method computes a phylogenetic distance metric from the similarity of LLMs' outputs. This metric is then used to construct dendrograms, which satisfactorily capture known relationships across a set of 111 open-source and 45 closed models. Furthermore, our phylogenetic distance predicts performance on standard benchmarks, which demonstrates its functional validity and paves the way for time- and cost-effective estimation of LLM capabilities. In sum, by translating concepts from population genetics to machine learning, we propose and validate a tool for evaluating LLM development, relationships, and capabilities, even in the absence of transparent training information.
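To make the analogy concrete, the following minimal Python sketch computes a distance between two models from their sampled outputs, with each prompt playing the role of a gene and each sampled token the role of an allele. It assumes Nei's standard genetic distance, which may differ in detail from the Nei-like distance used in the paper. The function names (`allele_frequencies`, `nei_distance`) and the toy data are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def allele_frequencies(samples):
    """Turn {gene (prompt): [sampled tokens]} into {gene: {token: frequency}}."""
    freqs = {}
    for gene, tokens in samples.items():
        counts = Counter(tokens)
        total = sum(counts.values())
        freqs[gene] = {tok: c / total for tok, c in counts.items()}
    return freqs


def nei_distance(samples_a, samples_b):
    """Nei's standard genetic distance between two models (illustrative assumption).

    D = -ln( sum_g sum_t p_a * p_b / sqrt(sum p_a^2 * sum p_b^2) ),
    computed over the genes (prompts) that both models were sampled on.
    """
    pa = allele_frequencies(samples_a)
    pb = allele_frequencies(samples_b)
    genes = pa.keys() & pb.keys()

    # Probability that both models emit the same token, summed over genes.
    cross = sum(pa[g].get(tok, 0.0) * p for g in genes for tok, p in pb[g].items())
    # Probability that a model agrees with itself, summed over genes.
    self_a = sum(p * p for g in genes for p in pa[g].values())
    self_b = sum(p * p for g in genes for p in pb[g].values())
    if self_a == 0 or self_b == 0:
        raise ValueError("models must share at least one gene with sampled tokens")

    similarity = cross / math.sqrt(self_a * self_b)
    # No shared alleles at all: the distance is unbounded.
    return math.inf if similarity == 0 else -math.log(similarity)


# Toy data: one gene (prompt) with four sampled tokens per model.
model_a = {"gene_1": ["the", "the", "a", "a"]}
model_b = {"gene_1": ["the", "the", "the", "the"]}
distance_ab = nei_distance(model_a, model_b)
```

Computing this distance for every pair of models gives a symmetric distance matrix, which can then be passed to any Neighbour Joining implementation to obtain a dendrogram.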
