PhyloLM : Inferring the Phylogeny of Large Language Models and Predicting their Performances in Benchmarks

Nicolas Yax, Pierre-Yves Oudeyer, Stefano Palminteri

TL;DR

PhyloLM applies phylogenetic methods to the study of Large Language Models by treating prompts as genes and tokens as alleles, enabling distance-based dendrograms that reveal relationships among 111 open and 45 closed models. The approach uses Nei-like distances and Neighbour Joining to infer model genealogies, and it demonstrates that the resulting phylogenetic distance can predict benchmark performance with substantial accuracy, offering a cost-efficient alternative to exhaustive benchmarking. The work highlights clear clustering by model family, shows sensitivity to prompting modality (completion vs. chat), and discusses limitations related to common ancestors and tokenization, while outlining future directions such as additional gene sets and broader evaluation. Overall, PhyloLM provides a scalable framework for understanding LLM evolution and functional capabilities in settings with limited access to training details.
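
To make the gene/allele analogy concrete, here is a minimal sketch of Nei's standard genetic distance, D = -ln(J_xy / sqrt(J_x J_y)), computed from token frequencies. It assumes each model has already been sampled several times on the same prompts; the function names, the prompt strings, and the sampled tokens are hypothetical, and the "Nei-like" distance used by PhyloLM may differ in its details.

```python
import math
from collections import Counter


def allele_frequencies(samples):
    """Map each gene (prompt) to the relative frequency of each allele (token)."""
    freqs = {}
    for gene, tokens in samples.items():
        counts = Counter(tokens)
        total = sum(counts.values())
        freqs[gene] = {tok: c / total for tok, c in counts.items()}
    return freqs


def nei_distance(freq_x, freq_y):
    """Nei's standard genetic distance over the genes shared by both models."""
    genes = freq_x.keys() & freq_y.keys()
    # J_xy: probability that one allele drawn from each model is identical.
    j_xy = sum(p * freq_y[g].get(a, 0.0) for g in genes for a, p in freq_x[g].items())
    # J_x, J_y: the same quantity within each model.
    j_x = sum(p * p for g in genes for p in freq_x[g].values())
    j_y = sum(p * p for g in genes for p in freq_y[g].values())
    if j_xy == 0.0:
        return math.inf  # no shared alleles (or no shared genes)
    return -math.log(j_xy / math.sqrt(j_x * j_y))


# Hypothetical data: 4 sampled tokens per prompt for two models.
model_a = allele_frequencies({"2 + 2 =": ["4", "4", "4", "5"], "Capital of France:": ["Paris"] * 4})
model_b = allele_frequencies({"2 + 2 =": ["4", "4", "5", "5"], "Capital of France:": ["Paris"] * 4})
print(nei_distance(model_a, model_b))
```

Computing this quantity for every pair of models yields the distance matrix from which a dendrogram can be built.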

Abstract

This paper introduces PhyloLM, a method adapting phylogenetic algorithms to Large Language Models (LLMs) to explore whether and how they relate to each other and to predict their performance characteristics. Our method calculates a phylogenetic distance metric based on the similarity of LLMs' output. The resulting metric is then used to construct dendrograms, which satisfactorily capture known relationships across a set of 111 open-source and 45 closed models. Furthermore, our phylogenetic distance predicts performance in standard benchmarks, thus demonstrating its functional validity and paving the way for a time and cost-effective estimation of LLM capabilities. To sum up, by translating population genetic concepts to machine learning, we propose and validate a tool to evaluate LLM development, relationships and capabilities, even in the absence of transparent training information.
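
The step from a distance matrix to a dendrogram can be illustrated with a compact sketch of the textbook Neighbour Joining procedure. This is not the authors' implementation: the function name, the `node0`, `node1`, ... labels for internal nodes, and the edge-list output format are assumptions chosen for illustration.

```python
def neighbour_joining(names, dist):
    """Build an unrooted tree from a symmetric distance matrix.

    names: list of at least two leaf labels.
    dist:  list of lists, dist[i][j] = distance between names[i] and names[j].
    Returns a list of edges (node, other_node, branch_length).
    """
    # Dict-of-dicts copy so that new internal nodes can be added freely.
    d = {a: {b: dist[i][j] for j, b in enumerate(names)} for i, a in enumerate(names)}
    active = list(names)
    edges = []
    next_id = 0
    while len(active) > 2:
        n = len(active)
        # r[a]: total distance from a to every other active node.
        r = {a: sum(d[a][b] for b in active) for a in active}
        # Join the pair minimising Q(a, b) = (n - 2) * d(a, b) - r(a) - r(b).
        a, b = min(
            ((x, y) for i, x in enumerate(active) for y in active[i + 1:]),
            key=lambda p: (n - 2) * d[p[0]][p[1]] - r[p[0]] - r[p[1]],
        )
        u = f"node{next_id}"  # hypothetical label for the new internal node
        next_id += 1
        len_a = 0.5 * d[a][b] + (r[a] - r[b]) / (2 * (n - 2))
        len_b = d[a][b] - len_a
        edges.append((a, u, len_a))
        edges.append((b, u, len_b))
        # Distances from the new node to every remaining node.
        d[u] = {u: 0.0}
        for k in active:
            if k not in (a, b):
                d[u][k] = d[k][u] = 0.5 * (d[a][k] + d[b][k] - d[a][b])
        active = [k for k in active if k not in (a, b)] + [u]
    # Two nodes remain: connect them directly.
    a, b = active
    edges.append((a, b, d[a][b]))
    return edges
```

The returned edge list describes an unrooted tree; rooting and drawing it as a dendrogram are left out of this sketch.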

Paper Structure

This paper contains 47 sections, 2 equations, 33 figures, 5 tables, and 1 algorithm.

Figures (33)

  • Figure 1: Analogy between genetic studies of humans and genetic studies of LLMs. The first stage consists of selecting genes (for both humans and LLMs). Alleles are then collected for each individual in the population and used to compare populations (either populations of humans, or LLMs seen as populations). Finally, these data go through the Nei distance computation, which returns a distance matrix that can then be turned into dendrograms using the NJ algorithm, in the same way for both humans and LLMs.
  • Figure 2: Impact of hyperparameters on distance matrices in the 'math' set of genes. (a) Variability of distance matrices for different numbers of genes G and numbers of probes N in the math benchmark. Each set of genes of a given size contains genes that are different from and independent of those used for the other matrices, for a total of 8 distance matrices per data point in the figure. (b) Distance to the high-precision matrix built from 2048 genes and N=128 in the math benchmark. Error bars represent the standard error of the mean.
  • Figure 3: Phylogenetic tree reconstruction. The left panel shows the ground truth for the relationships among some LLMs of the Mistral family. The right panel shows the tree reconstructed by PhyloLM on the 'math' set of genes, run on the five latest models of this family (the "leaves" of the phylogenetic tree). The numerical labels (0:3) map the true common ancestors (left, "ground truth") to the inferred ones (right, "reconstructed"). The true and reconstructed trees are topologically equivalent.
  • Figure 4: Inferred phylogenetic tree of LLMs on the 'math' set of genes. (a) Completion models include all open-source models in our study and the 14 OpenAI completion models. (b) Chat models include additional proprietary models. Completion and chat models were separated because they are not comparable, due to the additional prompting added by the API. Llama models have been split by version of the pretrained model and by number of parameters.
  • Figure 5: Predictions from the logistic regression compared to ground truth for every model (leave-one-family-out method) on the ARC benchmark. (a) Scatter plot showing the fit of the logistic regression on all models but the OPT family (in grey) and the regression's prediction of OPT performance (in red). (b) Predictions from the logistic regression for each family. To predict a family, the regressor is fit on all other families and then predicts the scores of the models from the held-out family (leave-one-family-out method, see (a)).
  • ...and 28 more figures
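
The leave-one-family-out protocol mentioned in the Figure 5 caption can be sketched as a simple split generator. The function name and the family labels below are hypothetical, and the features and the regressor itself are deliberately left out, since this page does not specify them.

```python
def leave_one_family_out(families):
    """Yield (held_out_family, train_indices, test_indices) for each family.

    families: list where families[i] is the family label of model i.
    """
    for held_out in sorted(set(families)):
        train = [i for i, fam in enumerate(families) if fam != held_out]
        test = [i for i, fam in enumerate(families) if fam == held_out]
        yield held_out, train, test


# Hypothetical family labels for five models.
families = ["opt", "opt", "llama", "mistral", "llama"]
for held_out, train, test in leave_one_family_out(families):
    # A regressor would be fit on `train` and evaluated on `test` here.
    print(held_out, train, test)
```

Holding out a whole family at a time keeps close relatives of the test models out of the training set, which makes the evaluation stricter than a random split.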