A Phylogenetic Approach to Genomic Language Modeling
Carlos Albors, Jianan Canal Li, Gonzalo Benegas, Chengzhong Ye, Yun S. Song
TL;DR
This work addresses the limited zero-shot performance of genomic language models in identifying evolutionarily constrained regions. It introduces PhyloGPN, a model trained to model nucleotide evolution on phylogenetic trees, using a loss derived from the F81 substitution model and whole-genome alignments. Training integrates phylogenetic data, but inference does not require multiple sequence alignments (MSAs), which broadens applicability and enables transfer learning. PhyloGPN achieves state-of-the-art or competitive results on multiple transfer-learning benchmarks and variant-effect tasks, notably ClinVar classification and the Disease VEP task in the BEND suite, and also performs strongly in embedding-based evaluations. The approach bridges classical phylogenetics and modern neural language modeling to improve cross-species interpretability of genomes, and it points to future gains from richer evolutionary models such as GTR and from incorporating gene trees.
Abstract
Genomic language models (gLMs) have shown mostly modest success in identifying evolutionarily constrained elements in mammalian genomes. To address this issue, we introduce a novel framework for training gLMs that explicitly models nucleotide evolution on phylogenetic trees using multispecies whole-genome alignments. Our approach integrates an alignment into the loss function during training but does not require it for making predictions, thereby enhancing the model's applicability. We applied this framework to train PhyloGPN, a model that excels at predicting functionally disruptive variants from a single sequence alone and demonstrates strong transfer learning capabilities.
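To make the idea of a loss derived from the F81 substitution model more concrete, the following is a minimal sketch of how the negative log-likelihood of a single alignment column could be computed on a small tree with Felsenstein's pruning algorithm. It is an illustration under stated assumptions, not the authors' implementation: the function names, the nested-list tree encoding, the toy three-species tree, and its branch lengths are all hypothetical, and the stationary distribution `pi` stands in for whatever per-site parameters a network might output.

```python
import math

NUCS = "ACGT"

def f81_transition(pi, t):
    """F81 matrix P[i][j] = P(state j after branch length t | state i)."""
    # Normalize the rate so that one unit of t is one expected substitution.
    beta = 1.0 / (1.0 - sum(p * p for p in pi))
    decay = math.exp(-beta * t)
    return [[decay * (i == j) + (1.0 - decay) * pi[j] for j in range(4)]
            for i in range(4)]

def partial_likelihood(node, pi, column):
    """Felsenstein pruning: P(observed leaves below node | node state x)."""
    if isinstance(node, str):  # a leaf is encoded as a species name
        return [1.0 if NUCS[x] == column[node] else 0.0 for x in range(4)]
    result = [1.0] * 4
    for child, branch_length in node:  # internal node: list of (child, length)
        P = f81_transition(pi, branch_length)
        child_L = partial_likelihood(child, pi, column)
        for x in range(4):
            result[x] *= sum(P[x][y] * child_L[y] for y in range(4))
    return result

def column_nll(tree, pi, column):
    """Negative log-likelihood of one alignment column under F81."""
    root_L = partial_likelihood(tree, pi, column)
    return -math.log(sum(pi[x] * root_L[x] for x in range(4)))

# Hypothetical toy tree and branch lengths, for illustration only.
tree = [("human", 0.1), ([("mouse", 0.2), ("rat", 0.2)], 0.3)]
column = {"human": "A", "mouse": "A", "rat": "A"}

uniform = [0.25, 0.25, 0.25, 0.25]
a_favoring = [0.97, 0.01, 0.01, 0.01]  # a site that strongly prefers A
print(column_nll(tree, uniform, column))
print(column_nll(tree, a_favoring, column))
```

In this sketch, a stationary distribution concentrated on the conserved nucleotide yields a lower loss for a fully conserved column than a uniform one does, which is the intuition behind using such a likelihood as a training signal. At prediction time only the per-site parameters would be needed, not the alignment.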
