A Phylogenetic Approach to Genomic Language Modeling

Carlos Albors, Jianan Canal Li, Gonzalo Benegas, Chengzhong Ye, Yun S. Song

TL;DR

This work tackles the limited zero-shot performance of genomic language models in identifying evolutionarily constrained regions by introducing PhyloGPN, a model trained to simulate nucleotide evolution on phylogenetic trees using a loss derived from the F81 substitution model and whole-genome alignments. Training integrates phylogenetic data, but inference does not require multiple sequence alignments (MSAs), enabling broader applicability and transfer learning. PhyloGPN achieves state-of-the-art or competitive results on multiple transfer-learning benchmarks and variant-effect tasks, notably excelling on ClinVar classifications and Disease VEP in the BEND suite, while also providing strong embedding-based evaluations. The approach bridges classical phylogenetics with modern neural language modeling to improve cross-species interpretability of genomes and points to future gains from richer evolutionary models like GTR and from incorporating gene trees.
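
For readers unfamiliar with the F81 substitution model named above, the following is a minimal sketch of its standard textbook transition probabilities, $P_{ij}(t) = e^{-\beta t}\,\delta_{ij} + (1 - e^{-\beta t})\,\pi_j$, where $\pi$ is the stationary nucleotide distribution. The function name, the A/C/G/T ordering, and the rate normalization $\beta = 1 / (1 - \sum_i \pi_i^2)$ are assumptions made for illustration; this summary does not specify how PhyloGPN parameterizes the model or builds its loss from it.

```python
import numpy as np

def f81_transition_matrix(pi, t):
    """Textbook F81 transition matrix P(t); rows are the source base, columns the target.

    Hypothetical helper for illustration only, not PhyloGPN's code.
    pi: stationary frequencies for (A, C, G, T), assumed ordering.
    t:  branch length in expected substitutions per site.
    """
    pi = np.asarray(pi, dtype=float)
    if pi.shape != (4,) or np.any(pi < 0) or not np.isclose(pi.sum(), 1.0):
        raise ValueError("pi must be a length-4 probability vector")
    if t < 0:
        raise ValueError("t must be non-negative")
    denom = 1.0 - np.sum(pi ** 2)
    if np.isclose(denom, 0.0):
        raise ValueError("pi must not put all mass on a single base")
    # Assumed normalization: one expected substitution per unit of time.
    beta = 1.0 / denom
    decay = np.exp(-beta * t)
    # With probability `decay` nothing happens; otherwise the base is redrawn from pi.
    return decay * np.eye(4) + (1.0 - decay) * np.tile(pi, (4, 1))

pi = np.array([0.3, 0.2, 0.2, 0.3])
P = f81_transition_matrix(pi, t=0.1)
```

Each row of `P` sums to one, `P` is the identity at `t = 0`, and every row approaches `pi` as `t` grows, which is what makes `pi` interpretable as the long-run nucleotide preference at a site.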

Abstract

Genomic language models (gLMs) have shown mostly modest success in identifying evolutionarily constrained elements in mammalian genomes. To address this issue, we introduce a novel framework for training gLMs that explicitly models nucleotide evolution on phylogenetic trees using multispecies whole-genome alignments. Our approach integrates an alignment into the loss function during training but does not require it for making predictions, thereby enhancing the model's applicability. We applied this framework to train PhyloGPN, a model that excels at predicting functionally disruptive variants from a single sequence alone and demonstrates strong transfer learning capabilities.

Paper Structure

This paper contains 21 sections, 1 theorem, 6 equations, 7 figures, 4 tables.

Key Result

Proposition 1

The inequality $1 - e^{-e^x} > \frac{e^x}{1 + e^x}$ holds for all $x \in \mathbb{R}.$
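
A short sketch of why this inequality holds, offered as one possible argument and not necessarily the proof given in the paper: substitute $y = e^x$, so that $y > 0$ and the right-hand side (the logistic sigmoid of $x$) becomes $y / (1 + y)$. Then

$$
1 - e^{-y} > \frac{y}{1 + y}
\iff \frac{1}{1 + y} > e^{-y}
\iff e^{y} > 1 + y .
$$

The last inequality is true for every $y > 0$ because $e^{y} = 1 + y + \frac{y^2}{2} + \cdots$ and all omitted terms are positive. Since $y = e^x$ ranges over exactly the positive reals as $x$ ranges over $\mathbb{R}$, the original inequality holds for all real $x$.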

Figures (7)

  • Figure 1: Illustration of PhyloGPN's modeling framework. The input data consist of 481 bp windows from the human reference genome GRCh38 and the alignment columns are obtained from a whole-genome alignment of 447 mammalian species to GRCh38.
  • Figure 2: Architecture of PhyloGPN. A convolutional layer with $N$ input channels, $M$ output channels, a kernel size $K$, and a dilation rate $R$ is noted as having hyperparameters $[N, M, K, R]$. The acronym "RCE" stands for "reverse-complement equivariant."
  • Figure 3: Results on predicting clinical labels from LLRs. (a) Results on ClinVar. (b) Results on classifying pathogenic regulatory variants. We show AUPRC results for various negative sets of SNPs above a minimum MAF threshold.
  • Figure 4: Comparison of results for PhyloGPN and baseline models on the task of ranking substitutions in DMS experiments. Each point corresponds to an experiment. The dotted lines are $y = x$ lines.
  • Figure 5: Confusion matrices for the Gene Finding task for PhyloGPN and PhyloGPN-X. Entries in cells are the percentage of instances with a true label that were predicted to have a given label. The acronyms "FS" and "RS" stand for "forward strand" and "reverse strand," respectively.
  • ...and 2 more figures
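
To make the $[N, M, K, R]$ notation from the Figure 2 caption concrete, here is a small sketch that computes the receptive field of a stack of dilated 1D convolutions using the standard formula $1 + \sum_i (K_i - 1) R_i$. It assumes stride 1 and no pooling. The layer list is hypothetical and chosen only for illustration; it is not PhyloGPN's actual architecture, which this summary does not enumerate.

```python
from typing import Iterable, Tuple

# (N input channels, M output channels, K kernel size, R dilation rate)
ConvSpec = Tuple[int, int, int, int]

def receptive_field(layers: Iterable[ConvSpec]) -> int:
    """Receptive field, in positions, of stacked stride-1 dilated convolutions."""
    rf = 1
    prev_out = None
    for n_in, n_out, kernel, dilation in layers:
        if prev_out is not None and n_in != prev_out:
            raise ValueError("input channels must match previous output channels")
        if kernel < 1 or dilation < 1:
            raise ValueError("kernel size and dilation rate must be positive")
        # Each layer widens the field by (K - 1) * R positions.
        rf += (kernel - 1) * dilation
        prev_out = n_out
    return rf

# Hypothetical stack, NOT the layers shown in Figure 2.
HYPOTHETICAL_STACK = [(4, 8, 3, 1), (8, 8, 3, 2), (8, 8, 3, 4)]
print(receptive_field(HYPOTHETICAL_STACK))  # 15
```

The channel counts $N$ and $M$ do not affect the receptive field; they are checked here only so that the stack is internally consistent.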

Theorems & Definitions (2)

  • Proposition 1
  • proof