PETRA: Pretrained Evolutionary Transformer for SARS-CoV-2 Mutation Prediction

Xu Zou

TL;DR

PETRA reframes SARS-CoV-2 mutation prediction by training a decoder-only transformer on evolutionary trajectories derived from phylogenetic trees rather than on raw RNA sequences, which reduces the sequencing noise that affects sequence-based models. The approach combines a unified mutation trajectory tokenizer with weighted sampling that compensates for geographical and temporal data imbalances, enabling short-horizon mutation predictions. In evaluation, PETRA substantially outperforms Bloom estimators, reaching a weighted recall@1 of $9.45\%$ for nucleotide mutations and $17.10\%$ for spike amino-acid mutations (versus $0.49\%$ and $6.64\%$ for the best baseline), and it demonstrates real-time mutation forecasting for major clades such as 24F(XEC) and 25A(LP.8.1). Limitations include an inability to model recombination and other viral features, plus persistent data-imbalance challenges; future work aims to handle recombination and to extend the model to phenotypic predictions for greater public health utility.

Abstract

Since its emergence, SARS-CoV-2 has followed a rapid and unpredictable evolutionary trajectory, characterized by the continual emergence of immune-evasive variants. This poses persistent challenges to public health and vaccine development. While large-scale generative pre-trained transformers (GPTs) have revolutionized the modeling of sequential data, their direct application to noisy viral genomic sequences is limited. In this paper, we introduce PETRA (Pretrained Evolutionary TRAnsformer), a novel transformer approach based on evolutionary trajectories derived from phylogenetic trees rather than raw RNA sequences. This method effectively mitigates sequencing noise and captures the hierarchical structure of viral evolution. With a weighted training framework to address substantial geographical and temporal imbalances in global sequence data, PETRA excels in predicting future SARS-CoV-2 mutations, achieving a weighted recall@1 of 9.45% for nucleotide mutations and 17.10% for spike amino-acid mutations, compared to 0.49% and 6.64%, respectively, for the best baseline. PETRA also demonstrates its ability to aid real-time mutation prediction for major clades such as 24F(XEC) and 25A(LP.8.1). The code is open-sourced at https://github.com/xz-keg/PETra

Paper Structure

This paper contains 34 sections, 3 equations, 6 figures, 11 tables.

Figures (6)

  • Figure 1: Left: An example phylogenetic tree. The virus starts from a root and accumulates mutations sequentially. Different branches may share some convergent mutations. Lineages with a sufficient number of mutations are designated as variants. Right: List of mutations for each sequence in the phylogenetic tree on the left. Each sequence has shared variant mutations and its own private sequence mutations.
  • Figure 2: The training and inference of PETRA. Each sequence is encoded as location and time information, variant mutations, and sequence mutations. During training, the loss is computed on the full evolutionary trajectory. During inference and evaluation, the model predicts the sequence mutations.
  • Figure 3: Distribution of SARS-CoV-2 sequences by country type. Developing and least developed countries are seriously underrepresented.
  • Figure 4: Distributed weighted sampling process of PETRA. Each worker maintains a local accumulator $l$. For each random sequence received from the data pool, the worker adds the sequence's probability $p$ to $l$. Only sequences that make $l$ meet or exceed the next integer are selected.
  • Figure 5: Performance of PETRA under different immune backgrounds. Left: Weighted recall@$k$ with and without location information. Right: Weighted recall@$k$ by sampling time.
  • ...and 1 more figure
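
The accumulator-based selection described in the Figure 4 caption can be illustrated with a minimal single-worker sketch. This is not the authors' implementation: the function name `accumulator_sample`, the `start` offset, and the representation of the data pool as `(item, p)` pairs are assumptions made for illustration, and each probability `p` is assumed to lie in $[0, 1]$ so that at most one integer is crossed per sequence.

```python
import math
from typing import Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def accumulator_sample(
    stream: Iterable[Tuple[T, float]], start: float = 0.0
) -> Iterator[T]:
    """Yield the items whose probability pushes the accumulator to the next integer.

    Hypothetical helper: `stream` yields (item, p) pairs with 0 <= p <= 1,
    and `start` is the initial value of the worker's local accumulator l.
    """
    acc = start  # local accumulator l kept by one worker
    for item, p in stream:
        next_integer = math.floor(acc) + 1  # threshold l must meet or exceed
        acc += p  # accumulate the sequence's probability
        if acc >= next_integer:
            yield item  # this sequence is selected


if __name__ == "__main__":
    # Probabilities are powers of two so the running sum is exact in floating point.
    pool = [("a", 0.5), ("b", 0.25), ("c", 0.25), ("d", 1.0), ("e", 0.5), ("f", 0.5)]
    print(list(accumulator_sample(pool)))  # prints ['c', 'd', 'f']
```

In this sketch the number of selected items tracks the running sum of the probabilities (here the sum is 3.0 and three items are selected), so each worker needs only its own accumulator and no shared state. With probabilities that are not exactly representable in floating point, the running sum can fall marginally short of an integer, so a production version would likely need a tolerance or integer arithmetic.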