ViroGym: Realistic Large-Scale Benchmarks for Evaluating Viral Proteins

Yichen Zhou, Jonathan Golob, Amir Karimi, Stefan Bauer, Patrick Schwab

TL;DR

ViroGym is a comprehensive benchmark for evaluating variant effect prediction in viral proteins and for supporting the rational selection of antigen candidates. The authors show that pLMs selected using in vitro experimental data excel at predicting dominant circulating mutations in the real world.

Abstract

Protein language models (pLMs) have shown strong potential for predicting the functional effects of missense variants in zero-shot settings. Despite this progress, benchmarking of pLMs on viral proteins remains limited, and systematic strategies for integrating in silico metrics with in vitro validation to guide antigen and target selection are underdeveloped. Here, we introduce ViroGym, a comprehensive benchmark designed to evaluate variant effect prediction in viral proteins and to facilitate the rational selection of antigen candidates. We curated 79 deep mutational scanning (DMS) assays encompassing eukaryotic viruses, collectively comprising 552,937 mutated amino acid sequences across 7 distinct phenotypic readouts, together with 21 influenza virus neutralisation tasks and a real-world predictive task for SARS-CoV-2. We benchmark well-established pLMs on fitness landscapes, antigenic diversity, and pandemic forecasting to provide a framework for vaccine selection, and show that pLMs selected using in vitro experimental data excel at predicting dominant circulating mutations in the real world.

Paper Structure

This paper contains 16 sections, 1 equation, 16 figures, 10 tables.

Figures (16)

  • Figure 1: ViroGym benchmark framework. The benchmark consists of two major components: in vitro experimental evaluation and real-world prediction tasks. The in vitro evaluation uses experimental measurements from DMS assays and neutralisation assays to evaluate model performance on protein functional effects. The real-world component evaluates models on SARS-CoV-2 pandemic forecasting using viral sequence data from the GISAID database, capturing how well models generalise from controlled wet-lab settings to natural viral evolution.
  • Figure 2: SARS-CoV-2 Spike Protein Mutation Heat Map. This heat map displays the frequency of 21 potential amino acid substitutions across 1273 residues of the SARS-CoV-2 Spike protein, with colour intensity indicating mutation frequency at each position. Data were collected from the GISAID database between January 2020 and May 2025.
  • Figure 3: Task-wise comparison of ESM2 15B and ProGen2-XL on the DMS benchmark. ESM2 15B scores are computed using the semantic scoring strategy, while ProGen2-XL scores use the negative log-likelihood strategy. Reported values represent the absolute Spearman’s rank correlation between model fitness scores and experimental measurements.
  • Figure 4: Task-wise performance of Tranception M on the neutralisation benchmark. Antigenicity scores are computed using the negative log-likelihood strategy. Each task corresponds to a vaccine strain representing post-vaccination serum, with colours indicating influenza A subtypes. Performance is measured as the absolute Spearman’s rank correlation between predicted antigenicity scores and experimental measurements, averaged across sera from different animal sources.
  • Figure 5: Overlap among top 10 mutations from computational predictions, in vitro DMS assays of the SARS-CoV-2 S protein RBD, and naturally occurring mutations from GISAID. a. ESM2-650M predictions show no overlap with DMS or GISAID mutations. b. Predicted mutations from ProGen2-XL overlap 50% with GISAID and 20% with DMS.
  • ...and 11 more figures
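
The captions above rely on two simple quantities: the absolute Spearman's rank correlation between model scores and experimental measurements (Figures 3 and 4), and the overlap among top-10 mutation lists (Figure 5). The following is a minimal Python sketch of how such quantities could be computed. It is not the authors' evaluation code: the function names `average_ranks`, `abs_spearman`, and `top_k_overlap`, the tie-handling choice (average ranks), and the toy inputs are all assumptions made for illustration.

```python
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def abs_spearman(scores: Sequence[float], measurements: Sequence[float]) -> float:
    """Absolute Spearman correlation: |Pearson correlation of the ranks|."""
    if len(scores) != len(measurements) or len(scores) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(scores), average_ranks(measurements)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return abs(cov / (var_x * var_y) ** 0.5)


def top_k_overlap(predicted: Sequence[str], reference: Sequence[str], k: int = 10) -> float:
    """Fraction of the top-k predicted items that also appear in the top-k reference."""
    return len(set(predicted[:k]) & set(reference[:k])) / k


# Toy data (hypothetical, not taken from the paper).
toy_nll = [2.1, 0.4, 3.3, 1.0]      # model scores, e.g. negative log-likelihoods
toy_fitness = [0.2, 0.9, 0.1, 0.5]  # experimental measurements
print(abs_spearman(toy_nll, toy_fitness))

# Both lists are assumed to be sorted from highest to lowest priority.
toy_predicted = ["mut_a", "mut_b", "mut_c", "mut_d"]
toy_observed = ["mut_b", "mut_a", "mut_e", "mut_f"]
print(top_k_overlap(toy_predicted, toy_observed, k=4))
```

In the toy example the scores and measurements are perfectly anti-correlated, so the raw Spearman coefficient is negative while its absolute value is 1. Taking the absolute value makes results comparable across scoring strategies whose sign conventions differ. The overlap function divides by `k`, so an overlap of 50% at `k = 10` means five shared mutations.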