Mitigating the Antigenic Data Bottleneck: Semi-supervised Learning with Protein Language Models for Influenza A Surveillance
Yanhua Xu
TL;DR
This work addresses the antigenicity labeling bottleneck for Influenza A by combining Protein Language Models (PLMs) with Semi-Supervised Learning (SSL) to predict haemagglutinin (HA) antigenicity from largely unlabeled genomic data. Four PLM embeddings (ESM-2, ProtVec, ProtT5, ProtBert) are evaluated with two SSL strategies (Self-training, Label Spreading) under nested cross-validation and simulated label scarcity. SSL improves performance when labels are scarce, and ESM-2 stays robust, with F1 scores above 0.82 using only 25% of the labels. Performance depends on both subtype and embedding: H1N1 and H9N2 are predicted well, whereas H3N2 remains challenging, and SSL mitigates some of that difficulty but may not fully overcome it. Overall, the PLM+SSL framework offers a data-efficient way to prioritize variants and support timely vaccine strain selection in surveillance systems.
Abstract
Influenza A viruses (IAVs) evolve antigenically at a pace that requires frequent vaccine updates, yet the haemagglutination inhibition (HI) assays used to quantify antigenicity are labor-intensive and unscalable. As a result, genomic data vastly outpace available phenotypic labels, limiting the effectiveness of traditional supervised models. We hypothesize that combining pre-trained Protein Language Models (PLMs) with Semi-Supervised Learning (SSL) can retain high predictive accuracy even when labeled data are scarce. We evaluated two SSL strategies, Self-training and Label Spreading, against fully supervised baselines using four PLM-derived embeddings (ESM-2, ProtVec, ProtT5, ProtBert) applied to haemagglutinin (HA) sequences. A nested cross-validation framework simulated low-label regimes (25%, 50%, 75%, and 100% label availability) across four IAV subtypes (H1N1, H3N2, H5N1, H9N2). SSL consistently improved performance under label scarcity. Self-training with ProtVec produced the largest relative gains, showing that SSL can compensate for lower-resolution representations. ESM-2 remained highly robust, achieving F1 scores above 0.82 with only 25% labeled data, indicating that its embeddings capture key antigenic determinants. While H1N1 and H9N2 were predicted with high accuracy, the hypervariable H3N2 subtype remained challenging, although SSL mitigated the performance decline. These findings demonstrate that integrating PLMs with SSL can address the antigenicity labeling bottleneck and enable more effective use of unlabeled surveillance sequences, supporting rapid variant prioritization and timely vaccine strain selection.
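The comparison described in the abstract can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes scikit-learn, and uses synthetic feature vectors from make_classification as a stand-in for PLM embeddings of HA sequences. It also assumes a logistic-regression base classifier, a self-training confidence threshold of 0.9, and a k-nearest-neighbour kernel for Label Spreading; none of these choices come from the paper. The helper names mask_labels and evaluate are hypothetical. Label scarcity is simulated by hiding a fraction of the training labels (marked -1, scikit-learn's convention for unlabeled samples). For brevity, a single train/test split replaces the nested cross-validation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.semi_supervised import LabelSpreading, SelfTrainingClassifier


def mask_labels(y, label_fraction, rng):
    """Keep `label_fraction` of the labels; mark the rest as unlabeled (-1)."""
    y_masked = np.full_like(y, -1)
    n_keep = max(2, int(round(label_fraction * len(y))))
    keep = rng.choice(len(y), size=n_keep, replace=False)
    y_masked[keep] = y[keep]
    return y_masked


def evaluate(label_fraction, seed=0):
    """Return test F1 for a supervised baseline and two SSL strategies."""
    rng = np.random.default_rng(seed)

    # Assumption: synthetic vectors stand in for per-sequence PLM embeddings,
    # and binary labels stand in for antigenic classes.
    X, y = make_classification(
        n_samples=600, n_features=64, n_informative=16, random_state=seed
    )
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed
    )

    y_semi = mask_labels(y_train, label_fraction, rng)
    labeled = y_semi != -1
    scores = {}

    # Supervised baseline: trained on the labeled subset only.
    baseline = LogisticRegression(max_iter=1000)
    baseline.fit(X_train[labeled], y_train[labeled])
    scores["supervised"] = f1_score(y_test, baseline.predict(X_test))

    # Self-training: iteratively pseudo-labels confident unlabeled samples.
    self_training = SelfTrainingClassifier(
        LogisticRegression(max_iter=1000), threshold=0.9
    )
    self_training.fit(X_train, y_semi)
    scores["self_training"] = f1_score(y_test, self_training.predict(X_test))

    # Label Spreading: propagates labels over a k-nearest-neighbour graph.
    label_spreading = LabelSpreading(kernel="knn", n_neighbors=7)
    label_spreading.fit(X_train, y_semi)
    scores["label_spreading"] = f1_score(y_test, label_spreading.predict(X_test))

    return scores


if __name__ == "__main__":
    for fraction in (0.25, 0.50, 0.75):
        print(fraction, evaluate(fraction))
```

With all labels available, no samples are unlabeled and both SSL strategies have nothing to propagate, so the demo loop covers only the 25%, 50%, and 75% regimes. Scores on synthetic data say nothing about the reported results; the sketch only shows how the three training regimes differ in what data they see.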
