AntigenLM: Structure-Aware DNA Language Modeling for Influenza
Yue Pei, Xuebin Chi, Yu Kang
TL;DR
AntigenLM is a structure-aware, full-genome DNA language model for forecasting rapidly evolving influenza. It is pretrained on influenza A genomes concatenated in a fixed segment order. Its functional-unit encoding preserves genome-wide context, while sentinel tokens constrain segmentation during fine-tuning for forecasting and subtype classification. The model yields fewer amino-acid mismatches than state-of-the-art baselines in both next-month and next-season forecasts, generalizes robustly across subtypes and geographic regions, and attains near-perfect subtype classification; ablations confirm the importance of preserving genome structure. This work provides a generalizable framework for biologically grounded DNA foundation models, with direct implications for vaccine design and predictive genomics.
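To make the fixed-segment-order idea concrete, the following is a minimal sketch of how a full-genome input might be assembled. The eight-segment order (PB2, PB1, PA, HA, NP, NA, M, NS) is the standard influenza A numbering; the sentinel token format `<NAME>` and the function names are hypothetical, chosen for illustration, and are not taken from the AntigenLM implementation.

```python
# Standard influenza A segment numbering (segments 1 to 8).
SEGMENT_ORDER = ["PB2", "PB1", "PA", "HA", "NP", "NA", "M", "NS"]


def sentinel(name: str) -> str:
    """Hypothetical sentinel token marking the start of a segment."""
    return f"<{name}>"


def build_genome_input(segments: dict[str, str]) -> str:
    """Concatenate all eight segments in a fixed order.

    Each segment is preceded by its sentinel token so that segment
    boundaries stay explicit in the concatenated sequence.
    """
    missing = [name for name in SEGMENT_ORDER if name not in segments]
    if missing:
        raise ValueError(f"missing segments: {missing}")

    parts = []
    for name in SEGMENT_ORDER:
        parts.append(sentinel(name))
        parts.append(segments[name].upper())
    return "".join(parts)


if __name__ == "__main__":
    # Toy sequences only; real segments are roughly 0.9 to 2.3 kb long.
    toy = {name: "atgc" for name in SEGMENT_ORDER}
    print(build_genome_input(toy))
```

Requiring every segment to be present is a design assumption of this sketch: it keeps each segment at a fixed position relative to the others, which is the property the fixed ordering is meant to preserve.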
Abstract
Language models have advanced sequence analysis, yet DNA foundation models often lag behind task-specific methods for unclear reasons. We present AntigenLM, a generative DNA language model pretrained on influenza genomes with intact, aligned functional units. This structure-aware pretraining enables AntigenLM to capture evolutionary constraints and generalize across tasks. Fine-tuned on time-series hemagglutinin (HA) and neuraminidase (NA) sequences, AntigenLM accurately forecasts future antigenic variants across regions and subtypes, including those unseen during training, outperforming phylogenetic and evolution-based models. It also achieves near-perfect subtype classification. Ablation studies show that disrupting genomic structure through fragmentation or shuffling severely degrades performance, revealing the importance of preserving functional-unit integrity in DNA language modeling. AntigenLM thus provides both a powerful framework for antigen evolution prediction and a general principle for building biologically grounded DNA foundation models.
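Forecast quality is reported in terms of amino-acid mismatches between predicted and observed sequences. The sketch below shows one plausible way to compute such a count, assuming both sequences are in-frame coding DNA of equal translated length and are compared position by position; the paper's exact scoring procedure may differ (for example, it may align sequences first). The function names are illustrative.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate in-frame DNA; unknown codons (e.g. containing N) become 'X'."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3)
    )


def aa_mismatches(predicted_dna: str, observed_dna: str) -> int:
    """Count positions where the translated sequences differ.

    Assumes equal translated length (no indels); raises otherwise.
    """
    pred, obs = translate(predicted_dna), translate(observed_dna)
    if len(pred) != len(obs):
        raise ValueError("translated sequences differ in length")
    return sum(a != b for a, b in zip(pred, obs))


if __name__ == "__main__":
    # AAA (Lys) vs AGA (Arg): one amino-acid mismatch.
    print(aa_mismatches("ATGAAAGGG", "ATGAGAGGG"))  # prints 1
    # GGG vs GGA both encode Gly: a synonymous change, zero mismatches.
    print(aa_mismatches("ATGGGG", "ATGGGA"))  # prints 0
```

Scoring at the amino-acid level rather than the nucleotide level means synonymous substitutions are not penalized, which fits the emphasis on antigenic (protein-level) change.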
