AntigenLM: Structure-Aware DNA Language Modeling for Influenza

Yue Pei; Xuebin Chi; Yu Kang

TL;DR

AntigenLM addresses the challenge of forecasting rapidly evolving influenza by introducing a structure-aware, full-genome DNA language model pretrained on concatenated influenza A genomes with a fixed segment order. Its functional-unit encoding preserves genome-wide context, while sentinel tokens constrain segmentation during fine-tuning for forecasting and subtype classification. The model produces fewer amino-acid mismatches than state-of-the-art baselines across next-month and next-season forecasts, generalizes robustly across subtypes and geographic regions, and attains near-perfect subtype classification; ablations confirm the importance of preserving genome structure. This work provides a generalizable framework for biologically grounded DNA foundation models with direct implications for vaccine design and predictive genomics.

Abstract

Language models have advanced sequence analysis, yet DNA foundation models often lag behind task-specific methods for unclear reasons. We present AntigenLM, a generative DNA language model pretrained on influenza genomes with intact, aligned functional units. This structure-aware pretraining enables AntigenLM to capture evolutionary constraints and generalize across tasks. Fine-tuned on time-series hemagglutinin (HA) and neuraminidase (NA) sequences, AntigenLM accurately forecasts future antigenic variants across regions and subtypes, including those unseen during training, outperforming phylogenetic and evolution-based models. It also achieves near-perfect subtype classification. Ablation studies show that disrupting genomic structure through fragmentation or shuffling severely degrades performance, revealing the importance of preserving functional-unit integrity in DNA language modeling. AntigenLM thus provides both a powerful framework for antigen evolution prediction and a general principle for building biologically grounded DNA foundation models.

Paper Structure (40 sections, 4 equations, 7 figures, 4 tables)

This paper contains 40 sections, 4 equations, 7 figures, 4 tables.

Introduction
Related Work
Classical Evolutionary Forecasting Methods
Deep Learning–based Evolutionary Models
Genomic Language Models
General-purpose and Influenza-specific Protein Language Models
Method
Model Overview
Functional-Unit Encoding
Complexity and efficiency
Fine-tuning for forecasting and classification
Experiments
Datasets
Pretraining Setup
Fine-Tuning and Evaluation Tasks
...and 25 more sections

Figures (7)

Figure 1: Data Distribution and AntigenLM Architecture. (A) Global distribution of influenza A virus sequences used for pretraining, fine-tuning, and testing. Circle size reflects sample count; pie sectors show subtype composition. Dark circles represent pretraining data, light circles represent fine-tuning data, and red ticks mark test regions. (B) AntigenLM architecture. Schematic illustration of the pretraining and fine-tuning phases. The model uses a GPT-style Transformer as a shared backbone (6 layers, hidden dimension of 384, and 6 attention heads). Top (Pretraining): The backbone is pretrained on nucleotide sequences spanning all eight influenza gene segments (PB2 to NS). Bottom (Fine-tuning): The model is fine-tuned for two distinct tasks: viral evolution prediction (left), which uses an LM head to predict sequences for month $t+1$ based on historical strains (months $t-2$ to $t$) from the same region; and subtype classification (right), which uses a classification head to identify the virus subtype based on the HA and NA segments.
Figure 2: Pretraining input of AntigenLM and ablation variants. The standard Full-genome strategy (top) concatenates the eight segments from the same isolate in a fixed order. The Ablation variants (bottom) explore alternative input formatting: Segment-wise uses independent per-segment sequences; Incomplete-genome is generated by randomly cropping fixed-length windows from long concatenations, resulting in mixed segments; and Antigen-only models restrict input solely to the HA and NA segments using nucleotide and protein (amino acid) sequences, respectively. All the configurations except Antigen-only (protein) utilize nucleotide sequences.
Figure 3: AntigenLM Achieves the Lowest Amino-Acid Mismatch Across All Forecasting Tasks. (A) Next-month prediction: AntigenLM (full-genome pretraining) compared with ablation controls. (B) Next-season forecasting on post-2022 Japan data (with pre-2022 data included in fine-tuning): AntigenLM compared with baseline models. (C) Cross-subtype generalization in next-season forecasting: AntigenLM (full-genome pretraining) versus ablation controls for H7N9 prediction. (D) Geographic generalization: AntigenLM evaluated on U.S. data unseen during fine-tuning, compared with baseline models. Asterisks indicate statistical significance (t-test) between AntigenLM and beth-1: *$p < 10^{-3}$. Error bars show standard deviations.
Figure 4: Subtype classification performance of AntigenLM and ablation models. Row-normalized confusion matrices for subtype classification. AntigenLM (left) and Antigen-only (nucleotide) (right) show near-perfect diagonal dominance, indicating highly accurate classification, whereas Incomplete-genome and Segment-wise models (middle) show increased off-diagonal misclassifications, particularly for rare subtypes.
Figure 5: Pretraining loss versus steps for the three variants.
...and 2 more figures
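
The captions above describe two ideas that are easy to make concrete: the full-genome input (eight segments from one isolate concatenated in a fixed order) and the amino-acid mismatch metric used in the forecasting figures. The following is a minimal sketch, not the authors' implementation. The segment order is the conventional influenza A numbering (segments 1 to 8) and is assumed to match the paper's fixed order; the `<NAME>` sentinel format, the function names, and the handling of length differences in the mismatch count are all hypothetical choices made for illustration.

```python
from typing import Mapping

# Conventional influenza A segment numbering (1-8). Assumed, not confirmed,
# to be the fixed order used for AntigenLM pretraining.
SEGMENT_ORDER = ["PB2", "PB1", "PA", "HA", "NP", "NA", "MP", "NS"]


def build_full_genome_input(segments: Mapping[str, str]) -> str:
    """Concatenate all eight segments of one isolate in a fixed order.

    Each segment is prefixed with a sentinel such as "<HA>". The sentinel
    format is a hypothetical stand-in for the paper's sentinel tokens.
    """
    missing = [name for name in SEGMENT_ORDER if name not in segments]
    if missing:
        # A full-genome example needs every segment from the same isolate.
        raise ValueError(f"missing segments: {missing}")
    return "".join(f"<{name}>{segments[name].upper()}" for name in SEGMENT_ORDER)


def amino_acid_mismatches(predicted: str, observed: str) -> int:
    """Count position-wise mismatches between two protein sequences.

    Assumes the sequences are already aligned position by position; any
    length difference is counted as additional mismatches (an assumption).
    """
    paired = sum(a != b for a, b in zip(predicted, observed))
    return paired + abs(len(predicted) - len(observed))


if __name__ == "__main__":
    toy = {name: "acgt" for name in SEGMENT_ORDER}  # toy sequences only
    print(build_full_genome_input(toy))
    print(amino_acid_mismatches("MKTII", "MKAII"))
```

In this sketch, the Segment-wise ablation would correspond to feeding each `segments[name]` as a separate training example, whereas the full-genome strategy keeps all eight segments of an isolate in one sequence.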
