HR-VILAGE-3K3M: A Human Respiratory Viral Immunization Longitudinal Gene Expression Dataset for Systems Immunity

Xuejun Sun, Yiran Song, Xiaochen Zhou, Ruilie Cai, Yu Zhang, Xinyi Li, Rui Peng, Jialiu Xie, Yuanyuan Yan, Muyao Tang, Prem Lakshmanane, Baiming Zou, James S. Hagood, Raymond J. Pickles, Didong Li, Fei Zou, Xiaojing Zheng

TL;DR

HR-VILAGE-3K3M introduces the largest longitudinal transcriptomic resource for human respiratory viral immunization by integrating 14,136 expression profiles from 3,178 individuals across 66 studies, spanning microarray, bulk RNA-seq, and scRNA-seq from blood and nasal tissues. The authors implement rigorous metadata harmonization, HGNC-aligned gene annotations, and standardized preprocessing, creating an AI-ready benchmark hosted on Hugging Face for cross-study analyses and method benchmarking. They demonstrate utility with batch-corrected predictive modeling of vaccine responders and combined bulk-scRNA analyses for cell-type dynamics, illustrating the platform's potential for multimodal learning and foundation-model pretraining in systems immunology. While powerful, the resource focuses on transcriptomics and omits multi-omics layers and complete antibody data, suggesting future extensions to broaden applicability and mechanistic insight in infectious disease research.

Abstract

Respiratory viral infections pose a global health burden, yet the cellular immune responses driving protection or pathology remain unclear. Natural infection cohorts often lack pre-exposure baseline data and structured temporal sampling. In contrast, inoculation and vaccination trials generate insightful longitudinal transcriptomic data. However, the scattering of these datasets across platforms, along with inconsistent metadata and preprocessing procedures, hinders AI-driven discovery. To address these challenges, we developed the Human Respiratory Viral Immunization LongitudinAl Gene Expression (HR-VILAGE-3K3M) repository: an AI-ready, rigorously curated dataset that integrates 14,136 RNA-seq profiles from 3,178 subjects across 66 studies encompassing over 2.56 million cells. Spanning vaccination, inoculation, and mixed exposures, the dataset includes microarray, bulk RNA-seq, and single-cell RNA-seq from whole blood, PBMCs, and nasal swabs, sourced from GEO, ImmPort, and ArrayExpress. We harmonized subject-level metadata, standardized outcome measures, applied unified preprocessing pipelines with rigorous quality control, and aligned all data to official gene symbols. To demonstrate the utility of HR-VILAGE-3K3M, we performed predictive modeling of vaccine responders and evaluated batch-effect correction methods. Beyond these initial demonstrations, it supports diverse systems immunology applications and benchmarking of feature selection and transfer learning algorithms. Its scale and heterogeneity also make it ideal for pretraining foundation models of the human immune response and for advancing multimodal learning frameworks. As the largest longitudinal transcriptomic resource for human respiratory viral immunization, it provides an accessible platform for reproducible AI-driven research, accelerating systems immunology and vaccine development against emerging viral threats.

Paper Structure

This paper contains 11 sections, 4 figures.

Figures (4)

  • Figure 1: Overview of HR-VILAGE-3K3M. (a) HR-VILAGE-3K3M construction workflow. (b) Distribution of sample timepoints for vaccine and inoculation studies, shown separately for bulk RNA-seq and single-cell RNA-seq datasets. (c) Composition of the dataset, stratified by platform, tissue type, study type, and pathogen, including both bulk and single-cell transcriptomic studies.
  • Figure 2: Evaluation of Batch Effect Correction and Predictive Modeling on the HR-VILAGE-3K3M Dataset. (a) Batch Effect Correction Visualization. t-SNE plots display sample clustering before and after batch effect correction using QN, Regression, and ComBat. Points are colored by responder status (left panels) or study ID (right panels) to assess preservation of biological signal and reduction of batch-specific variation. (b) Antibody Responder Prediction Performance. Bar plots show mean accuracy, AUC, and F1 score across five modeling approaches (PCA-Logistic, RNN, LSTM, GRU, and Transformer) under three batch correction methods (QN, Regression, ComBat). Results are averaged across stratified 5-fold cross-validation and six random seeds.
  • Figure 3: Comparison of immune cell-type composition and transcriptional activity using paired single-cell and bulk RNA-seq. (a) UMAP visualization of scRNA-seq data showing major immune cell types, colored by cell type annotation. (b) Bar plot of estimated immune cell-type proportions at Day 1 and Day 7 derived from single-cell RNA-seq data. (c) Bar plot of xCell enrichment scores for the same cell types based on bulk RNA-seq data at Day 1 and Day 7.
  • Figure 4: Potential tasks using HR-VILAGE-3K3M. (a) Change point detection. (b) Gene expression imputation. (c) Causal inference. (d) Deconvolution. (e) Building prediction model. (f) Pretraining and fine-tuning.
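
As an illustration of the simplest correction method named in the Figure 2 caption, the following is a minimal sketch of quantile normalization, assuming "QN" denotes that technique. It is not the authors' pipeline: the function name `quantile_normalize`, the toy matrix, and the tie-handling rule (ties broken by row order rather than averaged) are assumptions made for illustration only.

```python
from statistics import mean


def quantile_normalize(matrix):
    """Quantile-normalize a genes x samples matrix given as a list of rows.

    Every sample (column) is mapped onto one shared reference distribution:
    the mean of the sorted columns. Ties within a column are broken by row
    order, a simplification relative to tie-averaging implementations.
    """
    n_genes = len(matrix)
    n_samples = len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_samples)]

    # Reference distribution: mean across samples at each rank.
    sorted_cols = [sorted(col) for col in columns]
    reference = [mean(sc[r] for sc in sorted_cols) for r in range(n_genes)]

    normalized = [[0.0] * n_samples for _ in range(n_genes)]
    for j, col in enumerate(columns):
        # Gene indices of this sample ordered from lowest to highest value.
        order = sorted(range(n_genes), key=lambda i: col[i])
        for rank, i in enumerate(order):
            normalized[i][j] = reference[rank]
    return normalized


# Hypothetical toy data: 4 genes x 3 samples, third sample shifted upward
# to mimic a study-level (batch) offset.
toy = [
    [5.0, 4.0, 13.0],
    [2.0, 1.0, 14.0],
    [3.0, 4.5, 16.0],
    [4.0, 2.0, 18.0],
]
qn = quantile_normalize(toy)
for row in qn:
    print([round(v, 3) for v in row])
```

After normalization all three samples contain the same set of values, while each gene keeps its within-sample rank. This removes distribution-level shifts between samples but does not model study membership explicitly, which is one reason a caption like Figure 2's would compare it against regression-based correction and ComBat.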