Transformer-Based Representation Learning for Robust Gene Expression Modeling and Cancer Prognosis
Shuai Jiang, Saeed Hassanpour
TL;DR
GexBERT tackles the high dimensionality, sparsity, and missing values of bulk RNA-seq data with a transformer-based autoencoder pretrained on a masking/restoration objective to learn context-aware gene representations. It demonstrates strong performance on three cancer-relevant tasks: pan-cancer classification from limited gene subsets, cancer-specific survival prediction through restoration of prognostic anchor genes, and robust missing-value imputation. Attention analyses reveal biologically meaningful gene patterns. The approach yields scalable, data-efficient, and interpretable transcriptomic modeling that can reduce profiling costs and support analyses in data-limited clinical settings. Limitations include potential generalization gaps across independent datasets and the assumption that values are missing completely at random (MCAR), which suggests cross-platform validation and integration with multi-omics data as avenues for future work.
Abstract
Transformer-based models have achieved remarkable success in natural language and vision tasks, but their application to gene expression analysis remains limited due to data sparsity, high dimensionality, and missing values. We present GexBERT, a transformer-based autoencoder framework for robust representation learning of gene expression data. GexBERT learns context-aware gene embeddings by pretraining on large-scale transcriptomic profiles with a masking and restoration objective that captures co-expression relationships among thousands of genes. We evaluate GexBERT across three critical tasks in cancer research: pan-cancer classification, cancer-specific survival prediction, and missing value imputation. GexBERT achieves state-of-the-art classification accuracy from limited gene subsets, improves survival prediction by restoring expression of prognostic anchor genes, and outperforms conventional imputation methods under high missingness. Furthermore, its attention-based interpretability reveals biologically meaningful gene patterns across cancer types. These findings demonstrate the utility of GexBERT as a scalable and effective tool for gene expression modeling, with translational potential in settings where gene coverage is limited or incomplete.
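To make the masking and restoration objective concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It hides a random subset of genes in one expression profile and scores a prediction only on the hidden genes. The mask fraction, the placeholder `MASK_VALUE`, the mean squared error loss, and the `mean_baseline` predictor (a stand-in for the transformer) are all assumptions made for illustration; the abstract does not specify them.

```python
import random

MASK_VALUE = 0.0  # assumed placeholder written over a masked gene's expression


def mask_genes(expression, mask_fraction, rng):
    """Hide a random subset of genes; return the masked profile and hidden indices."""
    n_masked = max(1, round(mask_fraction * len(expression)))
    masked_idx = sorted(rng.sample(range(len(expression)), n_masked))
    hidden = set(masked_idx)
    masked = [MASK_VALUE if i in hidden else x for i, x in enumerate(expression)]
    return masked, masked_idx


def restoration_loss(predicted, expression, masked_idx):
    """Mean squared error computed only over the masked genes (assumed loss)."""
    errors = [(predicted[i] - expression[i]) ** 2 for i in masked_idx]
    return sum(errors) / len(errors)


def mean_baseline(masked, masked_idx):
    """Hypothetical stand-in for the model: fill masked genes with the visible mean."""
    hidden = set(masked_idx)
    visible = [x for i, x in enumerate(masked) if i not in hidden]
    fill = sum(visible) / len(visible)
    return [fill if i in hidden else x for i, x in enumerate(masked)]


rng = random.Random(0)  # fixed seed so the toy example is repeatable
profile = [2.1, 0.0, 5.3, 1.7, 3.8, 0.4, 4.2, 2.9]  # toy log-expression values
masked_profile, masked_idx = mask_genes(profile, mask_fraction=0.25, rng=rng)
prediction = mean_baseline(masked_profile, masked_idx)
loss = restoration_loss(prediction, profile, masked_idx)
```

In a pretraining loop, a learned model would replace `mean_baseline`, and the loss would be minimized over many profiles so that the values of hidden genes are inferred from the co-expressed genes that remain visible.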
