Transformer-Based Representation Learning for Robust Gene Expression Modeling and Cancer Prognosis

Shuai Jiang, Saeed Hassanpour

TL;DR

GexBERT tackles the challenges of high dimensionality, sparsity, and missing values in bulk RNA-seq data by introducing a transformer-based autoencoder pretrained with a masking/restoration objective to learn context-aware gene representations. It demonstrates strong performance on three cancer-relevant tasks: pan-cancer classification from limited gene subsets, cancer-specific survival prediction through restoration of prognostic anchor genes, and robust missing-value imputation, with attention analyses revealing biologically meaningful gene patterns. The approach yields scalable, data-efficient, and interpretable transcriptomic modeling that can reduce profiling costs and support analyses in data-limited clinical settings. Limitations include potential generalization gaps across independent datasets and the assumption that values are missing completely at random (MCAR), suggesting avenues for cross-platform validation and integration with multi-omics data.

Abstract

Transformer-based models have achieved remarkable success in natural language and vision tasks, but their application to gene expression analysis remains limited due to data sparsity, high dimensionality, and missing values. We present GexBERT, a transformer-based autoencoder framework for robust representation learning of gene expression data. GexBERT learns context-aware gene embeddings by pretraining on large-scale transcriptomic profiles with a masking and restoration objective that captures co-expression relationships among thousands of genes. We evaluate GexBERT across three critical tasks in cancer research: pan-cancer classification, cancer-specific survival prediction, and missing value imputation. GexBERT achieves state-of-the-art classification accuracy from limited gene subsets, improves survival prediction by restoring expression of prognostic anchor genes, and outperforms conventional imputation methods under high missingness. Furthermore, its attention-based interpretability reveals biologically meaningful gene patterns across cancer types. These findings demonstrate the utility of GexBERT as a scalable and effective tool for gene expression modeling, with translational potential in settings where gene coverage is limited or incomplete.
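
The masking and restoration objective can be illustrated with a minimal sketch: for each sample, one subset of genes is given to the model as input and a disjoint subset is held out as restoration targets. The function names (`make_pretraining_pair`, `restoration_mse`), the expression values, and the use of mean squared error as the loss are assumptions made for illustration; the abstract does not specify them, and a constant predictor stands in for the transformer.

```python
import random


def make_pretraining_pair(expression, n_input, n_restore, rng):
    """Split one sample's genes into disjoint input and restoration subsets.

    `expression` maps gene name -> expression value (hypothetical layout).
    """
    if n_input + n_restore > len(expression):
        raise ValueError("input and restoration subsets must fit in the gene list")
    # Draw both subsets in one call so they cannot overlap.
    genes = rng.sample(sorted(expression), n_input + n_restore)
    input_genes, target_genes = genes[:n_input], genes[n_input:]
    inputs = {g: expression[g] for g in input_genes}
    targets = {g: expression[g] for g in target_genes}
    return inputs, targets


def restoration_mse(predicted, targets):
    """Mean squared error over the held-out genes (assumed loss, not confirmed)."""
    return sum((predicted[g] - targets[g]) ** 2 for g in targets) / len(targets)


# Toy sample with made-up expression values.
sample = {"TP53": 5.1, "BRCA1": 3.2, "EGFR": 7.8, "MYC": 6.4, "KRAS": 4.0, "PTEN": 2.9}
rng = random.Random(0)
inputs, targets = make_pretraining_pair(sample, n_input=3, n_restore=2, rng=rng)

# Placeholder "model": predict the mean of the input genes for every target gene.
baseline = sum(inputs.values()) / len(inputs)
predicted = {g: baseline for g in targets}
loss = restoration_mse(predicted, targets)
```

In an actual pretraining loop the placeholder predictor would be replaced by the encoder-decoder, and the loss would be minimized over many random input/target splits per sample.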

Paper Structure

This paper contains 29 sections, 2 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Overview of the GexBERT framework. a) Pretraining: The transformer learns gene co-expression patterns by encoding a subset of input genes and restoring another subset. b) Cancer classification: The encoder extracts gene expression representations for tumor type prediction. c) Survival prediction: The model restores prognostic anchor gene expression, which is used in a Cox proportional hazards (CoxPH) model. d) Missing value imputation: The model imputes missing gene expression values to enhance downstream survival prediction.
  • Figure 2: UMAP visualization of GexBERT-generated embeddings. Low-dimensional projection of summary embeddings extracted from GexBERT with 4,096 input genes, showing distinct clustering patterns across cancer types.
  • Figure 3: Impact of restored anchor genes on cancer-specific survival prediction. Average concordance index (C-index) across different gene set sizes for three methods: using only original input genes (red), using restored anchor gene expression (green), and combining both (blue). The * symbol indicates statistically significant improvement of the "Both" method over the "Original Input" method.
  • Figure 4: Performance comparison of missing data imputation methods. C-index scores for survival prediction across different missing rates and gene set sizes, using GexBERT (red), KNN (green), mean imputation (blue), and MICE (purple). GexBERT consistently maintains higher predictive performance, demonstrating robustness to missing values compared to traditional imputation methods.
  • Figure 5: Gene attention patterns across cancer types. UMAP visualization of the top 512 genes with the highest attention weights for each cancer type, as identified by GexBERT. Darker red indicates higher attention values, highlighting key genes that contribute to transcriptomic differentiation.
  • ...and 2 more figures
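
Figures 3 and 4 report survival performance as a concordance index (C-index). As a reading aid, the following is a minimal sketch of Harrell's C-index in plain Python. The function name and toy data are illustrative, and the paper may compute the metric with a library and different tie handling.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index: fraction of comparable pairs ranked correctly.

    times  : observed follow-up times
    events : 1 if the event (e.g. death) was observed, 0 if censored
    risks  : predicted risk scores (higher = expected to fail sooner)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when subject i had an observed event
            # strictly before subject j's observed time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy cohort: four patients, the third is censored.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
perfect = concordance_index(times, events, [0.9, 0.7, 0.3, 0.1])
reversed_order = concordance_index(times, events, [0.1, 0.3, 0.7, 0.9])
uninformative = concordance_index(times, events, [0.5, 0.5, 0.5, 0.5])
```

A value of 1.0 means every comparable pair is ordered correctly, 0.5 corresponds to an uninformative score, and 0.0 means every pair is ordered incorrectly.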