xTrimoGene: An Efficient and Scalable Representation Learner for Single-Cell RNA-Seq Data

Jing Gong; Minsheng Hao; Xingyi Cheng; Xin Zeng; Chiming Liu; Jianzhu Ma; Xuegong Zhang; Taifeng Wang; Le Song

TL;DR

xTrimoGene tackles the challenge of learning from massive, sparse scRNA-seq data by introducing an asymmetric encoder-decoder Transformer that exploits sparsity to dramatically reduce computation while preserving high-resolution continuous expression semantics. It employs an auto-discretization strategy and a regression-based masking objective to pre-train on billions of gene tokens, enabling scalable models up to ~100M parameters. The approach achieves state-of-the-art results on cell type annotation, perturbation response prediction, and drug synergy tasks, and demonstrates strong robustness to high sparsity and scalable training dynamics. This work enables efficient, large-scale representation learning for single-cell biology and offers a practical service for downstream scRNA-seq analyses.

Abstract

Advances in high-throughput sequencing technology have led to significant progress in measuring gene expression at the single-cell level. The amount of publicly available single-cell RNA-seq (scRNA-seq) data already surpasses 50M records for humans, with each record measuring 20,000 genes. This highlights the need for unsupervised representation learning to fully ingest these data, yet classical transformer architectures are prohibitively expensive to train on such data in terms of both computation and memory. To address this challenge, we propose a novel asymmetric encoder-decoder transformer for scRNA-seq data, called xTrimoGene$^\alpha$ (or xTrimoGene for short), which leverages the sparse characteristic of the data to scale up the pre-training. This scalable design reduces FLOPs by one to two orders of magnitude compared to classical transformers while maintaining high accuracy, enabling us to train the largest transformer models over the largest scRNA-seq dataset today. Our experiments also show that the performance of xTrimoGene improves as we scale up the model size, and that it achieves SOTA performance on various downstream tasks, such as cell type annotation, perturb-seq effect prediction, and drug combination prediction. The xTrimoGene model is now available as a service via the following link: https://api.biomap.com/xTrimoGene/apply.
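
As a rough illustration of why skipping zero-valued entries matters, the sketch below counts the query-key pairs in one full self-attention layer, a quantity that grows quadratically with sequence length. The 20,000 genes per record comes from the abstract; the 2,000 non-zero genes per cell is an assumed figure chosen only for illustration, and the function name is hypothetical. This is not the paper's FLOPs accounting, which would also have to cover the decoder and the non-attention layers.

```python
def attention_pair_count(seq_len: int) -> int:
    """Number of query-key pairs scored by one full self-attention layer."""
    return seq_len * seq_len


n_genes = 20_000    # genes per record, as stated in the abstract
n_nonzero = 2_000   # ASSUMPTION: non-zero genes in one cell (illustrative only)

full_pairs = attention_pair_count(n_genes)      # attend over every gene
sparse_pairs = attention_pair_count(n_nonzero)  # attend over non-zero genes only
reduction = full_pairs / sparse_pairs

print(f"full: {full_pairs:,} pairs, sparse: {sparse_pairs:,} pairs, "
      f"reduction: {reduction:.0f}x")
```

Under this assumption, a tenfold shorter encoder input yields a hundredfold drop in attention pairs, which is consistent in spirit with the "one to two orders of magnitude" claim, though the real saving depends on the actual sparsity and the full architecture.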

Paper Structure (18 sections, 5 equations, 4 figures, 2 tables)

Introduction
Characteristics of Single-Cell RNA-seq Data
xTrimoGene Architecture
Encoder
Decoder
Auto-discretization strategy
Training Strategy
Regression masked task
Masking strategy
Experiments
Computational efficiency
Scalability
Robustness on high sparse data
Evaluation on downstream tasks
Cell type annotation
...and 3 more sections

Figures (4)

Figure 1: The xTrimoGene Framework: (1) Random positions (including both zero and non-zero values) are masked for prediction. (2) Masked and zero-valued positions are filtered out. (3) Remaining unmasked positions are aligned with padding tokens (grey) to ensure maximum length consistency within a batch. (4) Gene expression values and gene embeddings are separately projected into embeddings. (5) These two embeddings are element-wise added. (6) The resulting input is fed into the encoder. (7) The intermediate encoder embedding is combined with embeddings for masked positions and zero embeddings. (8) This combined representation is then fed into the decoder. (9) Decoder embedding is projected to model output with an MLP layer. The MSE loss is calculated between the model output and ground truth values for the masked positions.
Figure 2: Pre-training strategy ablation study. (A) Performance comparison between the auto-discretization strategy and other binning methods for expression value projection. The cell clustering task is evaluated and five metrics are displayed: Adjusted Rand Index (ARI), Normalized Mutual Information (NMI), Homogeneity (HOMO), Completeness (CP), and Silhouette Coefficient (SIL). (B) Performance of pre-trained models with different task modes, including regression and classification settings. The cell clustering task is evaluated. See the main text for details.
Figure 3: Comparison of performance on data with different sparse levels. (A) xTrimoGene performance for recovering masked values at different sparse levels. Each dot represents a subset defined by cell type. Sparse level is calculated as the ratio between zero value percentages. The Pearson correlation coefficient is calculated on masked positions. (B) Performance comparison of xTrimoGene and Performer when recovering masked values at different sparse levels. Dots have the same meaning as in (A), but dot size is proportional to the sparse level. The x and y axes each show the Pearson correlation coefficient for one algorithm. (C) Comparison of performance for the xTrimoGene framework and an encoder-only framework. The cell clustering task is evaluated.
Figure 4: (A) The MSE of the top 20 differentially expressed (DE) genes given by different models on perturbation response prediction. The top 20 DE genes are calculated between the pre- and post-perturbation expression profiles. "Total" denotes evaluation on all test perturbation sets. "1-gene" denotes evaluation on the single-gene perturbation subset, where the perturbed target is not seen in the training set. "2-gene" denotes the test subset in which two genes are perturbed simultaneously. "seen0", "seen1" and "seen2" denote that zero, one or two perturbed targets are not seen in the training set, respectively. The black line denotes a 95% confidence interval. (B) ROC curves of different models on the drug combination synergy prediction task. xTrimoGene denotes replacing the raw expression profile with context embeddings in the DeepDDS framework, with all other components unchanged. Refer to App. 8.3 for more details.
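
Steps (1) to (3) of the Figure 1 caption (mask random positions, drop masked and zero-valued positions, pad to a common length within the batch) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `PAD` marker, the 25% mask ratio, and the toy expression vectors are all hypothetical.

```python
import random

PAD = None  # hypothetical padding marker for step (3)


def mask_and_filter(expr, mask_ratio, rng):
    """Steps (1)-(2): mask random positions, then keep unmasked non-zero ones."""
    n = len(expr)
    n_mask = int(n * mask_ratio)
    # Step (1): sample over all positions, so zeros and non-zeros can both be masked.
    masked = set(rng.sample(range(n), n_mask))
    # Step (2): filter out masked positions and zero-valued positions.
    kept = [(i, v) for i, v in enumerate(expr) if i not in masked and v != 0]
    return masked, kept


def pad_batch(batch_kept):
    """Step (3): pad every cell's kept list to the longest one in the batch."""
    max_len = max(len(kept) for kept in batch_kept)
    return [kept + [PAD] * (max_len - len(kept)) for kept in batch_kept]


rng = random.Random(0)
cells = [                       # toy expression vectors (assumed data)
    [0, 3.0, 0, 1.5, 0, 0, 2.0, 0],
    [0, 0, 0, 4.0, 0, 0, 0, 0],
]
results = [mask_and_filter(cell, mask_ratio=0.25, rng=rng) for cell in cells]
encoder_inputs = pad_batch([kept for _, kept in results])

for (masked, _), row in zip(results, encoder_inputs):
    print("masked positions:", sorted(masked), "| encoder input:", row)
```

Each kept entry is a (gene index, expression value) pair, which corresponds to the two quantities that steps (4) and (5) embed separately and then add. The encoder input length is bounded by the number of non-zero genes rather than the full gene count, which is where the computational saving comes from.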
