BMFM-RNA: whole-cell expression decoding improves transcriptomic foundation models

Michael M. Danziger, Bharath Dandala, Viatcheslav Gurev, Matthew Madgwick, Sivan Ravid, Tim Rumbell, Akira Koseki, Tal Kozlovski, Ching-Huei Tsou, Ella Barkan, Tanwi Biswas, Jielin Xu, Yishai Shimoni, Jianying Hu, Michal Rosen-Zvi

Abstract

Transcriptomic foundation models pretrained with masked language modeling can achieve low pretraining loss yet produce poor cell representations for downstream tasks. We introduce whole-cell expression decoding (WCED), where models reconstruct the entire gene vocabulary from a single CLS token embedding, even with limited inputs, creating a maximally informative bottleneck. WCED consistently outperforms MLM on all downstream metrics despite higher reconstruction error during training. Gene-level error tracking reveals that both methods preferentially learn genes whose expression co-varies with stable transcriptional programs rather than those driven by transient factors. We further add hierarchical cross-entropy loss that exploits Cell Ontology structure for zero-shot annotation at multiple granularity levels. Models trained with these objectives achieve best overall performance across CZI benchmarks, on zero-shot batch integration and linear probing cell-type annotation. Methods are implemented in biomed-multi-omic ( https://github.com/BiomedSciAI/biomed-multi-omic ), an open-source framework for transcriptomic foundation model development.
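
As a rough illustration of the WCED objective, the sketch below computes a whole-cell decoding loss for one cell from a CLS embedding. It follows the description in the Figure 1 caption (two sets of logits the size of the gene vocabulary, scored with a binary loss and a mean squared error loss). Everything else is an assumption made for illustration: the function name `wced_loss`, the two single-layer linear heads, treating the binary target as "expression greater than zero", averaging the MSE over all genes, and summing the two losses with equal weight. It is not taken from the biomed-multi-omic codebase.

```python
import numpy as np

def wced_loss(cls_embedding, expression, w_bin, b_bin, w_val, b_val):
    """Hypothetical whole-cell decoding loss for a single cell.

    cls_embedding: (d,) CLS token embedding produced by the encoder.
    expression:    (n_genes,) target expression for every gene in the
                   vocabulary, including genes that were not in the input.
    w_*, b_*:      parameters of two assumed linear heads, shapes
                   (d, n_genes) and (n_genes,).
    """
    # Head 1 (assumed): logit for "is this gene expressed at all?"
    bin_logits = cls_embedding @ w_bin + b_bin
    # Head 2 (assumed): predicted expression level for each gene
    val_pred = cls_embedding @ w_val + b_val

    is_expressed = (expression > 0).astype(float)

    # Numerically stable binary cross-entropy on logits:
    # max(x, 0) - x * y + log(1 + exp(-|x|))
    bce = (np.maximum(bin_logits, 0.0)
           - bin_logits * is_expressed
           + np.log1p(np.exp(-np.abs(bin_logits))))

    # Squared error over the whole vocabulary (assumed: no masking of zeros)
    mse = (val_pred - expression) ** 2

    # Equal weighting of the two terms is an assumption
    return bce.mean() + mse.mean()
```

Because both heads read only the CLS embedding, any gene the decoder reconstructs well must be encoded in that single vector, which is the bottleneck the abstract describes.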

Paper Structure

This paper contains 84 sections, 10 equations, 9 figures, 13 tables.

Figures (9)

  • Figure 1: Pretraining overview of BMFM-RNA models. A. Multitask architecture design. B. Whole-cell expression decoder (WCED) training objective. Part of the cell's expression data is input to the model (green box) while some of it is held out (red box). The input sequence is passed through an encoder, which produces token-level embeddings. The CLS token embedding is passed to the WCED decoder, producing two sets of logits, each the size of the gene vocabulary. The expression levels of all genes are then used to calculate the binary and mean squared error losses. C. The package provides a simple scanpy-style API for zero-shot inference on single-cell RNA-seq data. The bmfm.inference() function generates cell embeddings and predictions directly from pretrained models hosted on HuggingFace. The results are stored in the AnnData object, allowing users to integrate them with their existing workflows. D. Zero-shot cell type annotation across the CellXGene validation corpus. E. Hierarchy-aware (roll-up) zero-shot cell type annotation.
  • Figure 2: Gene-level error analysis reveals a hierarchy of learnability governed by contextual predictability. A. Distribution of rescaled MAE across 33 gene families, ordered by median improvement over a null baseline (predicting the validation mean). Families cluster into four tiers: Tier 1 (green; mitochondrial-encoded, cytosolic ribosomal, MHC) achieves 20--45% improvement; Tier 4 (red; zinc-finger TFs, histones) achieves <5%. Violin width reflects gene count. Gene family definitions are in Supplementary Table \ref{tab:gene_families}. B. Per-family training dynamics showing median rescaled MAE over 10 epochs. Constitutive families (mitochondrial genome, cytosolic ribosomes) converge within the first epoch, while most families plateau by epoch 2, indicating that the learnability hierarchy reflects data structure rather than optimization dynamics. C. Winner and loser genes identified by heteroscedastic outlier detection (Supplementary Section \ref{sec:supp-winners}). After fitting an isotonic regression of rescaled MAE on zero fraction with LOWESS-estimated local variance, genes with $|Z| > 2$ and >10% practical significance are flagged as winners (green; $n = 909$) or losers (red; $n = 6$). Classification is performed across the full sparsity range, but biological interpretation focuses on the sparse regime (zero fraction $> 0.5$), where functionally diverse local comparison groups make Z-scores most meaningful. Winners overwhelmingly cluster in this regime, corresponding to cell-type markers whose expression is contextually predictable despite extreme sparsity. The few losers all fall in the dense regime, where interpretability is limited (see Supplementary Section \ref{sec:supp-winners}). D. Winner enrichment by expression program (Supplementary Section \ref{sec:axis2-regime}), showing the percentage of genes in each program classified as winners for both WCED and MLM. Identity-associated genes are strongly enriched (WCED: OR $= 2.36$, $p = 7.1 \times 10^{-24}$; MLM: OR $= 2.92$, $p = 5.3 \times 10^{-8}$), while cell-cycle genes (OR $= 0.17$) and dissociation-response genes (OR $= 0.69$, n.s.) are not, confirming that contextual predictability, not biological function per se, determines relative learnability. Note that the low constitutive winner rate reflects the dense regime, where Z-score comparisons are made within a narrow, homogeneous gene population (see Supplementary Section \ref{sec:supp-winners}).
  • Figure 3: Zero-shot predictions of Cell Ontology labels for the Immune Atlas dataset. A. Sankey diagram showing the 15 most frequent labels in the test data (left) mapped to the 15 most frequent zero-shot labels from the model (right). 'Other' pools all remaining labels in the test dataset. Flow colors show exact label matches (green), descendant label matches (blue), label misses (red), and other (grey). B. Graph of nodes in the Cell Ontology with mean probability $>0.035$ for cells with ground truth label "central memory CD4-positive, alpha-beta T cell" (2327 cells). The model predicts leaf node (bold outlines) probabilities directly, and each non-leaf node probability is the sum over all of its leaf node descendants. Darker shades indicate higher probabilities; hue indicates an exact match (green), an ancestor match (purple), or a miss (red). C. Flow of labels assigned to cells, starting from the ontology root ('cell', left) and iteratively selecting the highest-probability child for each cell until a leaf node (right) is reached.
  • Figure 4: Cross-benchmark evaluation of cell embedding models on CZ CELLxGENE clustering and classification tasks. (a) Per-tissue $z$-scores relative to the field mean for unsupervised clustering quality (Avg Bio = mean of ARI, NMI, and silhouette score; solid bars) and supervised cell type classification (Macro F1; hatched bars) across five Tabula Sapiens v2 tissues [tabula_sapiens_v2]. All results were obtained using the CZ CELLxGENE benchmarking suite [cz-benchmarks] with published data splits and evaluation protocols. Individual dots show per-tissue $z$-scores; diamond and circle markers connected by a dashed line show the joint $z$-score (mean of Avg Bio and Macro F1), which determines the vertical ordering. BMFM-RNA multitask checkpoints (32--79M parameters) occupy the top ranks, with the WCED+MLM multitask concatenation achieving the highest joint score. TranscriptFormer variants [pearce_cross-species_2025] (368--542M parameters) rank highly on classification but fall below the field mean on clustering, consistent with a 2048-dimensional representation that preserves input signal without reorganizing it into biologically structured geometry. (b, c) Performance versus model size for Avg Bio (b) and Macro F1 (c). The Pareto front (coral line and rings) traces the best performance achievable at each parameter budget. BMFM-RNA checkpoints define the Pareto front for Avg Bio above scVI [lopez2018deep], achieving the highest clustering scores in the comparison with 6--17x fewer parameters than TranscriptFormer. On Macro F1, BMFM-RNA (WCED+MLM multitask concatenation) reaches the Pareto front at 79M parameters, matching TranscriptFormer's classification accuracy at roughly one-fifth the model size.
  • Figure 5: Cell type annotation fine-tuning results. A. F1 scores of classifiers for 9 scEval datasets split by batch, with 95% binomial confidence intervals. For each dataset, we fine-tune each model on the cell type annotation task with an unfrozen encoder for five epochs and compare it to an SGD classifier trained on extracted model embeddings. The fine-tuned models perform better, with an average F1 improvement of 0.028 for MLM MULTITASK and 0.039 for WCED MULTITASK. B. Classification F1 score, with 95% binomial confidence intervals, for fine-tuning BMFM-RNA models on the Myeloid and Multiple Sclerosis datasets reported by scGPT. C. Classification accuracy, with 95% binomial confidence intervals, for the same fine-tuning setup as in B. Results were calculated on the same splits used by scGPT, based on the files shared with their paper.
  • ...and 4 more figures
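
The roll-up described in the Figure 3 caption (non-leaf probabilities as sums over leaf descendants, then a greedy descent from the root to a leaf) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the `children` adjacency dictionary, and the three-leaf toy ontology are hypothetical and are not drawn from the Cell Ontology or from the biomed-multi-omic package. Leaf descendants are collected as a set so that a leaf reachable through several parents is counted once.

```python
def leaf_descendants(node, children, cache):
    """Return the set of leaves reachable from `node` (the node itself if it is a leaf)."""
    if node in cache:
        return cache[node]
    kids = children.get(node, [])
    if not kids:
        result = frozenset([node])
    else:
        # Union of sets avoids double-counting leaves shared by several branches
        result = frozenset().union(
            *(leaf_descendants(k, children, cache) for k in kids)
        )
    cache[node] = result
    return result


def node_probability(node, children, leaf_probs, cache=None):
    """Probability of a node = sum of the probabilities of its leaf descendants."""
    cache = {} if cache is None else cache
    return sum(leaf_probs.get(leaf, 0.0)
               for leaf in leaf_descendants(node, children, cache))


def greedy_path(root, children, leaf_probs):
    """Walk from the root, picking the highest-probability child until a leaf is reached."""
    cache = {}
    path = [root]
    node = root
    while children.get(node):
        node = max(children[node],
                   key=lambda c: node_probability(c, children, leaf_probs, cache))
        path.append(node)
    return path


# Toy ontology and per-cell leaf probabilities (hypothetical values)
children = {
    "cell": ["T cell", "B cell"],
    "T cell": ["CD4 T cell", "CD8 T cell"],
}
leaf_probs = {"CD4 T cell": 0.5, "CD8 T cell": 0.2, "B cell": 0.3}

path = greedy_path("cell", children, leaf_probs)
```

Stopping the walk at an intermediate node instead of continuing to a leaf would give an annotation at a coarser granularity level.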