BMFM-RNA: whole-cell expression decoding improves transcriptomic foundation models

Michael M. Danziger; Bharath Dandala; Viatcheslav Gurev; Matthew Madgwick; Sivan Ravid; Tim Rumbell; Akira Koseki; Tal Kozlovski; Ching-Huei Tsou; Ella Barkan; Tanwi Biswas; Jielin Xu; Yishai Shimoni; Jianying Hu; Michal Rosen-Zvi

Abstract

Transcriptomic foundation models pretrained with masked language modeling (MLM) can achieve low pretraining loss yet produce poor cell representations for downstream tasks. We introduce whole-cell expression decoding (WCED), in which the model reconstructs expression for the entire gene vocabulary from a single CLS token embedding, even when given limited inputs, creating a maximally informative bottleneck. WCED consistently outperforms MLM on all downstream metrics despite higher reconstruction error during training. Gene-level error tracking reveals that both methods preferentially learn genes whose expression co-varies with stable transcriptional programs rather than genes driven by transient factors. We further add a hierarchical cross-entropy loss that exploits the structure of the Cell Ontology for zero-shot annotation at multiple levels of granularity. Models trained with these objectives achieve the best overall performance across CZI benchmarks, on both zero-shot batch integration and linear-probing cell-type annotation. Methods are implemented in biomed-multi-omic ( https://github.com/BiomedSciAI/biomed-multi-omic ), an open-source framework for transcriptomic foundation model development.
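
To make the WCED bottleneck concrete, the following is a minimal NumPy sketch and not the biomed-multi-omic implementation. It assumes a placeholder encoder (`toy_encoder`, hypothetical), a single linear decoding head, log1p-transformed counts, and a mean-squared-error loss; all sizes and names are illustrative assumptions. The point it shows is that the loss covers every gene in the vocabulary, while the encoder sees only a subset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only; none of these come from the paper.
n_genes = 1000   # size of the gene vocabulary
d_model = 64     # width of the CLS embedding
n_input = 128    # number of genes actually shown to the encoder

# Stand-in for one cell's expression profile over the full vocabulary.
counts = rng.poisson(1.0, size=n_genes).astype(np.float64)
log_expr = np.log1p(counts)

# The encoder sees only a limited subset of genes.
input_idx = rng.choice(n_genes, size=n_input, replace=False)

def toy_encoder(gene_idx, values, w_embed):
    """Hypothetical placeholder for a transformer encoder.

    Returns one vector playing the role of the CLS embedding:
    the expression-weighted mean of the input gene embeddings.
    """
    return (w_embed[gene_idx] * values[:, None]).mean(axis=0)

# Randomly initialised parameters (assumed shapes).
w_embed = rng.normal(scale=0.1, size=(n_genes, d_model))
w_dec = rng.normal(scale=0.1, size=(d_model, n_genes))
b_dec = np.zeros(n_genes)

# Single-vector bottleneck: everything the decoder knows about the cell.
cls = toy_encoder(input_idx, log_expr[input_idx], w_embed)

# WCED-style head: predict expression for every gene from the CLS vector.
pred_all = cls @ w_dec + b_dec

# Loss over the whole vocabulary, including genes the encoder never saw.
# MSE is an assumption made for this sketch.
wced_loss = float(np.mean((pred_all - log_expr) ** 2))
```

In this sketch the decoder has no access to per-gene token embeddings, so any information needed to reconstruct unseen genes has to pass through `cls`; this is the sense in which the abstract calls the bottleneck maximally informative.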
