LangCell: Language-Cell Pre-training for Cell Identity Understanding

Suyuan Zhao, Jiahuan Zhang, Yushuai Wu, Yizhen Luo, Zaiqing Nie

TL;DR

LangCell introduces a Language-Cell pre-training framework that unifies single-cell transcriptomic data with natural language descriptions to learn cross-modal representations of cell identity. By jointly training four losses across a two-stage pre-training regime, LangCell achieves state-of-the-art zero-shot, few-shot, and fine-tuned performance on diverse tasks, including novel cell type identification, NSCLC subtype classification, and cell batch integration. The model transfers through a cell encoder and a text-aware multimodal module, which enables direct zero-shot inference and strong downstream performance with limited labeled data. This cross-modal approach is practically useful for rapid, scalable cell identity understanding and annotation in diverse biomedical contexts, though the authors acknowledge limitations related to text source diversity and omics coverage.

Abstract

Cell identity encompasses various semantic aspects of a cell, including cell type, pathway information, disease information, and more, which are essential for biologists to gain insights into its biological characteristics. Understanding cell identity from transcriptomic data, such as annotating cell types, has become an important task in bioinformatics. As these semantic aspects are determined by human experts, it is impossible for AI models to effectively carry out cell identity understanding tasks without the supervision signals provided by single-cell and label pairs. The single-cell pre-trained language models (PLMs) currently used for this task are trained on only a single modality, transcriptomic data, and lack an understanding of cell identity knowledge. As a result, they have to be fine-tuned for downstream tasks and struggle when labeled data with the desired semantic labels is lacking. To address this issue, we propose an innovative solution by constructing a unified representation of single-cell data and natural language during the pre-training phase, allowing the model to directly incorporate insights related to cell identity. More specifically, we introduce $\textbf{LangCell}$, the first $\textbf{Lang}$uage-$\textbf{Cell}$ pre-training framework. LangCell utilizes texts enriched with cell identity information to gain a profound comprehension of cross-modal knowledge. Results from experiments conducted on different benchmarks show that LangCell is the only single-cell PLM that can work effectively in zero-shot cell identity understanding scenarios, and also significantly outperforms existing models in few-shot and fine-tuning cell identity understanding scenarios.

Paper Structure (60 sections, 14 equations, 14 figures, 18 tables)

Figures (14)

  • Figure 1: (a). Plots of zero- and few-shot cell type annotation. The curve shows the average F$_1$-scores on PBMC10K and PBMC3&68K, for two settings of LangCell and three of the best single-cell PLMs. (b). UMAP plot of embeddings for scRNA-seq data and descriptions of three similar cell types in PBMC10K. LangCell aligns single-cell and text embeddings.
  • Figure 2: The schematic overview of LangCell. For each single-cell sample, consisting of paired scRNA-seq data and metadata: (1) During preprocessing, the scRNA-seq data is converted into a gene sequence arranged in descending order of relative expression levels, and a multi-perspective textual description of the cell is obtained from the metadata using OBO Foundry. (2) The embeddings of the cell and text are obtained using the cell encoder ($\bm{f}$) and the unimodal mode ($\bm{g_1}$) of the text encoder, and the matching score $p_{c,t}$ is calculated using the multimodal mode ($\bm{g_2}$) of the text encoder. (3) Pre-training is conducted through joint optimization of four loss functions. Among them, Masked Gene Modeling (MGM) and Cell-Cell Contrastive Learning (C-C) aim to enhance single-cell representation learning. In contrast, Cell-Text Contrastive Learning (C-T) and Cell-Text Matching (CTM) aim to train the model to understand the latent connections between single-cell and textual data.
  • Figure 3: The application of LangCell in zero-shot cell identity understanding. LangCell obtains Similarity Scores using the shared embedding space of cell and text data, obtains Cell-Text Matching Scores through the matching module, and considers these comprehensively to obtain the final classification logits. In the figure, the symbol $\oplus$ represents the weighted sum after the Softmax operation.
  • Figure 4: Results of cell-text retrieval. Zero-shot LangCell surpasses BioTranslator trained on up to 30% of the 161 types.
  • Figure 5: UMAP plot of embeddings for scRNA-seq data and descriptions of two NSCLC subtypes.
  • ...and 9 more figures
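
The preprocessing step in the Figure 2 caption (turning scRNA-seq counts into a gene sequence sorted by descending relative expression) can be illustrated with a minimal sketch. This is not the authors' code. The function name `rank_genes`, the per-gene reference values `gene_medians`, the choice of dividing the cell's expression fractions by those values to obtain "relative expression", and the `max_len` cutoff are all assumptions made for illustration.

```python
import numpy as np

def rank_genes(counts, gene_names, gene_medians, max_len=2048):
    """Return gene names ordered by descending relative expression.

    Assumptions (not taken from the paper): relative expression is the
    cell's expression fraction divided by a per-gene reference value, and
    sequences are truncated to a hypothetical max_len.
    """
    counts = np.asarray(counts, dtype=float)
    gene_medians = np.asarray(gene_medians, dtype=float)

    total = counts.sum()
    if total == 0:
        return []  # an empty cell yields an empty sequence

    # Normalize by library size, then by the per-gene reference value.
    relative = (counts / total) / gene_medians

    # Keep only expressed genes and sort them from highest to lowest.
    expressed = np.flatnonzero(counts > 0)
    order = expressed[np.argsort(-relative[expressed], kind="stable")]
    return [gene_names[i] for i in order[:max_len]]

sequence = rank_genes(
    counts=[5, 0, 1, 4],
    gene_names=["A", "B", "C", "D"],
    gene_medians=[1.0, 1.0, 0.1, 1.0],
)
print(sequence)
```

In this toy call, gene C has the lowest raw count among the expressed genes but ranks first, because its count is large relative to its reference value. Gene B is dropped because it is not expressed.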
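
The zero-shot procedure in the Figure 3 caption (similarity scores and cell-text matching scores, each passed through Softmax and then combined by a weighted sum) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the LangCell implementation. The function name `zero_shot_scores`, the mixing weight `alpha`, the `temperature` parameter, and the use of cosine similarity are hypothetical choices, and `match_scores` stands in for whatever raw scores the matching module produces for each candidate description.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = np.asarray(x, dtype=float)
    shifted = x - x.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def zero_shot_scores(cell_emb, text_embs, match_scores, alpha=0.5, temperature=1.0):
    """Combine similarity and matching scores for one cell.

    cell_emb:     (d,) cell embedding from the shared space
    text_embs:    (k, d) embeddings of k candidate descriptions
    match_scores: (k,) raw matching scores, one per description
    alpha, temperature: hypothetical hyperparameters for this sketch
    """
    cell_emb = np.asarray(cell_emb, dtype=float)
    text_embs = np.asarray(text_embs, dtype=float)

    # Cosine similarity between the cell and every description.
    cell = cell_emb / np.linalg.norm(cell_emb)
    texts = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    similarity = texts @ cell / temperature

    # Weighted sum of the two distributions after Softmax.
    return alpha * softmax(similarity) + (1.0 - alpha) * softmax(match_scores)

scores = zero_shot_scores(
    cell_emb=[1.0, 0.0],
    text_embs=[[1.0, 0.0], [0.0, 1.0]],
    match_scores=[2.0, 0.0],
)
predicted = int(np.argmax(scores))
print(scores, predicted)
```

Because both terms are probability distributions and the weights sum to one, the combined scores also sum to one, so they can be read as class probabilities over the candidate descriptions.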