Cell2Text: Multimodal LLM for Generating Single-Cell Descriptions from RNA-Seq Data

Oussama Kharouiche, Aris Markogiannakis, Xiao Fei, Michail Chatzianastasis, Michalis Vazirgiannis

TL;DR

Cell2Text is a multimodal framework that converts scRNA-seq profiles into structured natural language descriptions by linking gene-level embeddings from a pretrained single-cell encoder to an instruction-tuned LLM. The approach achieves competitive classification performance on cell type, tissue, and disease tasks, and its predictions are ontologically coherent as measured by a PageRank-based similarity metric, while the generated text retains high semantic fidelity. By generating descriptive text rather than fixed labels, the method yields richer representations and scalable characterization of unseen cells, validated on a 1M-cell CELLxGENE-derived dataset. The work highlights the potential of integrating domain-specific pretrained models with LLMs to produce interpretable, biology-focused outputs and to support label-efficient analyses of large-scale single-cell datasets.

Abstract

Single-cell RNA sequencing has transformed biology by enabling the measurement of gene expression at cellular resolution, providing information about cell types, states, and disease contexts. Recently, single-cell foundation models have emerged as powerful tools for learning transferable representations directly from expression profiles, improving performance on classification and clustering tasks. However, these models are limited to discrete prediction heads, which collapse cellular complexity into predefined labels that fail to capture the richer, contextual explanations biologists need. We introduce Cell2Text, a multimodal generative framework that translates scRNA-seq profiles into structured natural language descriptions. By integrating gene-level embeddings from single-cell foundation models with pretrained large language models, Cell2Text generates coherent summaries that capture cellular identity, tissue origin, disease associations, and pathway activity, and it generalizes to unseen cells. Empirically, Cell2Text outperforms baselines on classification accuracy, demonstrates strong ontological consistency under PageRank-based similarity metrics, and achieves high semantic fidelity in text generation. These results demonstrate that coupling expression data with natural language offers both stronger predictive performance and inherently interpretable outputs, pointing to a scalable path for label-efficient characterization of unseen cells.
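
The abstract mentions PageRank-based similarity metrics for ontological consistency but does not define them here. The sketch below shows one plausible reading, not the paper's actual metric: run personalized PageRank on an ontology graph seeded at the true label, then read off the score of the predicted label, so that a near-miss (a sibling cell type) scores higher than a distant term. The toy ontology, the function names, and the damping factor of 0.85 are all assumptions made for illustration.

```python
# Hypothetical toy ontology as an undirected adjacency list (not from the paper).
TOY_ONTOLOGY = {
    "cell": ["lymphocyte", "neuron"],
    "lymphocyte": ["cell", "T cell", "B cell"],
    "T cell": ["lymphocyte", "CD4 T cell", "CD8 T cell"],
    "CD4 T cell": ["T cell"],
    "CD8 T cell": ["T cell"],
    "B cell": ["lymphocyte"],
    "neuron": ["cell"],
}


def personalized_pagerank(adj, seed, alpha=0.85, iters=100):
    """Power iteration for PageRank with all restarts going to `seed`."""
    nodes = list(adj)
    rank = {n: (1.0 if n == seed else 0.0) for n in nodes}
    for _ in range(iters):
        # Restart mass (1 - alpha) always returns to the seed node.
        nxt = {n: ((1.0 - alpha) if n == seed else 0.0) for n in nodes}
        for n in nodes:
            neighbours = adj[n]
            if not neighbours:
                # Dangling node: send its mass back to the seed.
                nxt[seed] += alpha * rank[n]
                continue
            share = alpha * rank[n] / len(neighbours)
            for m in neighbours:
                nxt[m] += share
        rank = nxt
    return rank


def ontology_similarity(adj, true_label, predicted_label):
    """Assumed metric: PageRank mass at the prediction when seeded at the truth."""
    return personalized_pagerank(adj, true_label)[predicted_label]


if __name__ == "__main__":
    for guess in ["CD8 T cell", "B cell", "neuron"]:
        score = ontology_similarity(TOY_ONTOLOGY, "CD4 T cell", guess)
        print(f"{guess}: {score:.4f}")
```

Under this reading, predicting "CD8 T cell" for a true "CD4 T cell" is penalized less than predicting "neuron", which is the kind of graded credit a flat accuracy score cannot give.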

Paper Structure

This paper contains 34 sections, 1 equation, 10 figures, and 6 tables.

Figures (10)

  • Figure 1: Overview of the Cell2Text framework. The model takes single-cell RNA-seq profiles as input and processes them through a pretrained Geneformer encoder to generate contextualized gene-level embeddings. These embeddings are projected into the semantic space of the language model via a lightweight adapter module, aligning biological signals with linguistic representations. A pretrained, instruction-tuned LLM decoder then generates structured natural language descriptions that capture cellular identity, tissue of origin, disease associations, and pathway activity.
  • Figure 2: Token length distribution of gene expression sequences after tokenization with the Geneformer tokenizer.
  • Figure 3: Token length distribution of natural language descriptions after tokenization with the Llama-3.2-1B-Instruct tokenizer.
  • Figure 4: Overview of the distribution of cell types in the dataset. For clarity, only the 30 most abundant categories out of 783 are shown.
  • Figure 5: Overview of the distribution of disease categories in the dataset. For clarity, only the 30 most abundant categories out of 128 are shown.
  • ...and 5 more figures
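
The data flow described in the Figure 1 caption (encoder embeddings, adapter projection, LLM decoder) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the caption says only that the adapter is "lightweight", so the two-layer MLP, the ReLU activation, the prefix-style concatenation with prompt embeddings, and all dimensions and names below are hypothetical. Random arrays stand in for the Geneformer output and for the embedded instruction tokens.

```python
import numpy as np

# Placeholder widths chosen for readability; they are NOT the real model sizes.
D_ENC, D_HIDDEN, D_LLM = 8, 32, 16


def init_adapter(d_in, d_hidden, d_out, rng):
    """Create weights for a small two-layer MLP adapter (assumed architecture)."""
    return {
        "w1": rng.normal(0.0, 0.02, (d_in, d_hidden)),
        "b1": np.zeros(d_hidden),
        "w2": rng.normal(0.0, 0.02, (d_hidden, d_out)),
        "b2": np.zeros(d_out),
    }


def adapter_forward(params, gene_emb):
    """Project gene-level embeddings (n_genes, d_in) into the LLM width (n_genes, d_out)."""
    hidden = np.maximum(gene_emb @ params["w1"] + params["b1"], 0.0)  # ReLU
    return hidden @ params["w2"] + params["b2"]


def build_decoder_inputs(params, gene_emb, prompt_emb):
    """Prepend projected gene tokens to the embedded text prompt (assumed layout)."""
    projected = adapter_forward(params, gene_emb)
    return np.concatenate([projected, prompt_emb], axis=0)


rng = np.random.default_rng(0)
params = init_adapter(D_ENC, D_HIDDEN, D_LLM, rng)
gene_emb = rng.normal(size=(5, D_ENC))    # stand-in for encoder output: 5 gene tokens
prompt_emb = rng.normal(size=(3, D_LLM))  # stand-in for 3 embedded instruction tokens
decoder_inputs = build_decoder_inputs(params, gene_emb, prompt_emb)
print(decoder_inputs.shape)  # (8, 16): 5 projected gene tokens + 3 prompt tokens
```

In a real system the resulting sequence would be passed to the decoder as input embeddings; which components are trained or frozen is not stated in this section, so the sketch makes no claim about it.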