scInterpreter: Training Large Language Models to Interpret scRNA-seq Data for Cell Type Annotation

Cong Li, Meng Xiao, Pengfei Wang, Guihai Feng, Xin Li, Yuanchun Zhou

TL;DR

This paper tackles the challenge of annotating cell types from scRNA-seq data using large language models by introducing scInterpreter, which leverages descriptive text to generate gene embeddings and an instruction-guided LLM readout to predict cell types. The method decomposes into gene-level embedding initialization from NCBI descriptions via GPT-3.5 and a frozen Llama-13b-based interpreter that projects top-gene embeddings and uses a ReadOut head for classification. Empirical results on HUMAN-10k and MOUSE-13k show scInterpreter significantly outperforms GenePT across standard metrics, with visualization evidence from UMAP confirming improved embedding separability. This work highlights the potential of integrating LLMs’ broad knowledge as supervisory signals to enhance biological interpretation of single-cell data, opening avenues for discovering new cellular insights.
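
The data flow described in the TL;DR can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the two-dimensional gene embeddings, the identity projection, the identity stand-in for the frozen LLM, the mean pooling before the readout, and every function name are hypothetical, and the instruction embedding is omitted for brevity.

```python
# Toy sketch: top-gene selection -> projection -> frozen backbone -> readout.
# All numbers, names, and the pooling step are illustrative assumptions.

def top_k_genes(expression, k):
    """Return the k gene names with the highest expression, highest first."""
    return sorted(expression, key=expression.get, reverse=True)[:k]

def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def mean_pool(vectors):
    """Average a list of equal-length vectors element-wise."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def frozen_backbone(token_vectors):
    """Stand-in for the frozen LLM: an identity map with no trainable weights."""
    return token_vectors

def predict_cell_type(expression, gene_embeddings, projection, readout, labels, k=2):
    genes = top_k_genes(expression, k)
    # Project each selected gene embedding into the backbone's input space.
    tokens = [matvec(projection, gene_embeddings[g]) for g in genes]
    hidden = frozen_backbone(tokens)
    cell_vector = mean_pool(hidden)        # assumed aggregation step
    logits = matvec(readout, cell_vector)  # readout head as a linear layer
    best = max(range(len(logits)), key=logits.__getitem__)
    return labels[best]

# Made-up embeddings; in the paper these come from text descriptions of genes.
gene_embeddings = {"CD3E": [1.0, 0.0], "CD19": [0.0, 1.0], "ACTB": [0.5, 0.5]}
projection = [[1.0, 0.0], [0.0, 1.0]]
readout = [[1.0, 0.0], [0.0, 1.0]]
labels = ["T cell", "B cell"]
cell = {"CD3E": 9.0, "CD19": 0.1, "ACTB": 5.0}

prediction = predict_cell_type(cell, gene_embeddings, projection, readout, labels)
print(prediction)
```

In this sketch only `projection` and `readout` would be trained, which mirrors the TL;DR's description of a frozen interpreter with a learned projection and ReadOut head.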

Abstract

Despite the inherent limitations of existing large language models in directly reading and interpreting single-cell omics data, they demonstrate significant potential and flexibility as foundation models. This research focuses on how to train and adapt a large language model to interpret and distinguish cell types in single-cell RNA sequencing data. Our preliminary results indicate that these foundation models categorize known cell types accurately, which suggests that large language models can be effective tools for uncovering new biological insights.

Paper Structure (9 sections, 3 equations, 4 figures)

Figures (4)

  • Figure 1: The pipeline of scInterpreter. The model will first embed each input from the cell and downstream task-specific instruction. Then the cell embedding and instruction embedding will pass through the LLMs. After aggregating the knowledge and structural information of the given cell by LLMs, the model ReadOut the representation and then conducts the downstream task.
  • Figure 2: The performance comparison between scInterpreter and GenePT
  • Figure 3: The confusion matrix of each method on MOUSE-13k.
  • Figure 4: The UMAP illustration of the cell representation from initialization, GenePT, and scInterpreter on MOUSE-13k.