scReader: Prompting Large Language Models to Interpret scRNA-seq Data

Cong Li, Qingqing Long, Yuanchun Zhou, Meng Xiao

TL;DR

scReader tackles cross-species interpretation of scRNA-seq data by integrating large language models with gene-level representations. It initializes gene embeddings from functional descriptions via GPT-3.5 and builds cell representations from a ranked gene sequence, then uses a frozen Llama-13b to perform prompt-informed cell-type classification. Across HUMAN-10k and MOUSE-13k, scReader delivers superior accuracy and robustness compared to GenePT, with larger gains when more gene knowledge is available (human vs mouse). This work demonstrates the feasibility of knowledge-rich LLMs to enrich single-cell analyses and points to promising directions in multi-omics integration and rare cell-type discovery for precision biology.

Abstract

Large language models (LLMs) have demonstrated remarkable advancements, primarily due to their capabilities in modeling the hidden relationships within text sequences. This innovation presents a unique opportunity in the field of life sciences, where vast collections of single-cell omics data from multiple species provide a foundation for training foundational models. However, the challenge lies in the disparity of data scales across different species, hindering the development of a comprehensive model for interpreting genetic data across diverse organisms. In this study, we propose an innovative hybrid approach that integrates the general knowledge capabilities of LLMs with domain-specific representation models for single-cell omics data interpretation. We begin by focusing on genes as the fundamental unit of representation. Gene representations are initialized using functional descriptions, leveraging the strengths of mature language models such as LLaMA-2. By inputting single-cell gene-level expression data with prompts, we effectively model cellular representations based on the differential expression levels of genes across various species and cell types. In the experiments, we constructed developmental cells from humans and mice, specifically targeting cells that are challenging to annotate. We evaluated our methodology through basic tasks such as cell annotation and visualization analysis. The results demonstrate the efficacy of our approach compared to other methods using LLMs, highlighting significant improvements in accuracy and interoperability. Our hybrid approach enhances the representation of single-cell data and offers a robust framework for future research in cross-species genetic analysis.
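The "ranked gene sequence" step mentioned in the TL;DR can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the function names, the `top_k` cutoff, the choice to drop unexpressed genes, and the random vectors standing in for description-derived gene embeddings are all hypothetical.

```python
import numpy as np


def rank_genes(expression, top_k):
    """Indices of the top_k expressed genes, highest expression first.

    Dropping zero-expression genes is an assumption of this sketch.
    """
    expression = np.asarray(expression, dtype=float)
    # A stable sort on negated values keeps the original order for ties.
    order = np.argsort(-expression, kind="stable")
    kept = [int(i) for i in order if expression[i] > 0][:top_k]
    return np.array(kept, dtype=int)


def cell_token_sequence(expression, gene_names, gene_embeddings, top_k=3):
    """Turn one cell's expression vector into an ordered sequence of
    gene embeddings, i.e. one embedding per ranked gene."""
    order = rank_genes(expression, top_k)
    names = [gene_names[i] for i in order]
    return names, gene_embeddings[order]


# Toy data: four genes, one cell. The embeddings are random placeholders
# for vectors that would be derived from gene functional descriptions.
gene_names = ["GATA1", "SOX2", "CD3E", "ALB"]
expression = [0.0, 5.0, 2.0, 7.0]
gene_embeddings = np.random.default_rng(0).normal(size=(4, 8))

names, tokens = cell_token_sequence(expression, gene_names, gene_embeddings)
print(names)         # ['ALB', 'SOX2', 'CD3E']
print(tokens.shape)  # (3, 8)
```

In a pipeline of the kind the abstract describes, a sequence like `tokens` would be combined with an embedded task prompt and passed to the language model; that part is omitted here because the source does not specify it at this level of detail.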

Paper Structure

This paper contains 9 sections, 4 equations, 6 figures.

Figures (6)

  • Figure 1: Illustration of scReader. (a) Generating gene embeddings from NCBI gene descriptions. (b) The pipeline of scInterpreter. The model first embeds each input from the cell together with the downstream task-specific instruction. The cell embedding and instruction embedding then pass through the LLM. After the LLM aggregates the knowledge and structural information of the given cell, the model reads out the representation and performs the downstream task.
  • Figure 2: The performance comparison between scReader and GenePT.
  • Figure 3: The confusion matrix of each method on MOUSE-13k.
  • Figure 4: The confusion matrix of each method on HUMAN-10k.
  • Figure 5: The UMAP illustration of the cell representation from initialization, GenePT, and scReader on MOUSE-13k.
  • ...and 1 more figure

Theorems & Definitions (4)

  • Definition 1: Gene Description
  • Definition 2: Large Language Model
  • Definition 3: Single-cell RNA sequencing Data
  • Definition 4: Cell Type Annotation Task