scReader: Prompting Large Language Models to Interpret scRNA-seq Data
Cong Li, Qingqing Long, Yuanchun Zhou, Meng Xiao
TL;DR
scReader tackles cross-species interpretation of scRNA-seq data by integrating large language models with gene-level representations. It initializes gene embeddings from functional descriptions via GPT-3.5 and builds cell representations from a ranked gene sequence, then uses a frozen Llama-13b to perform prompt-informed cell-type classification. On both HUMAN-10k and MOUSE-13k, scReader is more accurate and more robust than GenePT, with larger gains where more gene knowledge is available (human versus mouse). This work demonstrates the feasibility of using knowledge-rich LLMs to enrich single-cell analyses and points to promising directions in multi-omics integration and rare cell-type discovery for precision biology.
Abstract
Large language models (LLMs) have advanced remarkably, primarily because of their ability to model hidden relationships within text sequences. This progress presents a unique opportunity in the life sciences, where vast collections of single-cell omics data from multiple species provide a basis for training foundation models. However, data scales differ widely across species, which hinders the development of a comprehensive model for interpreting genetic data across diverse organisms. In this study, we propose a hybrid approach that integrates the general knowledge of LLMs with domain-specific representation models for single-cell omics data interpretation. We begin by treating genes as the fundamental unit of representation. Gene representations are initialized from functional descriptions, leveraging mature language models such as LLaMA-2. By inputting single-cell gene-level expression data together with prompts, we model cellular representations based on the differential expression levels of genes across species and cell types. In the experiments, we constructed datasets of developmental cells from humans and mice, specifically targeting cells that are challenging to annotate. We evaluated our method on basic tasks such as cell annotation and visualization analysis. The results demonstrate the efficacy of our approach compared with other LLM-based methods, with significant improvements in accuracy and interoperability. Our hybrid approach enhances the representation of single-cell data and offers a robust framework for future research in cross-species genetic analysis.
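To make the pipeline described in the abstract more concrete, the following is a minimal sketch of how a cell might be turned into a ranked gene sequence, mapped to gene embeddings, and paired with a prompt. It is an illustration under stated assumptions and not the authors' implementation: the function names (`rank_genes`, `cell_sequence`, `build_prompt`), the prompt wording, the `top_k` cutoff, and the random placeholder embeddings (standing in for vectors derived from gene functional descriptions) are all hypothetical.

```python
import random


def rank_genes(expression, gene_names, top_k):
    """Return the names of the top_k expressed genes, highest expression first."""
    # Keep only genes that are actually expressed in this cell.
    expressed = [(value, name) for value, name in zip(expression, gene_names) if value > 0]
    # list.sort() is stable, so tied genes keep their original order.
    expressed.sort(key=lambda pair: -pair[0])
    return [name for _, name in expressed[:top_k]]


def cell_sequence(ranked_genes, gene_embeddings):
    """Map a ranked gene list to a sequence of embedding vectors (one per gene)."""
    return [gene_embeddings[name] for name in ranked_genes]


def build_prompt(species):
    """Hypothetical task prompt; the wording is an assumption, not from the paper."""
    return f"Identify the cell type of the following {species} cell."


# Toy data: four genes, one cell.
gene_names = ["GATA1", "SOX2", "CD3E", "ALB"]

# Placeholder embeddings. In the described approach these would be initialized
# from gene functional descriptions encoded by a language model.
rng = random.Random(0)
gene_embeddings = {name: [rng.gauss(0.0, 1.0) for _ in range(8)] for name in gene_names}

cell_expression = [0.0, 5.2, 1.3, 5.2]

ranked = rank_genes(cell_expression, gene_names, top_k=3)
tokens = cell_sequence(ranked, gene_embeddings)
prompt = build_prompt("human")

print(ranked)       # gene names ordered by expression
print(len(tokens))  # number of gene vectors passed on with the prompt
print(prompt)
```

In this sketch the prompt and the ranked gene vectors would be handed together to a frozen language model with a trainable classification component on top; that downstream step is omitted here because the abstract does not specify its details.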
