scAgent: Universal Single-Cell Annotation via a LLM Agent

Yuren Mao, Yu Mi, Peigen Liu, Mengfei Zhang, Hanqing Liu, Yunjun Gao

TL;DR

scAgent tackles universal cell annotation by integrating an LLM-driven planning module, a modular action space with MoE-LoRA plugins, and a dynamic memory system to generalize across tissues, discover novel cell types, and incrementally learn new categories with limited data. The framework enables data-efficient cross-tissue cell type annotation (CTA) and robust novel cell detection under batch effects, demonstrated across 35 tissues and 160 cell types with state-of-the-art performance. Key contributions include the three-part agent architecture, dual-embedding novel cell detection, and efficient incremental learning via modular plugins, all validated on large scRNA-seq datasets. The approach is practically relevant for scalable, cross-tissue cellular annotation and for future multi-omic and spatial extensions.

Abstract

Cell type annotation is critical for understanding cellular heterogeneity. Based on single-cell RNA-seq data and deep learning models, good progress has been made in annotating a fixed number of cell types within a specific tissue. However, universal cell annotation, which can generalize across tissues, discover novel cell types, and be extended to newly discovered cell types, remains less explored. To fill this gap, this paper proposes scAgent, a universal cell annotation framework based on Large Language Models (LLMs). scAgent can identify cell types and discover novel cell types in diverse tissues; furthermore, it learns novel cell types in a data-efficient manner. Experimental studies on 160 cell types and 35 tissues demonstrate the superior performance of scAgent in general cell-type annotation, novel cell discovery, and extensibility to novel cell types.

Paper Structure

This paper contains 19 sections, 16 equations, 7 figures.

Figures (7)

  • Figure 1: Overview of scAgent. a Simulation of various user queries. scAgent generates appropriate answers to different user requests, including cell type annotation, novel cell detection, and extension to novel cell types. b The planning module of scAgent. The planning module receives a user query and outputs a plan. Planning is driven primarily by an LLM (DeepSeek-R1 671B) and assisted by tools from the action space and information from the memory module, which are illustrated as black icons on the circular arrow. The generated plan determines the action sequence. In the action sequence, each black icon represents a category of tools or memory found in the action space or the memory module. The green arrows denote interaction with the action space, while the dark blue arrows denote interaction with the memory module. Each white box signifies an action, carried out by one or more tools working together and integrated with memory as needed. c The composition of the action space. scAgent employs scGPT (pre-trained on 33 million cells) as the foundational scRNA model while remaining extensible to other deep learning models. There are over 30 tissue-specific MoE-LoRA plugins. The embedding analysis tools consist of outlier detection and embedding comparison. By analyzing outliers and comparing the input embeddings with the embeddings stored in memory, these tools assist with cell annotation and the discovery of novel cell types. The incremental training tools include training and data update tools. The data update tool merges the original and new data and also updates the datasets and other corresponding information in the memory module. The incremental training tool continually trains the MoE-LoRA plugins. d The information in the memory module. Published datasets are stored in the memory module for model training. Embeddings are categorized as LoRA-enhanced or standard, referring to hidden states generated by scRNA models with or without MoE-LoRA plugins, respectively. System history includes query logs, tool execution sequences, and cache, which help with efficient planning.
  • Figure 2: Cross-tissue CTA results of scAgent. a,b scAgent ranks first in accuracy, weighted F1-score, and macro F1-score compared to other CTA methods on the CG dataset (a) and the TS dataset (b). The bars in the bar chart are arranged from highest to lowest. c scAgent can identify diverse cell types. In the confusion matrix, each row represents the true cell type in the CG reference dataset, and each column represents the cell type predicted by scAgent. The color coding for the cell types is provided in the legend below. The values in the confusion matrix have been normalized by row, so that each diagonal value is the recall for the corresponding true cell type. d,e scAgent shows superior tissue-specific performance on the CG dataset (d) and the TS dataset (e). Each vertex of the radar chart represents a specific tissue, and the length of the axis indicates the weighted F1-score for cell annotation on this tissue. f scAgent captures the distinctive features of diverse cell types. Compared to scGPT and scTab (10X data), the UMAP visualization of scAgent on the CG reference dataset shows greater distances between cell clusters, demonstrating its superior feature extraction capability.
  • Figure 3: Performance of scAgent in novel cell detection and batch effect correction. a UMAP visualization of the raw feature space for the Liver Breast Cancer and Kidney ccRCC datasets. Novel cells cluster separately from normal cells and are highlighted by red bounding boxes. b scAgent labels most novel cells as unknown. The heatmap shows the proportion of cells in each row with original label O (shown on the right) predicted as cell type P (shown on the top). c UMAP visualization of the raw feature space for data from different batches. The same cell type from different batches is widely separated, indicating significant batch effects. d UMAP visualization of the features that scAgent produces for different dataset batches. The same cell type from different batches clusters together, demonstrating the effectiveness of scAgent in reducing batch effects. e UMAP visualization of the features that scAgent produces for Liver Breast Cancer (left panel, marked by *) and Kidney ccRCC (right panel, marked by #) compared to CG reference data. Novel cells (circled in red) in the liver are well separated from the reference data, while those in the kidney overlap with leukocytes and macrophages, making detection more challenging. f Novel cell detection accuracy of scAgent and three other methods (threshold, OpenMax, DOC) on the Liver Breast Cancer and Kidney ccRCC datasets. scAgent consistently outperforms the other methods.
  • Figure 4: Results of incremental training in scAgent. a The UMAP visualization is generated from the embeddings of reference data from liver tissue and malignant tumor cells prior to incremental training. The malignant tumor cells, highlighted with red bounding boxes, are classified as "unknown" by scAgent since they were not present in the reference data. b Following incremental learning with a limited number of labeled samples, scAgent successfully annotates these previously unknown cells with their ground truth labels (indicated by red bounding boxes). c We vary the number of incremental learning samples (i.e., unknown cells with ground truth labels) from 10 to 50 and evaluate the recognition accuracy for both novel cells and known cells. d The cross-entropy loss on the validation set is plotted after each training epoch for scAgent, demonstrating the model's performance after incremental learning with a small number of labeled samples.
  • Figure 5: Prompt Template for planning.
  • ...and 2 more figures
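
The Figure 2 caption reports accuracy, weighted F1-score, and macro F1-score, and describes a confusion matrix normalized by row. The following is a minimal sketch of how these quantities relate, using NumPy on a made-up toy labeling; it is not the authors' evaluation code. The function names (`confusion_matrix`, `row_normalize`, `f1_scores`) and the toy labels are assumptions introduced here for illustration, and the macro average is taken over all `n_classes` classes, which is one common convention rather than something the source specifies.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, n_classes):
    """Count matrix: rows are true classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return cm


def row_normalize(cm):
    """Divide each row by its sum; the diagonal then holds per-class recall."""
    row_sums = cm.sum(axis=1, keepdims=True)
    out = np.zeros(cm.shape, dtype=float)
    return np.divide(cm, row_sums, out=out, where=row_sums > 0)


def f1_scores(cm):
    """Return (macro F1, support-weighted F1) from a count matrix."""
    tp = np.diag(cm).astype(float)
    support = cm.sum(axis=1).astype(float)    # true cells per class
    predicted = cm.sum(axis=0).astype(float)  # predicted cells per class
    recall = np.divide(tp, support, out=np.zeros_like(tp), where=support > 0)
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    macro = f1.mean()                               # every class counts equally
    weighted = (f1 * support).sum() / support.sum() # classes weighted by support
    return macro, weighted


# Hypothetical toy labels for three cell types (0, 1, 2).
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred, n_classes=3)
recall_matrix = row_normalize(cm)
accuracy = np.trace(cm) / cm.sum()
macro_f1, weighted_f1 = f1_scores(cm)

print(recall_matrix)
print(accuracy, macro_f1, weighted_f1)
```

In this toy case the rare class 2 is always misclassified, so the macro F1-score drops well below the weighted F1-score. This gap is why reporting both averages is informative when cell type frequencies are imbalanced.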