A Multi-Modal AI Copilot for Single-Cell Analysis with Instruction Following

Yin Fang, Xinle Deng, Kangwei Liu, Ningyu Zhang, Jingyang Qian, Penghui Yang, Xiaohui Fan, Huajun Chen

TL;DR

InstructCell is a multi-modal AI copilot that uses natural language as a medium for more direct and flexible single-cell analysis. It provides an accessible and intuitive tool for exploring complex single-cell data, lowering technical barriers and enabling deeper biological insights.

Abstract

Large language models excel at interpreting complex natural language instructions, enabling them to perform a wide range of tasks. In the life sciences, single-cell RNA sequencing (scRNA-seq) data serves as the "language of cellular biology", capturing intricate gene expression patterns at the single-cell level. However, interacting with this "language" through conventional tools is often inefficient and unintuitive, posing challenges for researchers. To address these limitations, we present InstructCell, a multi-modal AI copilot that leverages natural language as a medium for more direct and flexible single-cell analysis. We construct a comprehensive multi-modal instruction dataset that pairs text-based instructions with scRNA-seq profiles from diverse tissues and species. Building on this, we develop a multi-modal cell language architecture capable of simultaneously interpreting and processing both modalities. InstructCell empowers researchers to accomplish critical tasks, such as cell type annotation, conditional pseudo-cell generation, and drug sensitivity prediction, using straightforward natural language commands. Extensive evaluations demonstrate that InstructCell consistently meets or exceeds the performance of existing single-cell foundation models, while adapting to diverse experimental conditions. More importantly, InstructCell provides an accessible and intuitive tool for exploring complex single-cell data, lowering technical barriers and enabling deeper biological insights.

Paper Structure (11 sections, 20 equations, 13 figures)

Figures (13)

  • Figure 1: Overview of InstructCell. a, Summary of incorporated single-cell data. InstructCell incorporates 299,155 scRNA-seq samples from human and mouse origins, spanning multiple organs. CPCG denotes Conditional Pseudo-cell Generation, CTA denotes Cell Type Annotation, and DSP denotes Drug Sensitivity Prediction. b, Architecture of the multi-modal cell language model. The model processes both text and single-cell data via three primary components: a Q-Former to capture single-cell gene expression knowledge, a pre-trained LM as the backbone, and a cell reconstruction module for generating single-cell gene expression profiles. c, Construction of multi-modal single-cell instruction data. Complete instruction-response pairs are formed by combining required and optional attributes from text and single-cell modalities. d, Simulation of diverse communication styles. LLMs generate chat templates with varying traits (personality, motivation, and proficiency) to produce instructions that convey task-related information in different communication styles.
  • Figure 2: Conditional pseudo-cell generation results by InstructCell. a, UMAP visualizations of real and generated cells. The left plot shows the overlap between real and generated cells. The middle and right plots display real and generated cells, respectively, with distinct colors indicating different cell types. b, Dot plots of gene expression patterns derived from real (top) and generated (bottom) cells. Based on the test set from Tabular-Sapiens, we use Welch's $t$-test to identify the top three significant genes for each cell type and display them along the x-axis. Cell types are arranged along the y-axis. The size of each dot indicates the proportion of single cells within the corresponding cell type that express the gene, while the color of the dot represents the mean expression level of the gene within that cell type. The results of the remaining two datasets are available in Fig. \ref{fig:bubble_plot}. c, Quantitative evaluation of cell generation performance across four datasets. A lower $\triangle$sKNN value indicates better structural alignment, a higher pKNN value reflects improved positional correspondence, and a lower MMD value denotes a more accurate approximation of the global data distribution.
  • Figure 3: Cell type annotation results by InstructCell. a, Evaluation of InstructCell's CTA performance across human heart, liver, pancreas, and mouse skin and pancreas datasets. Performance is quantified using weighted F1, macro F1, and accuracy metrics, with different colors representing different models. b, UMAP visualization of three different datasets. The left panel is colored by expert-annotated cell types from the original research, and the right panel is colored by InstructCell prediction results. c, Confusion matrices between predicted cell types and actual annotations for the three datasets. Darker shades denote a higher frequency of agreement between the model's predictions and the actual cell type annotations.
  • Figure 4: Drug sensitivity prediction results by InstructCell. a, Evaluation of InstructCell's DSP performance across human oral, lung, and mouse bone datasets. Performance is quantified using weighted F1, macro F1, and accuracy metrics, with different colors representing different models. b, UMAP visualization of the three datasets, with cells colored by drug sensitivity labels (sensitive, resistant, and holiday) for both expert-annotated results and InstructCell predictions. c, Confusion matrices between predicted drug sensitivity labels and actual annotations for the three datasets. Darker shades denote a higher frequency of agreement between the model's predictions and the actual drug sensitivity annotations.
  • Figure 5: Robustness of InstructCell. a, Quantitative comparison of the CPCG task under seen and unseen instruction templates. Results are shown for $\triangle$sKNN and pKNN metrics at varying numbers of neighbors $K$, as well as for MMD. Different colors denote whether the instruction templates are seen or unseen. b, Average performance of InstructCell under instruct and chat modes across each task. On the left side (classification tasks), the shape of each scatter point indicates whether options are provided or not, while the color distinguishes model versions. Each configuration includes 40 scatter points (20 with options and 20 without). On the right side (generative task), different colors represent different model versions.
  • ...and 8 more figures
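
The figure captions above name several evaluation metrics: accuracy, macro F1, and weighted F1 for the classification tasks (CTA and DSP), and MMD for pseudo-cell generation. The following is a minimal sketch of how such metrics are commonly computed, not the authors' evaluation code. The function names `classification_scores` and `rbf_mmd2` are hypothetical, and the RBF kernel with a fixed `gamma` is an assumption, since this page does not state which kernel or bandwidth the paper uses for MMD. The sKNN and pKNN metrics are not defined on this page, so they are not sketched here.

```python
import numpy as np


def classification_scores(y_true, y_pred):
    """Return (accuracy, macro F1, weighted F1) for two label sequences."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # Score every label that appears in either the annotations or the predictions.
    labels = np.unique(np.concatenate([y_true, y_pred]))
    f1 = np.zeros(len(labels))
    support = np.zeros(len(labels))
    for i, label in enumerate(labels):
        tp = np.sum((y_pred == label) & (y_true == label))
        fp = np.sum((y_pred == label) & (y_true != label))
        fn = np.sum((y_pred != label) & (y_true == label))
        denom = 2 * tp + fp + fn
        f1[i] = 2 * tp / denom if denom > 0 else 0.0
        support[i] = np.sum(y_true == label)  # number of true cells per class
    accuracy = float(np.mean(y_true == y_pred))
    macro_f1 = float(f1.mean())  # every class counts equally
    weighted_f1 = float(np.sum(f1 * support) / support.sum())  # weighted by class size
    return accuracy, macro_f1, weighted_f1


def rbf_mmd2(x, y, gamma=1.0):
    """Biased estimate of squared MMD between samples x and y.

    Assumes an RBF kernel k(a, b) = exp(-gamma * ||a - b||^2); the kernel
    choice and gamma are illustrative assumptions, not taken from the paper.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    def kernel(a, b):
        # Pairwise squared Euclidean distances, clipped at zero for stability.
        sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2 * a @ b.T
        return np.exp(-gamma * np.maximum(sq, 0.0))

    return float(kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean())
```

Macro F1 averages per-class F1 scores equally, so rare cell types weigh as much as common ones, whereas weighted F1 scales each class by its number of annotated cells. For MMD, identical real and generated samples give a value of zero, and the value grows as the two distributions move apart, which matches the caption's reading that lower is better.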