
ELISA: An Interpretable Hybrid Generative AI Agent for Expression-Grounded Discovery in Single-Cell Genomics

Omar Coser

Abstract

Translating single-cell RNA sequencing (scRNA-seq) data into mechanistic biological hypotheses remains a critical bottleneck: agentic AI systems lack direct access to transcriptomic representations, while expression foundation models remain opaque to natural language. Here we introduce ELISA (Embedding-Linked Interactive Single-cell Agent), an interpretable framework that unifies scGPT expression embeddings with BioBERT-based semantic retrieval and LLM-mediated interpretation for interactive single-cell discovery. An automatic query classifier routes inputs to gene marker scoring, semantic matching, or reciprocal rank fusion pipelines, depending on whether the query is a gene signature, a natural language concept, or a mixture of both. Integrated analytical modules perform pathway activity scoring across 60+ gene sets, ligand--receptor interaction prediction using 280+ curated pairs, condition-aware comparative analysis, and cell-type proportion estimation, all operating directly on embedded data without access to the original count matrix. Benchmarked across six diverse scRNA-seq datasets spanning inflammatory lung disease, pediatric and adult cancers, organoid models, healthy tissue, and neurodevelopment, ELISA significantly outperforms CellWhisperer in cell type retrieval (combined permutation test, $p < 0.001$), with particularly large gains on gene-signature queries (Cohen's $d = 5.98$ for MRR). ELISA replicates published biological findings (mean composite score 0.90) with near-perfect pathway alignment and theme coverage (0.98 each), and generates candidate hypotheses through grounded LLM reasoning, bridging the gap between transcriptomic data exploration and biological discovery. Code available at: https://github.com/omaruno/ELISA-An-AI-Agent-for-Expression-Grounded-Discovery-in-Single-Cell-Genomics.git (If you use ELISA in your research, please cite this work).
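
To make the hybrid retrieval step concrete, the following is a minimal sketch of reciprocal rank fusion (RRF) over two ranked lists of cell-type clusters, one from gene marker scoring and one from semantic matching. It is an illustration under stated assumptions, not ELISA's actual implementation: the function name `reciprocal_rank_fusion`, the smoothing constant `k=60` (a conventional default for RRF), and the example cluster labels are all hypothetical and do not come from the abstract.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists into one ranking.

    Each item receives 1 / (k + rank) from every list it appears in
    (rank starts at 1); items are then sorted by their summed score.
    The constant k=60 is an assumed conventional default, not a value
    taken from the paper.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical rankings for a mixed query (gene signature + free text).
gene_ranking = ["NK cell", "CD8 T cell", "basal cell"]
semantic_ranking = ["NK cell", "macrophage", "CD8 T cell"]

fused = reciprocal_rank_fusion([gene_ranking, semantic_ranking])
print(fused)
```

Because RRF uses only rank positions, it can combine pipelines whose raw scores live on different scales (for example, marker-gene scores and embedding similarities) without any score normalization.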

Paper Structure

This paper contains 89 sections, 8 equations, 4 figures, 9 tables.

Figures (4)

  • Figure 1: Overview of the ELISA architecture. The framework comprises three stages. In data preparation (left), a single-cell dataset undergoes standard preprocessing (normalization, log-transform, highly variable gene selection, PCA, neighbor graph construction, and Leiden clustering), after which per-cluster differential expression statistics are computed, enriched with Gene Ontology (GO) and Reactome terms, and encoded into 768-dimensional semantic embeddings via BioBERT. In parallel, cell-level expression embeddings are generated through scGPT. Both representations are fused into a single serialized embedding file (.pt). In the retrieval and analysis stage (center), a query classifier routes user input---gene signatures, natural language concepts, or mixed queries---to the appropriate pipeline: gene marker scoring, semantic retrieval, or hybrid retrieval via reciprocal rank fusion (RRF). Additional analytical modules perform pathway scoring, ligand--receptor interaction prediction, comparative analysis, and proportion estimation directly on the embedded data. In the interpretation stage (right), all retrieval and analysis outputs are passed to a Groq-hosted LLM (LLaMA 3.1-8B) that generates grounded biological interpretations and structured reports.
  • Figure 2: ELISA outperforms CellWhisperer across six datasets and both query types. Radar plots showing retrieval performance on ontology (Ont) and expression (Exp) queries for each dataset. Each plot displays six axes: Cluster Recall@$k$ at two dataset-adapted cutoffs and Mean Reciprocal Rank (MRR), evaluated separately on ontology and expression queries (see Supplementary Section \ref{sec:ret_metrics1} for metric definitions). Higher values (further from center) indicate better performance. Four retrieval modes are compared: CellWhisperer (pink dashed), ELISA Semantic (blue), ELISA scGPT (orange), and ELISA Union (green). The Union mode consistently achieves the largest radar footprint, matching or exceeding CellWhisperer on ontology metrics while substantially outperforming it on expression metrics. ELISA Union significantly outperformed CellWhisperer across all datasets and metrics (combined permutation test, $p < 0.001$; see Table \ref{tab:retrieval_stats}).
  • Figure 3: Cell-level UMAP of the cystic fibrosis airway dataset (D1) colored by Cell Ontology annotation. Approximately 96,000 cells are shown across 30 annotated cell types spanning immune (T cells, B cells, NK cells, macrophages, monocytes, dendritic cells, mast cells), epithelial (basal, suprabasal, multiciliated, secretory, goblet, club, ionocyte, neuroendocrine), and stromal (fibroblasts, pericytes, endocardial cells) compartments. Labels are placed at cluster centroids with iterative repulsion to minimize overlap.
  • Figure 4: Expression of HLA-E projected onto the cell-level UMAP of the cystic fibrosis airway dataset (D1). Color intensity (purple gradient) indicates normalized expression level, with non-expressing cells shown in grey. HLA-E is most highly expressed in immune cell clusters, particularly CD8$^+$ T cells and NK cells, consistent with its role as a ligand for the NKG2A inhibitory receptor. Moderate expression is observed across epithelial populations including basal cells, supporting the HLA-E/NKG2A immune checkpoint axis identified by Berg et al.
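
The retrieval metrics named in the Figure 2 caption, Cluster Recall@$k$ and Mean Reciprocal Rank (MRR), can be sketched as follows. This uses the standard textbook definitions; the paper's exact definitions are given in its supplementary material and may differ in detail. The function names and the toy queries are hypothetical.

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of relevant clusters found among the top-k retrieved."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = len(set(ranked[:k]) & relevant)
    return hits / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant cluster, or 0.0 if none is retrieved."""
    relevant = set(relevant)
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries):
    """Average reciprocal rank over (ranked_list, relevant_set) pairs."""
    if not queries:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)


# Toy example: two queries with hypothetical cluster labels.
queries = [
    (["a", "b", "c"], {"b"}),       # first relevant hit at rank 2
    (["x", "y", "z"], {"x", "z"}),  # first relevant hit at rank 1
]
mrr = mean_reciprocal_rank(queries)
recall = recall_at_k(["x", "y", "z"], {"x", "z"}, k=2)
print(mrr, recall)
```

MRR rewards placing the first correct cluster near the top of the list, while Recall@$k$ measures how many of the expected clusters appear within the cutoff, which is why the caption reports both.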