Large Language Models Meet Virtual Cell: A Survey
Krinos Li, Xianglu Xiao, Shenglong Deng, Lucas He, Zijun Zhong, Yuanjie Zou, Zhonghao Zhan, Zheng Hui, Weiye Bao, Guang Yang
TL;DR
This survey examines how LLMs can make cellular modeling scalable and trustworthy, organizing existing methods under two paradigms: Oracles, which directly model cellular states, and Agents, which orchestrate experiments and analyses. It catalogs architectures and modalities, ranging from DNA/RNA/protein sequence modeling to multi-omics fusion and text grounding, along with datasets and benchmarks for cellular representation, perturbation prediction, and gene regulation inference. Key contributions include a unified taxonomy, a synthesis of datasets (e.g., CELLxGENE, Tabula Sapiens, JUMP-Cell Painting) and evaluation metrics, and a roadmap highlighting challenges in scalability, generalizability, and interpretability. The survey also points to practical impact in drug discovery and personalized medicine, outlining next-generation AI workflows that couple robust grounding with autonomous scientific exploration.
Abstract
Large language models (LLMs) are transforming cellular biology by enabling the development of "virtual cells": computational systems that represent, predict, and reason about cellular states and behaviors. This work provides a comprehensive review of LLMs for virtual cell modeling. We propose a unified taxonomy that organizes existing methods into two paradigms: LLMs as Oracles, for direct cellular modeling, and LLMs as Agents, for orchestrating complex scientific tasks. We identify three core tasks (cellular representation, perturbation prediction, and gene regulation inference) and review their associated models, datasets, and evaluation benchmarks, as well as the critical challenges in scalability, generalizability, and interpretability.
