The cell as a token: high-dimensional geometry in language models and cell embeddings
William Gilpin
TL;DR
The paper builds a conceptual bridge between high-dimensional language embeddings and the single-cell embeddings used in virtual cell models, arguing that context shapes embedding geometry in both domains via a distributional hypothesis and low-dimensional manifolds. It surveys static versus dynamic embeddings, polysemy, and manifold structure, and extends these ideas to cross-lingual alignment and to universal, multimodal embeddings that span species and data modalities. The author advocates mechanistic interpretability and topological analyses to probe embedding geometry, and discusses inference-time reasoning and zero-shot capabilities as a frontier for single-cell biology. The work outlines a framework for designing more informative cell atlases and more robust virtual cell models, with potential for improved interpretability, cross-species integration, and predictive regulatory insight, while acknowledging substantial limitations of applying language analogies to biology.
Abstract
Single-cell sequencing technology maps cells to a high-dimensional space encoding their internal activity. Recently proposed virtual cell models extend this concept, enriching cells' representations based on patterns learned from pretraining on vast cell atlases. This review explores how advances in understanding the structure of natural language embeddings inform ongoing efforts to analyze single-cell datasets. Both fields process unstructured data by partitioning datasets into tokens embedded within a high-dimensional vector space. We discuss how the context of tokens influences the geometry of embedding space, and how low-dimensional manifolds shape this space's robustness and interpretation. We highlight how new developments in foundation models for language, such as interpretability probes and in-context reasoning, can inform efforts to construct cell atlases and train virtual cell models.
