The cell as a token: high-dimensional geometry in language models and cell embeddings

William Gilpin

TL;DR

The paper builds a conceptual bridge between high-dimensional language embeddings and single-cell embeddings used in virtual cell models, arguing that context shapes embedding geometry in both domains via a distributional hypothesis and low-dimensional manifolds. It surveys static versus dynamic embeddings, polysemy, and manifold structure, and extends these ideas to cross-lingual alignment and universal, multimodal embeddings that span species and data modalities. The author advocates mechanistic interpretability and topological analyses to probe embedding geometry, and discusses inference-time reasoning and zero-shot capabilities as a frontier for single-cell biology. The work outlines a framework for designing more informative cell atlases and robust virtual cell models, with potential for improved interpretability, cross-species integration, and predictive regulatory insight, while acknowledging substantial limitations of applying language analogies to biology.

Abstract

Single-cell sequencing technology maps cells to a high-dimensional space encoding their internal activity. Recently proposed virtual cell models extend this concept, enriching cells' representations based on patterns learned from pretraining on vast cell atlases. This review explores how advances in understanding the structure of natural language embeddings inform ongoing efforts to analyze single-cell datasets. Both fields process unstructured data by partitioning datasets into tokens embedded within a high-dimensional vector space. We discuss how the context of tokens influences the geometry of embedding space, and how low-dimensional manifolds shape this space's robustness and interpretation. We highlight how new developments in foundation models for language, such as interpretability probes and in-context reasoning, can inform efforts to construct cell atlases and train virtual cell models.

Paper Structure

This paper contains 8 sections, 3 figures.

Figures (3)

  • Figure 1: Low-rank structure in high-dimensional embeddings. (A) An embedding of the full text of the novel Blood Meridian using a word2vec model originally trained on a dataset of $10^{11}$ words drawn from Google News articles [mikolov2013distributed]. Vectors are clustered using K-means partitioning, and then summarized into metagroups with a topic embedding model (colors and annotations). (B) An embedding of $6 \times 10^4$ human peripheral blood mononuclear cells based on single-cell RNA sequencing of $1.6 \times 10^4$ genes. Colors correspond to immune cell subtypes, as determined by marker genes for characteristic cell surface proteins such as CD4 and CD8.
  • Figure 2: Analogies and low-dimensional manifolds. (A) Embeddings of particular sequences of tokens using the model of Fig. 1, with examples of escalating manifolds (red and blue lines), which overlap in regions with similar meaning (weak polysemy). A token with strong polysemy appears at an intermediate location (purple circle). An example of an analogy relationship encoded as nearly-congruent difference vectors (turquoise arrows). While nonlinear embedding methods like UMAP distort the local metric over large scales [chari2023specious], the nearby position of the two analogy vectors' heads and tails protects their congruency. (B) RNA velocity applied to developing endocrine cells in the pancreas [bastidas2019comprehensive, la2018rna]. Vectors correspond to development direction, and color corresponds to pseudotime assigned via diffusion components. Cell types along the differentiation axis are overlaid.
  • Figure 3: Mechanistic interpretability in single-cell foundation models. (A) Common architectural features and target tasks for single-cell foundation models. (B) Mechanistic interpretability methods for single-cell embeddings. (Left) Intrinsic dimensionality may be calculated directly from expression profiles, or from internal activations of the model. Inset shows the intrinsic dimensionality of staged expression profiles from developing mice. Panel adapted from Ref. [biondo2024intrinsic]. (Right) Sparse autoencoders are trained in an unsupervised manner to reconstruct internal activations of foundation models, by mapping activations to sparse combinations of features in a latent dictionary. Inset shows application of sparse autoencoders to the activations of the Universal Cell Embedding model on a dataset of human bone marrow. The left subpanel corresponds to annotated cell types, while the right corresponds to the decoding of a single latent unit. Panels adapted from Ref. [schuster2024can].
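
The Figure 2 caption describes an analogy as a pair of nearly-congruent difference vectors. A minimal sketch of how such congruence might be quantified is given below, using cosine similarity between the two difference vectors. The toy embeddings, the noise level, and the function names `cosine` and `analogy_congruence` are assumptions made for illustration only; they are not taken from the paper, which uses a pretrained word2vec model.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy_congruence(a, b, c, d):
    """Cosine similarity between the difference vectors (b - a) and (d - c).

    A value near 1 means the pairs (a -> b) and (c -> d) are related by
    nearly the same offset, i.e. the difference vectors are nearly congruent.
    """
    return cosine(b - a, d - c)

# Hypothetical toy embeddings (not real word vectors): two pairs that
# share one offset direction, perturbed by a small amount of noise.
rng = np.random.default_rng(0)
dim = 50
offset = rng.normal(size=dim)
a = rng.normal(size=dim)
c = rng.normal(size=dim)
b = a + offset + 0.05 * rng.normal(size=dim)
d = c + offset + 0.05 * rng.normal(size=dim)

# Congruence for the analogy pair versus a control with an unrelated endpoint.
score = analogy_congruence(a, b, c, d)
unrelated = analogy_congruence(a, b, c, rng.normal(size=dim))
print(f"analogy pair: {score:.3f}, unrelated control: {unrelated:.3f}")
```

The comparison is made in the original high-dimensional space rather than in a two-dimensional projection, since the caption notes that nonlinear methods such as UMAP distort the metric over large scales.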