Multi-Dimensional Spectral Geometry of Biological Knowledge in Single-Cell Transformer Representations

Ihor Kendiukhov

TL;DR

The results indicate that biological transformers learn an interpretable internal model of cellular organization, with implications for regulatory network inference, drug target prioritization, and model auditing.

Abstract

Single-cell foundation models such as scGPT learn high-dimensional gene representations, but what biological knowledge these representations encode remains unclear. We systematically decode the geometric structure of scGPT internal representations through 63 iterations of automated hypothesis screening (183 hypotheses tested), revealing that the model organizes genes into a structured biological coordinate system rather than an opaque feature space. The dominant spectral axis separates genes by subcellular localization, with secreted proteins at one pole and cytosolic proteins at the other. Intermediate transformer layers transiently encode mitochondrial and ER compartments in a sequence that mirrors the cellular secretory pathway. Orthogonal axes encode protein-protein interaction networks with graded fidelity to experimentally measured interaction strength (Spearman rho = 1.000 across n = 5 STRING confidence quintiles, p = 0.017). In a compact six-dimensional spectral subspace, the model distinguishes transcription factors from their target genes (AUROC = 0.744, all 12 layers significant). Early layers preserve which specific genes regulate which targets, while deeper layers compress this into a coarser regulator versus regulated distinction. Repression edges are geometrically more prominent than activation edges, and B-cell master regulators BATF and BACH2 show convergence toward the B-cell identity anchor PAX5 across transformer depth. Cell-type marker genes cluster with high fidelity (AUROC = 0.851). Residual-stream geometry encodes biological structure complementary to attention patterns. These results indicate that biological transformers learn an interpretable internal model of cellular organization, with implications for regulatory network inference, drug target prioritization, and model auditing.
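
The abstract reports Spearman rho = 1.000 across n = 5 quintiles with p = 0.017. The text here does not say how that p-value was obtained. One reading that reproduces the number is an exact two-sided permutation test: only 2 of the 5! = 120 possible rank orderings are perfectly monotone, so p = 2/120, roughly 0.0167. The sketch below enumerates the orderings to check that arithmetic. The function names are illustrative and the choice of test is an assumption, not something confirmed by the paper.

```python
from itertools import permutations
from math import factorial


def spearman_rho(x_ranks, y_ranks):
    """Spearman correlation for two tie-free rank vectors of equal length."""
    n = len(x_ranks)
    d2 = sum((a - b) ** 2 for a, b in zip(x_ranks, y_ranks))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


def exact_two_sided_p(n, observed_rho):
    """Exact permutation p-value: share of orderings with |rho| >= |observed|.

    Assumption: this is only one plausible way to arrive at the reported p.
    """
    base = tuple(range(1, n + 1))
    threshold = abs(observed_rho) - 1e-12  # tolerance for float comparison
    hits = sum(
        1
        for perm in permutations(base)
        if abs(spearman_rho(base, perm)) >= threshold
    )
    return hits / factorial(n)


# Five confidence quintiles with a perfectly monotone relationship.
p_value = exact_two_sided_p(5, 1.0)
print(round(p_value, 3))
```

With only five bins, 2/120 is the smallest two-sided p-value such a test can return, so the reported value is consistent with a perfect ranking but cannot be made stronger without finer binning.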

Paper Structure

This paper contains 48 sections, 5 figures, and 7 tables.

Figures (5)

  • Figure 1: Spectral structure and biological organization across transformer depth. (a) $\mathrm{SV}_{1}$ variance fraction (195-gene submatrix) increases from 19% (L0) to 77% (L11) as the model progressively concentrates gene representations onto the secretory/localization axis. (b) TRRUST regulatory pairs co-localize in $\mathrm{SV}_{2}$ poles above null at all layers, indicating that the model embeds co-regulated genes nearby in its internal geometry.
  • Figure 2: A compact six-dimensional subspace separates transcription factors from targets. Joint $\mathrm{SV}_{2}$--$\mathrm{SV}_{7}$ (green) outperforms individual subspaces at nearly all layers. The complementary depth profiles of $\mathrm{SV}_{5}$--$\mathrm{SV}_{7}$ (early-dominant, orange) and $\mathrm{SV}_{2}$--$\mathrm{SV}_{4}$ (mid-depth, blue) ensure regulatory information is never absent.
  • Figure 3: Cross-seed robustness. The joint classifier maintains AUROC $> 0.65$ across all layer--seed combinations (three independent random seeds).
  • Figure 4: Edge-level regulatory geometry peaks at early layers and decays with depth. $\mathrm{SV}_{5}$--$\mathrm{SV}_{7}$ (orange) carries edge-level signal at layers 0--8; $\mathrm{SV}_{2}$--$\mathrm{SV}_{4}$ (blue) is near chance.
  • Figure 5: Repression edges are geometrically more prominent than activation edges in both spectral subspaces.
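
The $\mathrm{SV}_{1}$ variance fraction in Figure 1(a) can be illustrated with a short sketch. It assumes the fraction is $s_1^2 / \sum_i s_i^2$, computed from the singular values of a genes-by-dimensions matrix at one layer. Whether the matrix is mean-centered first is not stated in the captions, so centering is exposed as a flag. The function name `sv_variance_fractions` and the synthetic data are hypothetical stand-ins, not the paper's pipeline or its embeddings.

```python
import numpy as np


def sv_variance_fractions(embeddings, center=True):
    """Return s_i^2 / sum_j s_j^2 for each singular value, largest first.

    Assumption: `embeddings` is a (genes x hidden_dim) matrix from one layer.
    """
    X = np.asarray(embeddings, dtype=float)
    if center:
        # Remove the per-dimension mean so fractions describe variance.
        X = X - X.mean(axis=0, keepdims=True)
    s = np.linalg.svd(X, compute_uv=False)  # singular values, descending
    energy = s ** 2
    return energy / energy.sum()


# Synthetic stand-in: 195 genes, with one shared axis of varying strength.
rng = np.random.default_rng(0)
n_genes, dim = 195, 64
axis = rng.normal(size=dim)
axis /= np.linalg.norm(axis)
scores = rng.normal(size=(n_genes, 1))   # per-gene position on the axis
noise = rng.normal(size=(n_genes, dim))  # unstructured background

weak = 0.5 * scores * axis + noise       # axis barely visible
strong = 10.0 * scores * axis + noise    # axis dominates the matrix

frac_weak = sv_variance_fractions(weak)
frac_strong = sv_variance_fractions(strong)
print(f"SV1 fraction, weak axis:   {frac_weak[0]:.2f}")
print(f"SV1 fraction, strong axis: {frac_strong[0]:.2f}")
```

Applied per layer to real embeddings, a rising first entry would correspond to the concentration onto a single axis that the caption describes. The synthetic example only shows that the statistic behaves as expected when one direction is made stronger.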