
What Topological and Geometric Structure Do Biological Foundation Models Learn? Evidence from 141 Hypotheses

Ihor Kendiukhov

TL;DR

An AI-driven executor-brainstormer loop that proposed, tested, and refined 141 geometric and topological hypotheses across 52 iterations, covering persistent homology, manifold distances, cross-model alignment, community structure, and directed topology, finds that structure is shared across independently trained models and is more localized than it first appears.

Abstract

When biological foundation models such as scGPT and Geneformer process single-cell gene expression, what geometric and topological structure forms in their internal representations? Is that structure biologically meaningful or a training artifact, and how confident should we be in such claims? We address these questions through autonomous large-scale hypothesis screening: an AI-driven executor-brainstormer loop that proposed, tested, and refined 141 geometric and topological hypotheses across 52 iterations, covering persistent homology, manifold distances, cross-model alignment, community structure, and directed topology, all with explicit null controls and disjoint gene-pool splits. Three principal findings emerge. First, the models learn genuine geometric structure. Gene embedding neighborhoods exhibit non-trivial topology, with persistent homology significant in 11 of 12 transformer layers at p < 0.05 in the weakest domain and 12 of 12 in the other two. A multi-level distance hierarchy shows that manifold-aware metrics outperform Euclidean distance for identifying regulatory gene pairs, and graph community partitions track known transcription factor target relationships. Second, this structure is shared across independently trained models. CCA alignment between scGPT and Geneformer yields canonical correlation of 0.80 and gene retrieval accuracy of 72 percent, yet none of 19 tested methods reliably recover gene-level correspondences. The models agree on the global shape of gene space but not on precise gene placement. Third, the structure is more localized than it first appears. Under stringent null controls applied across all null families, robust signal concentrates in immune tissue, while lung and external lung signals weaken substantially.
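The cross-model comparison above rests on canonical correlation analysis (CCA) scored against a null. The following is a minimal sketch of that idea, not the paper's pipeline: the embeddings are synthetic, the names `emb_a`, `emb_b`, and `canonical_correlations` are hypothetical, and the null used here (shuffling which gene row pairs with which) is an assumed stand-in for the paper's own null families.

```python
import numpy as np


def canonical_correlations(X, Y):
    """Canonical correlations between two views whose rows are matched items.

    Uses the QR/SVD formulation: after centering, the singular values of
    Qx^T Qy (orthonormal bases of each view) are the canonical correlations.
    Assumes both centered matrices have full column rank.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.clip(s, 0.0, 1.0)


rng = np.random.default_rng(0)
n_genes = 500

# Synthetic stand-ins for two models' gene embeddings (illustrative only):
# both are noisy linear views of the same 8-dimensional latent structure.
shared = rng.normal(size=(n_genes, 8))
emb_a = shared @ rng.normal(size=(8, 16)) + 0.5 * rng.normal(size=(n_genes, 16))
emb_b = shared @ rng.normal(size=(8, 12)) + 0.5 * rng.normal(size=(n_genes, 12))

observed = canonical_correlations(emb_a, emb_b).mean()

# Assumed null: break the gene-to-gene pairing by permuting rows of one view.
null = np.array([
    canonical_correlations(emb_a, emb_b[rng.permutation(n_genes)]).mean()
    for _ in range(200)
])

# One-sided permutation p-value with the usual +1 correction.
p_value = (1 + np.sum(null >= observed)) / (1 + len(null))
print(f"mean canonical correlation: {observed:.3f}")
print(f"null mean: {null.mean():.3f}, permutation p: {p_value:.4f}")
```

A high mean canonical correlation only says that the two spaces share linear structure. It does not imply that individual genes land in matching positions, which is the distinction the abstract draws between global shape and gene-level correspondence.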

Paper Structure (20 sections, 8 figures, 3 tables)


Figures (8)

  • Figure 1: Distribution of hypothesis outcomes across nine content families (111 of 141 total hypotheses; the remaining 30 are cross-cutting methodological variants that span multiple families: topology stability tests, null-framework development, and split-design validation). Approximately 27 showed positive results under their primary null control, 21 were inconclusive or partial, and 63 were decisively negative. Under the strictest max-null audit (Section \ref{sec:strict_null}), fewer than 15 survive, which puts the robust-positive rate at roughly 10%.
  • Figure 2: Cross-model alignment between scGPT and Geneformer (H24). Bars show observed metrics; dashed lines show null expectations. Across all four alignment metrics and all three tissue domains, observed values substantially exceed null baselines, confirming that the two models converge on similar geometric organization despite independent training.
  • Figure 3: Persistent homology across transformer layers and tissue domains. (a) H1 persistence delta (observed minus null) for each layer, showing that topological structure exceeds null expectations across all domains and most layers, with characteristic peaks in early and middle layers. (b) Statistical significance ($-\log_{10} p$) of the H1 signal; dashed lines mark $p = 0.05$ and $p = 0.01$ thresholds. The immune and external-lung domains show consistently strong significance, while the lung domain is more variable.
  • Figure 4: Geodesic manifold distances outperform Euclidean for regulatory edge discrimination (H13). (a) Source-disjoint and (b) target-disjoint gene pool splits. The shaded area shows $\Delta$AUROC (geodesic minus Euclidean); red stars mark layers where the improvement is statistically significant ($p < 0.05$). The advantage is modest ($\Delta$AUROC $\approx 0.01$) but consistent across splits and concentrated in middle transformer layers.
  • Figure 5: Signed motif--community hardening (H123), the strongest finding across all 141 hypotheses. (a) Effect size ($\Delta$AUROC vs. the H70 geometric baseline) across domain-split groups, showing consistently positive improvement. (b) Null-gap analysis: the observed signal minus the 95th percentile of the null distribution is positive in all test rows, making H123 the only hypothesis in the campaign to achieve complete null-gap coverage.
  • ...and 3 more figures
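
The captions for Figures 4 and 5 describe two quantities: $\Delta$AUROC (one score's AUROC minus a baseline's) and a null gap (observed signal minus the 95th percentile of a null distribution). The sketch below shows how such quantities could be computed. It is illustrative only: the data are synthetic, the names `auroc`, `score_new`, and `score_base` are hypothetical, and the label-permutation null is an assumption rather than the null construction used in the paper.

```python
import numpy as np


def auroc(scores, labels):
    """AUROC via the Mann-Whitney rank formula (assumes no tied scores)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    u = ranks[labels].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)


rng = np.random.default_rng(1)
n_pairs = 2000

# Synthetic gene pairs: label 1 marks a (hypothetical) regulatory edge.
labels = rng.random(n_pairs) < 0.2
# Two hypothetical similarity scores; the "new" one separates edges better.
score_base = 0.3 * labels + rng.normal(size=n_pairs)
score_new = 1.5 * labels + rng.normal(size=n_pairs)

delta_observed = auroc(score_new, labels) - auroc(score_base, labels)

# Assumed null: shuffle the edge labels and recompute the same statistic.
null_deltas = np.empty(500)
for i in range(500):
    shuffled = rng.permutation(labels)
    null_deltas[i] = auroc(score_new, shuffled) - auroc(score_base, shuffled)

# Null gap as described in the Figure 5 caption: observed minus the
# 95th percentile of the null distribution; positive means it clears the null.
null_gap = delta_observed - np.percentile(null_deltas, 95)
print(f"delta AUROC: {delta_observed:.3f}, null gap: {null_gap:.3f}")
```

Reporting the gap against the 95th percentile, rather than against the null mean, sets a stricter bar: a positive gap means the observed effect exceeds all but the top 5% of null draws.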