The Geometric Alignment Tax: Tokenization vs. Continuous Geometry in Scientific Foundation Models

Prashant C. Raju

Abstract

Foundation models for biology and physics optimize predictive accuracy, but their internal representations systematically fail to preserve the continuous geometry of the systems they model. We identify the root cause: the Geometric Alignment Tax, an intrinsic cost of forcing continuous manifolds through discrete categorical bottlenecks. Controlled ablations on synthetic dynamical systems demonstrate that replacing cross-entropy with a continuous head on an identical encoder reduces geometric distortion by up to 8.5x, while learned codebooks exhibit a non-monotonic double bind where finer quantization worsens geometry despite improving reconstruction. Under continuous objectives, three architectures differ by 1.3x; under discrete tokenization, they diverge by 3,000x. Evaluating 14 biological foundation models with rate-distortion theory and MINE, we identify three failure regimes: Local-Global Decoupling, Representational Compression, and Geometric Vacuity. A controlled experiment confirms that Evo 2's reverse-complement robustness on real DNA reflects conserved sequence composition, not learned symmetry. No model achieves simultaneously low distortion, high mutual information, and global coherence.

Paper Structure

This paper contains 176 sections, 11 equations, 5 figures, 36 tables.

Figures (5)

  • Figure 1: A. Track A vs. Track B Lipschitz profiles: smooth arcs (continuous physics) vs. divergent, multi-scale fracture (discrete biology). B. Continuous vs. discrete Procrustes D across architectures on the Lorenz dataset at 1% noise. All continuous conditions cluster near zero; discrete conditions span an order of magnitude. C. VQ double bind: reconstruction MSE (decreasing) vs. Procrustes D (non-monotone) vs. codebook size K, with 1/log(K) fit overlaid.
  • Figure 2: A. ESM-2 composite stability (blue, left axis) vs. parameters, with Procrustes reduction overlaid (orange, right axis). Stability declines monotonically from 8M to 3B; the 15B "recovery" is unmasked by the simultaneous spike in Procrustes reduction, revealing global manifold drift rather than genuine geometric improvement. B. Conceptual illustration of the two failure modes. Ground Truth: the manifold is anchored and internally cohesive. Brittle Glass (small/medium Transformers): high Procrustes error with low internal RDM; the manifold fractures internally. Untethered Gel (large Transformers): high Procrustes error with high internal RDM; the manifold drifts as a coherent block.
  • Figure 3: A. Texture Hypothesis Test. RC RDM similarity across four conditions for Evo 2 (7B, 8K context, 10,000 sequences). Dinuc-shuffled real DNA (per-sequence $k$-mer counts preserved) recovers $97\%$ of the real-random gap; texture-matched Markov (population-level statistics only) recovers $3\%$. B. The RC Dissociation explained. On synthetic DNA (left), discrete tokens destroy the $\mathrm{A}{\leftrightarrow}\mathrm{T}$ / $\mathrm{C}{\leftrightarrow}\mathrm{G}$ bijection entirely (RDM ${\sim}\,0.15$). On real genomic DNA (right), conserved macroscopic $k$-mer composition creates overlapping texture between forward and RC embeddings (RDM ${\sim}\,0.88$), masking the failure. High RDM on real DNA reflects statistical artifact, not learned equivariance.
  • Figure 4: (A) Excess MI (bias-corrected) across the three failure regimes. ProtMamba falls below zero (Geometric Vacuity), ESM-1b and OpenFold show large positive values (Representational Compression), and Evo 2 is modest and positive (Local-Global Decoupling). Random baselines sit at zero by construction. (B) Regime I: Evo 2 global vs. local MI. The flat curve across $64\times$ context expansion confirms informational shallowness. (C) Regime II: ESM-1b vs. Evoformer excess MI at each sequence length, with Procrustes disparity annotated. The Evoformer amplifies MI while warping the manifold.
  • Figure 5: Effect of embedding-level RCCR on DNABERT-2 (117M). (A) Training loss converges rapidly ($99.4\%$ reduction in 10 epochs). (B) Per-sequence RC cosine gap collapses from $0.041$ to $0.000$: perfect pointwise consistency. (C) Despite this, Procrustes disparity between forward and RC embedding matrices increases $91\%$ ($0.76 \to 1.45$): population-level geometric structure degrades. (D) Shesha composite stability by perturbation type. RCCR composites are marginally higher, driven by improved feature-split scores; RDM similarity and perturbation magnitude both degrade (Table \ref{tab:rccr}). (E) Delta composite (RCCR minus baseline). Uniform positive shift masks underlying geometric deterioration. RCCR achieves consistency by flattening perturbation sensitivity, not by aligning manifold geometry.
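
The captions above rely on two population-level comparisons between embedding matrices: Procrustes disparity and RDM similarity. The following is a minimal sketch of how such quantities could be computed, not the paper's implementation. The function names `procrustes_disparity` and `rdm_similarity`, the Euclidean distance metric, the Spearman correlation, and the synthetic data are all assumptions made for illustration. Note also that `scipy.spatial.procrustes` returns a normalized disparity in $[0, 1]$, so the values above 1 reported in Figure 5 imply that the paper uses a different normalization.

```python
import numpy as np
from scipy.spatial import procrustes
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr


def procrustes_disparity(X, Y):
    """Residual after optimally translating, scaling, and rotating Y onto X.

    Uses scipy's normalized definition (0 = identical shape, at most 1).
    This is an assumed stand-in for the paper's Procrustes D.
    """
    _, _, disparity = procrustes(X, Y)
    return disparity


def rdm_similarity(X, Y):
    """Spearman correlation between the pairwise-distance vectors of X and Y.

    Each row is one sample; the RDM is built from Euclidean distances
    (an assumption; other metrics are equally plausible).
    """
    rho, _ = spearmanr(pdist(X, metric="euclidean"), pdist(Y, metric="euclidean"))
    return rho


# Hypothetical data: 50 samples in 8 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))

# A rotated, rescaled, and shifted copy of X: same geometry, different coordinates.
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))
Y_same = 2.0 * X @ Q + 3.0

# An unrelated point cloud with no shared geometry.
Y_rand = rng.normal(size=(50, 8))

d_same = procrustes_disparity(X, Y_same)
d_rand = procrustes_disparity(X, Y_rand)
r_same = rdm_similarity(X, Y_same)
r_rand = rdm_similarity(X, Y_rand)

print("Procrustes disparity (same geometry):", d_same)
print("Procrustes disparity (unrelated):", d_rand)
print("RDM similarity (same geometry):", r_same)
print("RDM similarity (unrelated):", r_rand)
```

Both measures ignore rotation, uniform scaling, and translation, so they compare the shape of the whole point cloud rather than individual rows. A per-sequence cosine gap, by contrast, compares matched rows one at a time and is a separate quantity from either measure.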