
Same Geometry, Opposite Noise: Transformer Magnitude Representations Lack Scalar Variability

Jon-Paul Cacioli

Abstract

Scalar variability -- the finding that representational noise scales proportionally with magnitude, producing a constant coefficient of variation -- is a hallmark of biological magnitude systems. We tested whether transformer language models exhibit this property by analysing the dispersion of hidden-state representations across carrier sentences for 26 numerical magnitudes in three 7--8B-parameter models (Llama-3-8B-Instruct, Mistral-7B-Instruct-v0.3, Llama-3-8B-Base; data from Cacioli, 2026). We found the opposite: representational variability decreased with magnitude along the magnitude axis (scaling exponent $\alpha \approx -0.19$; $\alpha > 0$ in 0 of 16 primary layers for all three models). The negative sign held in full-dimensional space ($\alpha \approx -0.04$) and after sentence-identity correction ($\alpha \approx -0.007$). The anti-scalar pattern was 3--5$\times$ stronger along the magnitude axis than along orthogonal dimensions, and corpus frequency strongly predicted per-magnitude variability ($\rho = .84$). These results demonstrate that distributional learning alone is insufficient to produce scalar variability: transformers reproduce log-compressive magnitude geometry but not the constant-CV noise signature observed in biological systems.
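
As a concrete illustration of the scaling-exponent analysis summarised above, the minimal sketch below fits an OLS slope to log variability against log magnitude. It is not the paper's pipeline: the array layout, the dispersion definition (mean Euclidean distance to the per-magnitude centroid), and all variable names are illustrative assumptions. Under scalar variability the slope should be near 1 (constant coefficient of variation); the anti-scalar result reported here corresponds to a negative slope.

```python
# Minimal sketch, not the paper's exact pipeline: estimate the scaling
# exponent alpha relating representational variability to magnitude.
# Assumes reps[m] is an (n_sentences, d) array of hidden states for
# magnitude m, collected across carrier sentences at a single layer.
import numpy as np

def dispersion(X):
    """Mean Euclidean distance to the centroid (one possible V_eucl)."""
    return np.linalg.norm(X - X.mean(axis=0), axis=1).mean()

def scaling_exponent(reps):
    """OLS slope of log variability on log magnitude.

    Scalar variability predicts a slope near 1 (constant CV);
    a negative slope is the anti-scalar pattern.
    """
    mags = np.array(sorted(reps))
    V = np.array([dispersion(reps[m]) for m in mags])
    slope, _ = np.polyfit(np.log(mags), np.log(V), deg=1)
    return slope

# Synthetic demo: 26 magnitudes, 50 carrier sentences, 64 dimensions.
rng = np.random.default_rng(0)
reps = {m: rng.normal(size=(50, 64)) for m in range(1, 27)}
print(f"alpha = {scaling_exponent(reps):.3f}")  # near 0 for i.i.d. noise
```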



Figures (5)

  • Figure 1: Representational variability as a function of numerical magnitude (log--log axes) at layer 16 of Llama-3-8B-Instruct. Left: raw $V_\text{eucl}$. Right: sentence-corrected $V_\text{residual}$. Red line: OLS fit. Grey dotted line: scalar prediction ($\alpha = 1$). Data cluster near constant or slightly declining variability, far below the scalar prediction.
  • Figure 2: Corpus frequency predicts representational variability. Each point is one of 26 numerical magnitudes at layer 16 of Llama-3-8B-Instruct. Numbers appearing more frequently in training data show greater representational dispersion. Spearman $\rho = .882$, $p < 10^{-9}$. Labels indicate selected magnitudes.
  • Figure 3: E4: On-axis (PC1, magnitude direction) vs off-axis scaling exponent across primary layers for Llama-3-8B-Base. The anti-scalar pattern is $\sim$3$\times$ stronger along the magnitude axis (mean $\alpha = -0.169$) than orthogonal dimensions (mean $\alpha = -0.060$). Wilcoxon $p < .001$. All models showed the same dissociation.
  • Figure 4: E6: Instruction tuning amplifies the anti-scalar pattern. Left: layerwise $\alpha$ for Llama-Instruct (mean $= -0.046$) vs Llama-Base (mean $= -0.031$). Right: per-layer difference ($\Delta\alpha < 0$ at all 16 layers, Wilcoxon $p < .001$).
  • Figure 5: Scaling exponent $\alpha$ across all layers for all three models. Left: $V_\text{eucl}$ (raw). Centre: $V_\text{residual}$ (sentence-corrected). Right: $V_\text{proj}$ (magnitude axis). Dashed grey line: scalar prediction ($\alpha = 1$). Dotted grey line: $\alpha = 0$. All models show $\alpha < 0$ at all primary layers across all measures.
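
The captions above refer to three dispersion measures: raw Euclidean dispersion ($V_\text{eucl}$), a sentence-corrected variant ($V_\text{residual}$), and dispersion along the magnitude axis ($V_\text{proj}$). The sketch below shows one plausible way such measures could be computed; the specific definitions (centroid distance, removal of per-sentence means, PC1 of the per-magnitude centroids as the magnitude axis) are assumptions for illustration, not the paper's stated formulas.

```python
# Hedged sketch of the three dispersion measures named in the captions;
# the definitions here are plausible readings, not the paper's own.
# H has shape (n_magnitudes, n_sentences, d): hidden states for each
# magnitude embedded in each carrier sentence at a single layer.
import numpy as np

def v_eucl(H):
    """Per-magnitude mean distance to the magnitude centroid (raw)."""
    centroids = H.mean(axis=1, keepdims=True)
    return np.linalg.norm(H - centroids, axis=2).mean(axis=1)

def v_residual(H):
    """Same measure after subtracting each carrier sentence's mean across
    magnitudes -- one way to correct for sentence identity."""
    return v_eucl(H - H.mean(axis=0, keepdims=True))

def v_proj(H):
    """Dispersion of scalar projections onto a magnitude axis, taken here
    as PC1 of the per-magnitude centroids."""
    centroids = H.mean(axis=1)                    # (n_magnitudes, d)
    centred = centroids - centroids.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return (H @ vt[0]).std(axis=1)                # project, then spread
```

Each function returns one variability value per magnitude, which would then feed the same log--log slope fit sketched after the abstract.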