Do Foundation Models Know Geometry? Probing Frozen Features for Continuous Physical Measurement

Yakov Pyotr Shkolnikov

TL;DR

A single frozen backbone can serve as a multi-task geometric sensor through lightweight probes, without fine-tuning or text generation; layer-wise analysis also reveals a universal mid-network accuracy peak across all architectures.

Abstract

Vision-language models encode continuous geometry that their text pathway fails to express: a 6,000-parameter linear probe extracts hand joint angles at 6.1 degrees MAE from frozen features, while the best text output achieves only 20.0 degrees -- a 3.3x bottleneck. LoRA fine-tuning (r=16, 2,000 images) reduces text-pathway error to 6.5 degrees, close to the probe's 6.1, providing evidence for a pathway-training deficit rather than a representational one. Training objective determines accuracy more than architecture: five encoders spanning self-supervised, contrastive, and hybrid paradigms converge to statistically equivalent accuracy (R^2 approximately 0.55, TOST-equivalent at delta=0.03) despite sharing as little as CKA=0.41 representational similarity -- functional convergence without representational convergence. Autoregressive generation damages geometric fidelity, but the damage originates in the generation process, not in language alignment: Qwen2.5-VL's LLM layers actually improve probe accuracy over its raw vision encoder. Layer-wise analysis reveals a universal mid-network accuracy peak across all architectures, with attention heads in layers 18-22 carrying disproportionate geometric signal. These findings enable a single frozen backbone to function as a multi-task geometric sensor through lightweight probes, without fine-tuning or text generation.
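
To make the probing setup concrete, the following is a minimal sketch of a linear probe trained on frozen features and scored by MAE. It is not the paper's implementation: the closed-form ridge solver, the `l2` value, the feature and target dimensions, and the synthetic data standing in for backbone features and joint angles are all assumptions chosen for illustration.

```python
import numpy as np

def fit_linear_probe(features, targets, l2=1e-3):
    """Closed-form ridge regression; returns weights of shape (d + 1, k).

    The last row is the bias term, which is left unregularized.
    """
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    reg = l2 * np.eye(X.shape[1])
    reg[-1, -1] = 0.0  # do not penalize the bias
    return np.linalg.solve(X.T @ X + reg, X.T @ targets)

def predict(weights, features):
    """Apply the probe to a batch of (frozen) feature vectors."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    return X @ weights

def mean_absolute_error(pred, target):
    return float(np.mean(np.abs(pred - target)))

# Synthetic stand-in for frozen features (n x d) and joint angles (n x k).
# Sizes are hypothetical and do not reflect the paper's configuration.
rng = np.random.default_rng(0)
n, d, k = 2000, 64, 5
feats = rng.normal(size=(n, d))
true_map = rng.normal(size=(d, k))
angles = feats @ true_map + 0.1 * rng.normal(size=(n, k))

# Fit on a training split, evaluate on held-out samples.
probe = fit_linear_probe(feats[:1500], angles[:1500])
test_mae = mean_absolute_error(predict(probe, feats[1500:]), angles[1500:])
print(f"held-out MAE: {test_mae:.3f}")
```

The backbone never receives gradients in this setup; only the probe weights are fit, which is what keeps the parameter count small.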

Paper Structure (50 sections, 4 figures, 17 tables)

Figures (4)

  • Figure 1: Overview. Frozen foundation model features encode continuous geometry (joint angles) with 6.1$^\circ$ MAE via a linear probe, while the text pathway achieves only 20.0$^\circ$, a 3.3$\times$ bottleneck. Adding LoRA fine-tuning (r = 16) partially recovers probe-level accuracy (6.5$^\circ$) through the text pathway.
  • Figure 2: Bootstrap 95% CIs for 13 models on FreiHAND (ConvNeXt-L omitted; see Table \ref{tab:cross_arch}). Five models form a TOST equivalence cluster (shaded) at $R^2 \approx 0.55$. DINOv2 falls outside despite being the same architecture family as DINOv3.
  • Figure 3: CKA similarity vs. probing accuracy difference for all 28 pairwise comparisons among eight models (six ViT-L + two ViT-B) on FreiHAND. Spearman $\rho$ = 0.03 ($p$ = 0.88): no detectable correlation between representational similarity and geometric probing accuracy ($n$ = 28). The most similar pair (DINOv2--DINOv3, CKA = 0.88) differs by 0.033 R$^2$; the least similar pair (SigLIP 2--CLIP, CKA = 0.41) differs by only 0.008.
  • Figure 4: Layer-wise R$^2$ on FreiHAND for ten models (solid: vision encoders; dashed: LLM decoders). X-axis is normalized layer depth (0 = first, 1 = last). Vision encoders rise monotonically; LLM decoders peak at early layers and decline, consistent with autoregressive processing discarding fine-grained geometry.
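
The CKA values quoted in Figure 3 compare two models' representations of the same inputs. The sketch below computes the linear variant of CKA on synthetic activations; whether the paper uses linear or kernel CKA is not stated here, so the linear form, the matrix sizes, and the random data are assumptions for illustration only.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two activation matrices with matching rows.

    X has shape (n, d1) and Y has shape (n, d2); row i of each matrix
    must correspond to the same input sample.
    """
    X = X - X.mean(axis=0, keepdims=True)  # center each feature
    Y = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, ord="fro")
    norm_y = np.linalg.norm(Y.T @ Y, ord="fro")
    return float(cross / (norm_x * norm_y))

rng = np.random.default_rng(0)
A = rng.normal(size=(500, 32))

# B is a rotated and rescaled copy of A: same information, different basis.
Q, _ = np.linalg.qr(rng.normal(size=(32, 32)))
B = 3.0 * A @ Q

# C is unrelated to A and has a different width.
C = rng.normal(size=(500, 48))

cka_same = linear_cka(A, B)
cka_diff = linear_cka(A, C)
print(f"rotated copy: {cka_same:.3f}, unrelated: {cka_diff:.3f}")
```

Because linear CKA is invariant to orthogonal transformations and isotropic scaling, the rotated copy scores essentially 1 while the unrelated matrix scores near 0. This is why two encoders can support equally accurate probes yet still show low CKA: the measure reflects overall representational geometry, not the task-relevant subspace alone.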