Towards aligned body representations in vision models

Andrey Gizdov, Andrea Procopio, Yichen Li, Daniel Harari, Tomer Ullman

TL;DR

This work asks whether vision models trained for segmentation develop human-like coarse body representations essential for intuitive physics. It adapts a 50-participant psychophysical task to seven segmentation architectures spanning a wide capacity range and analyzes representations using a change-detection metric, $\mathrm{RAC}_{\text{seg}}$. The key finding is that smaller, resource-constrained models naturally form coarse, convex-like encodings, whereas larger models retain finer geometric detail, suggesting that coarse representations emerge under efficiency pressures. These results position machine models as scalable probes of human physical reasoning and inform alignment and interpretability efforts in AI systems.

Abstract

Human physical reasoning relies on internal "body" representations: coarse, volumetric approximations that capture an object's extent and support intuitive predictions about motion and physics. While psychophysical evidence suggests humans use such coarse representations, their internal structure remains largely unknown. Here we test whether vision models trained for segmentation develop comparable representations. We adapt a psychophysical experiment conducted with 50 human participants to a semantic segmentation task and test a family of seven segmentation networks of varying size. We find that smaller models naturally form human-like coarse body representations, whereas larger models tend toward overly detailed, fine-grained encodings. Our results demonstrate that coarse representations can emerge under limited computational resources, and that machine representations can provide a scalable path toward understanding the structure of physical reasoning in the brain.

Paper Structure

This paper contains 9 sections, 3 equations, 4 figures.

Figures (4)

  • Figure 1: Stimuli vs Body representations (dotted lines) in (A) humans and (B) vision models.
  • Figure 2: (a) Change detection experiment: Humans (left) vs. Model (right). (b) A small local piece added to one of three locations: Nofill, Concave, and Convex body parts.
  • Figure 3: Mean $\mathrm{RAC}_{\text{seg}}$ during fine-tuning per category across models.
  • Figure 4: Mask overlays (first row) and probability heatmaps after 10 epochs of training across models.