Human-Like Coarse Object Representations in Vision Models

Andrey Gizdov, Andrea Procopio, Yichen Li, Daniel Harari, Tomer Ullman

TL;DR

This work investigates whether modern segmentation models learn human-like coarse object bodies that support intuitive physics. By pairing human time-to-collision (TTC) judgments with predictions from six SegFormer variants and manipulating training time, model size, and pruning, the authors reveal a non-monotonic, inverted-U-shaped alignment: intermediate granularity best matches human concavity effects, while representations that are too coarse or too precise diverge. The findings support a resource-rational view: coarse, physics-relevant representations emerge from capacity and computation constraints rather than bespoke biases. Practically, the study offers simple levers (mid-training checkpoints, moderate pruning, and mid-sized architectures) to elicit physics-efficient representations in vision models, with implications for human-AI interaction and safe deployment in dynamic environments.

Abstract

Humans appear to represent objects for intuitive physics with coarse, volumetric "bodies" that smooth over concavities, trading fine visual detail for efficient physical prediction, yet the internal structure of these bodies is largely unknown. Segmentation models, in contrast, optimize pixel-accurate masks that may misalign with such bodies. We ask whether and when these models nonetheless acquire human-like bodies. Using a time-to-collision (TTC) behavioral paradigm, we introduce a comparison pipeline and alignment metric, then vary model training time, size, and effective capacity via pruning. Across all manipulations, alignment with human behavior follows an inverse U-shaped curve: small, briefly trained, or pruned models under-segment into blobs; large, fully trained models over-segment with boundary wiggles; and an intermediate "ideal body granularity" best matches humans. This suggests that human-like coarse bodies emerge from resource constraints rather than bespoke biases, and points to simple knobs (early checkpoints, modest architectures, light pruning) for eliciting physics-efficient representations. We situate these results within resource-rational accounts balancing recognition detail against physical affordances.

Paper Structure (13 sections, 16 equations, 6 figures)

Figures (6)

  • Figure 1: Research overview. (A) Different object representations are useful for different goals: a "body" representation is useful for physical reasoning, whereas a "shape" representation is useful for recognition. (B) Prior work (Li et al., 2023) has shown evidence that humans have such coarse body representations for physical reasoning (e.g. predicting collision times), specifically by "filling in" concave regions. (C) Our research asks whether vision segmentation models have similar representations and under what conditions they develop them.
  • Figure 2: Method: We compared human behavioral responses with vision models' responses in a time-to-collision task. (A) In prior work, human participants watched short videos showing objects moving towards one another, and were asked to respond as soon as they predicted the collision happened. Time-to-collision (TTC) responses were collected from humans and showed a systematic pattern: object pairs with collision points inside a concavity seemed to collide sooner than they actually did, an effect of "filling in" concavities on an object due to coarse body approximation. (B) In this paper, we take pre-trained segmentation models to perform the same task and compare each model's TTC responses with those of humans. We systematically varied different aspects of training to adjust the coarseness of object representations in vision models.
  • Figure 3: Effect of training time on the difference between model behavior and human behavior. (a) A moderately approximate body representation emerges from medium training time and aligns best with human behavior. The y-axis shows the difference between model and human responses $\bar{E}$. Three example visualizations of the segmentation mask along training time are shown, from the coarsest (beginning of training), to intermediate (after short training), to very fine-grained (end of training). (b) Larger models require less training time to reach the most human-like body representation. Larger models (e.g. B2-B5) reach the point with the smallest difference from human behavior faster than smaller models (B0, B1).
  • Figure 4: Moderate pruning of neurons produces more human-like body representations. The x-axis shows the strength of pruning (i.e. percentage of neurons removed). The y-axis shows the difference between model behavior and human behavior $\bar{E}$.
  • Figure 5: Models of intermediate sizes behave more human-like than small or large models, after a fixed number of training steps. The x-axis shows different model sizes. The y-axis shows the difference between model behavior and human behavior. Model size goes from small (B0) to large (B5).
  • ...and 1 more figure
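
The captions above plot a model-versus-human difference $\bar{E}$ against training time, pruning strength, and model size. This section does not give the exact definition of $\bar{E}$, so the following is only a minimal sketch that assumes it can be approximated by the mean absolute difference between model and human TTC responses over a shared set of stimuli. The function names (`mean_ttc_difference`, `most_humanlike`) and all numbers are hypothetical illustrations, not the authors' code or data.

```python
from statistics import mean


def mean_ttc_difference(model_ttc, human_ttc):
    """Mean absolute difference between model and human TTC responses.

    ASSUMPTION: a stand-in for the E-bar plotted in the figures; the
    paper's actual metric may be defined differently.
    """
    if len(model_ttc) != len(human_ttc):
        raise ValueError("need one model response per human response")
    if not model_ttc:
        raise ValueError("need at least one stimulus")
    return mean([abs(m - h) for m, h in zip(model_ttc, human_ttc)])


def most_humanlike(conditions, human_ttc):
    """Score each condition and return (best_key, all_scores)."""
    scores = {
        name: mean_ttc_difference(ttc, human_ttc)
        for name, ttc in conditions.items()
    }
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical toy TTC values in seconds (NOT data from the paper).
human = [1.0, 1.5, 2.0]
checkpoints = {
    "early": [0.5, 1.0, 1.5],    # blob-like masks: collisions predicted too soon
    "mid": [1.0, 1.5, 2.25],     # intermediate granularity
    "late": [1.25, 1.75, 2.25],  # pixel-precise masks: collisions predicted later
}
best, scores = most_humanlike(checkpoints, human)
```

With these toy values the intermediate checkpoint has the smallest difference, which is the qualitative pattern the captions describe: the difference is lowest, and so alignment is highest, at intermediate granularity.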