Human-Like Coarse Object Representations in Vision Models
Andrey Gizdov, Andrea Procopio, Yichen Li, Daniel Harari, Tomer Ullman
TL;DR
This work investigates whether modern segmentation models learn human-like coarse object "bodies" that support intuitive physics. By pairing human time-to-collision (TTC) judgments with predictions from six SegFormer variants and manipulating training time, model size, and pruning, the authors reveal a non-monotonic, inverted-U-shaped alignment: intermediate granularity best matches human concavity effects, while representations that are too coarse or too precise diverge. The findings support a resource-rational view: coarse, physics-relevant representations emerge from capacity and computation constraints rather than bespoke biases. Practically, the study offers simple levers (mid-training checkpoints, moderate pruning, and mid-sized architectures) to elicit physics-efficient representations in vision models, with implications for human-AI interaction and safe deployment in dynamic environments.
Abstract
Humans appear to represent objects for intuitive physics with coarse, volumetric "bodies" that smooth concavities, trading fine visual details for efficient physical predictions, yet the internal structure of these bodies is largely unknown. Segmentation models, in contrast, optimize pixel-accurate masks that may misalign with such bodies. We ask whether and when these models nonetheless acquire human-like bodies. Using a time-to-collision (TTC) behavioral paradigm, we introduce a comparison pipeline and alignment metric, then vary model training time, size, and effective capacity via pruning. Across all manipulations, alignment with human behavior follows an inverted-U-shaped curve: small, briefly trained, or pruned models under-segment into blobs; large, fully trained models over-segment with boundary wiggles; and an intermediate "ideal body granularity" best matches humans. This suggests human-like coarse bodies emerge from resource constraints rather than bespoke biases, and points to simple knobs (early checkpoints, modest architectures, light pruning) for eliciting physics-efficient representations. We situate these results within resource-rational accounts balancing recognition detail against physical affordances.
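To make the link between concavity smoothing and TTC concrete, the toy sketch below compares a pixel-accurate mask of a C-shaped object with a coarsened "body" whose notch has been filled in. It is an illustration only and not the authors' pipeline or alignment metric: the column-wise fill, the helper names `fill_concavities` and `time_to_collision`, the grid size, and the unit speed are all assumptions chosen for simplicity.

```python
import numpy as np

def fill_concavities(mask):
    """Coarsen a binary mask by filling each column between its topmost
    and bottommost occupied pixels (a crude stand-in for a smoothed body)."""
    coarse = np.zeros_like(mask, dtype=bool)
    for col in range(mask.shape[1]):
        rows = np.flatnonzero(mask[:, col])
        if rows.size:
            coarse[rows[0]:rows[-1] + 1, col] = True
    return coarse

def time_to_collision(mask, row, speed=1.0):
    """Time for a point starting at column 0 and moving rightward along
    `row` at constant `speed` (pixels per time unit) to reach the mask."""
    cols = np.flatnonzero(mask[row])
    if cols.size == 0:
        return float("inf")  # the point never hits the object
    return cols[0] / speed

# Hypothetical C-shaped object whose concavity opens toward the left.
mask = np.zeros((7, 10), dtype=bool)
mask[1:6, 4:9] = True    # solid block: rows 1-5, columns 4-8
mask[2:5, 4:7] = False   # notch: rows 2-4, columns 4-6

# The point travels along row 3, straight into the notch.
fine_ttc = time_to_collision(mask, row=3)
coarse_ttc = time_to_collision(fill_concavities(mask), row=3)

print(f"pixel-accurate mask TTC: {fine_ttc}")
print(f"coarse body TTC:         {coarse_ttc}")
```

In this toy setup the coarse body predicts an earlier collision than the pixel-accurate mask, because the point meets the filled-in notch before it reaches the object's true inner wall. A gap of this kind between fine and coarse predictions is what a concavity effect in TTC judgments would look like under these assumptions.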
