Human-level 3D shape perception emerges from multi-view learning

Tyler Bonnen, Jitendra Malik, Angjoo Kanazawa

TL;DR

The paper tackles the question of whether human-level 3D shape perception can emerge from general-purpose learning over naturalistic visual-spatial data. It introduces multi-view vision transformers trained with a visual-spatial objective that predicts spatial cues from sets of images captured from different viewpoints, without object-specific inductive biases. In zero-shot 3D perception tasks, VGGT matches human accuracy and outperforms single-view models, with model readouts predicting human error patterns and reaction times, and a correspondence-based representation revealed through attention analyses. The results support empiricist theories of perception, demonstrate a scalable framework for linking model dynamics to human behavior, and provide open-source resources for reproduction and further study.

Abstract

Humans can infer the three-dimensional structure of objects from two-dimensional visual inputs. Modeling this ability has been a longstanding goal for the science and engineering of visual intelligence, yet decades of computational methods have fallen short of human performance. Here we develop a modeling framework that predicts human 3D shape inferences for arbitrary objects, directly from experimental stimuli. We achieve this with a novel class of neural networks trained using a visual-spatial objective over naturalistic sensory data; given a set of images taken from different locations within a natural scene, these models learn to predict spatial information related to these images, such as camera location and visual depth, without relying on any object-related inductive biases. Notably, these visual-spatial signals are analogous to sensory cues readily available to humans. We design a zero-shot evaluation approach to determine the performance of these 'multi-view' models on a well-established 3D perception task, then compare model and human behavior. Our modeling framework is the first to match human accuracy on 3D shape inferences, even without task-specific training or fine-tuning. Remarkably, independent readouts of model responses predict fine-grained measures of human behavior, including error patterns and reaction times, revealing a natural correspondence between model dynamics and human perception. Taken together, our findings indicate that human-level 3D perception can emerge from a simple, scalable learning objective over naturalistic visual-spatial data. All code, human behavioral data, and experimental stimuli needed to reproduce our findings can be found on our project page.

Paper Structure (1 section, 1 equation, 14 figures)

Table of Contents

  1. Supplementary Material

Figures (14)

  • Figure 1: Schematic of multi-view model training approach and 3D perceptual testing protocol. We evaluate a novel class of multi-view transformers (VGGT-1B, wang2025vggt), which is trained on large-scale, multi-view, naturalistic scene data. During training, VGGT receives sets of images depicting the same scene from different viewpoints (top left) and must learn to predict the relative depth, camera position, and aleatoric uncertainty associated with these images (bottom left). These multi-modal signals are analogous to information that is available to humans through stereo vision and proprioception. Notably, VGGT uses a general transformer architecture with no hand-coded geometric priors: any understanding of 3D structure emerges from learning the predictive relationship between images and these multi-modal cues. To evaluate these multi-view transformers alongside human observers on a 3D perception task, we use a standard experimental design from the cognitive sciences: a concurrent visual discrimination ('oddity') task. This design requires zero-shot visual inference about object shape: given two images of an object from different viewpoints (A and A$'$), and another image of a different object (B), the task is to determine which image contains the non-matching object (e.g., right). We evaluate humans and models on diverse object types, including real-world objects (e.g., chairs, tables) as well as procedurally generated abstract shapes (i.e., 'nonsense' objects).
  • Figure 2: Evaluation approach used to estimate model performance on each trial. We develop a series of zero-shot evaluation metrics to determine the behavior of multi-view models on human 3D perceptual tasks. To estimate accuracy we leverage the model's internal estimate of aleatoric uncertainty (top): we encode all pairwise combinations of images on a trial, extract model uncertainty estimates for each pair, average, then select the oddity to be the object with the lowest paired confidence scores. We compare this model-selected non-match with the ground truth, resulting in a single binary (correct/incorrect) outcome. Next, we compute the margin between matching/non-matching objects to determine the model's confidence for this decision ($\Delta$, top right). We visualize the norm of model responses across several layers, going from early to final layers (left to right). Note that the activation patterns for matching (AA$'$) objects are visibly distinct from those of the non-matching (AB/BA$'$) objects in the final layers. When conducting a similarity analysis across all layers (bottom), for each encoded pair of images, we find that the features of matching objects (AA$'$) become more correlated (orange, bottom), while non-matching objects (AB/BA$'$) become less correlated (black/purple, bottom). We describe the earliest layer where the non-matching object can be identified as the model 'solution layer'.
  • Figure 3: Multi-view models match human 3D perception accuracy, error patterns, and reaction times. When comparing normalized accuracy of humans and models across all conditions of this 3D perception benchmark (left), VGGT matches human performance, while both humans and multi-view models significantly outperform standard vision models like DINOv2. Critically, VGGT's human-level performance does not depend on task-specific training or fine-tuning; the model used only its pre-trained representations learned from multi-view prediction. These findings demonstrate that multi-view learning is sufficient to achieve human-level 3D object perception. Beyond average accuracy, we find that model confidence (i.e., the margin between matching and non-matching objects in each trial) is significantly correlated with human choice behaviors. This indicates that the aleatoric uncertainty used during training provides a natural analogue for human perceptual judgments. Finally, we observe a clear correspondence between model solution layer and human reaction time; as the number of layers required to solve this perceptual task increases, so too does human reaction time needed for correct responses. These data reveal an emergent correspondence between model dynamics and human perception.
  • Figure 4: Investigating between-image activations reveals location-based object correspondence. How does this multi-view model represent the similarity or difference between objects? To address this question we provide a qualitative visualization of the information present in intermediate model layers. For an example trial (e.g., abstract image in the top half; chair in the bottom half), we encode image pairs A, A$'$ and A, B separately, then extract the cross-image block of the attention matrix. Here we manually select keypoints in the reference image, A (different color dots on A, far left), and identify the corresponding patch token in the target images A$'$ and B. We then retrieve the attention distribution over all the target patch tokens. In intermediate model layers (e.g., here we visualize attention from layer 15) we find that different query locations from the reference image A elicit distinct attention patterns across points and across target images A$'$ and B. Concretely, it appears that each query location on A elicits a pattern of attention in A$'$ that corresponds to the same location on the object, albeit a different location in xyz coordinates. This qualitative analysis indicates that the model can represent object-object similarity via the correspondence between spatial locations on each object.
  • Figure S1: Standard vision models fail on 3D shape inferences. We evaluate the ability of large vision models on the MOCHI benchmark (bonnen2024evaluating) using cosine distance, without any learned readout. For each trial, the image with the highest average cosine distance to all other images is selected as the oddity. Across DINOv2, CLIP, and MAE model families at multiple scales, no model approaches human-level performance, with MAE models performing at chance. Even the best-performing model, DINOv2-giant, achieves less than half of human accuracy, indicating that the geometric structure of these feature spaces does not naturally separate 3D object identity. Note that these model performance values are not normalized, unlike subsequent analyses.
  • ...and 9 more figures
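
The accuracy and margin readouts described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: it assumes the model's confidence has already been extracted and averaged to one scalar per encoded image pair. The function name `select_oddity`, the dictionary layout, and the exact margin definition (confidence of the pair that excludes the oddity minus the mean confidence of the pairs that include it) are hypothetical choices made for the example.

```python
def select_oddity(pair_confidence):
    """Pick the odd-one-out image from mean pairwise confidence scores.

    pair_confidence maps a frozenset of two image names to the mean
    confidence for that encoded pair (hypothetical data layout).
    Returns (oddity, margin).
    """
    images = sorted({name for pair in pair_confidence for name in pair})

    # Average the confidence over every pair that contains each image.
    paired = {}
    for img in images:
        scores = [c for pair, c in pair_confidence.items() if img in pair]
        paired[img] = sum(scores) / len(scores)

    # The oddity is the image with the lowest paired confidence.
    oddity = min(images, key=lambda img: paired[img])

    # Assumed margin: matching-pair confidence minus the mean confidence
    # of the pairs that include the selected oddity.
    match = [c for pair, c in pair_confidence.items() if oddity not in pair]
    nonmatch = [c for pair, c in pair_confidence.items() if oddity in pair]
    margin = sum(match) / len(match) - sum(nonmatch) / len(nonmatch)
    return oddity, margin


# Illustrative (made-up) scores for one trial with images A, A', and B.
scores = {
    frozenset({"A", "A_prime"}): 0.9,
    frozenset({"A", "B"}): 0.4,
    frozenset({"A_prime", "B"}): 0.3,
}
oddity, margin = select_oddity(scores)
correct = oddity == "B"  # binary outcome against the ground truth
```

The binary `correct` flag corresponds to the per-trial accuracy outcome, while `margin` plays the role of the confidence readout that the Figure 3 caption compares against human choice behavior.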
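
The 'solution layer' mentioned in the Figure 2 caption could be located along the following lines. This is only a sketch under stated assumptions: it takes one similarity value per layer for each encoded pair, and it treats the non-matching object as identifiable at the first layer where the matching pair (AA') is more similar than both non-matching pairs. That criterion, the function name `solution_layer`, and the example values are hypothetical; the source does not specify the exact rule.

```python
def solution_layer(sim_match, sim_nonmatch_a, sim_nonmatch_b):
    """Return the earliest layer index at which the matching pair is more
    similar than both non-matching pairs, or None if that never happens.

    Each argument is a list with one similarity value per layer
    (assumed criterion; the source does not give the exact rule).
    """
    layers = zip(sim_match, sim_nonmatch_a, sim_nonmatch_b)
    for layer, (m, a, b) in enumerate(layers):
        if m > max(a, b):
            return layer
    return None


# Made-up per-layer similarities for AA', AB, and BA'.
sim_aa = [0.5, 0.5, 0.7, 0.9]
sim_ab = [0.5, 0.6, 0.4, 0.2]
sim_ba = [0.6, 0.5, 0.5, 0.1]
layer = solution_layer(sim_aa, sim_ab, sim_ba)
```

A stricter variant might require the separation to persist through all later layers; which rule matches the paper's analysis is not stated in the captions above.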
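
The cosine-distance readout described in the Figure S1 caption (select the image with the highest average cosine distance to all other images) can be illustrated as below. This is a hedged sketch: the feature vectors are toy values, and the names `cosine_distance` and `oddity_by_cosine` are hypothetical helpers, not functions from the paper's code release.

```python
import math


def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def oddity_by_cosine(features):
    """Return the index of the image whose mean cosine distance to all
    other images in the trial is highest."""
    n = len(features)
    mean_dist = []
    for i in range(n):
        others = [cosine_distance(features[i], features[j])
                  for j in range(n) if j != i]
        mean_dist.append(sum(others) / len(others))
    return max(range(n), key=lambda i: mean_dist[i])


# Toy features: the first two vectors are similar, the third is not.
features = [[1.0, 0.0, 0.1], [0.9, 0.1, 0.0], [0.0, 1.0, 0.2]]
choice = oddity_by_cosine(features)
```

In practice each feature vector would come from a pretrained encoder such as those named in the caption; that extraction step is outside the scope of this sketch.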