Human-like Object Grouping in Self-supervised Vision Transformers

Hossein Adeli, Seoyoung Ahn, Andrew Luo, Mengmi Zhang, Nikolaus Kriegeskorte, Gregory Zelinsky

Abstract

Vision foundation models trained with self-supervised objectives achieve strong performance across diverse tasks and exhibit emergent object segmentation properties. However, their alignment with human object perception remains poorly understood. Here, we introduce a behavioral benchmark in which participants make same/different object judgments for dot pairs on naturalistic scenes, scaling up a classical psychophysics paradigm to over 1000 trials. We test a diverse set of vision models using a simple readout from their representations to predict subjects' reaction times. We observe a steady improvement across model generations, with both architecture and training objective contributing to alignment, and transformer-based models trained with the DINO self-supervised objective showing the strongest performance. To investigate the source of this improvement, we propose a novel metric to quantify the object-centric component of representations by measuring patch similarity within and between objects. Across models, stronger object-centric structure predicts human segmentation behavior more accurately. We further show that matching the Gram matrix of supervised transformer models, capturing similarity structure across image patches, with that of a self-supervised model through distillation improves their alignment with human behavior, converging with the prior finding that Gram anchoring improves DINOv3's feature quality. Together, these results demonstrate that self-supervised vision models capture object structure in a behaviorally human-like manner, and that Gram matrix structure plays a role in driving perceptual alignment.
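
The two quantities the abstract relies on, the Gram matrix of patch features and the contrast between within-object and between-object patch similarity, can be illustrated with a small NumPy sketch. This is not the paper's metric or code: the function names, the synthetic features, and the simple "mean within minus mean between" gap are assumptions chosen only to make the idea concrete.

```python
import numpy as np


def gram_matrix(features: np.ndarray) -> np.ndarray:
    """Cosine-similarity Gram matrix for patch features of shape (n_patches, dim)."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return unit @ unit.T


def within_between_gap(gram: np.ndarray, labels: np.ndarray) -> float:
    """Mean within-object similarity minus mean between-object similarity.

    Hypothetical summary statistic, used here as a stand-in for an
    object-centricity score; `labels` gives one object id per patch.
    """
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)  # ignore self-similarity
    within = gram[same & off_diag].mean()
    between = gram[~same].mean()
    return float(within - between)


# Synthetic stand-in for model features: two "objects" of 8 patches each,
# every patch a noisy copy of its object's prototype vector.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 8)
prototypes = np.array([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]])
features = prototypes[labels] + 0.05 * rng.standard_normal((16, 4))

gram = gram_matrix(features)
gap = within_between_gap(gram, labels)
print(gram.shape, round(gap, 3))
```

With real model features, `features` would be the patch tokens of one image and `labels` would come from a segmentation annotation. A larger gap corresponds to a more pronounced block structure in the Gram matrix.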

Paper Structure (12 sections, 7 figures, 1 table)

Figures (7)

  • Figure 1: A) Example affinity maps for a few image patches, generated using feature tokens from the DINOv3 ViT-B/16 model. For a given patch, the cosine similarity between its feature vector and the feature vectors of all other patches is shown as an affinity map, with brighter values indicating stronger similarity. Patches belonging to the same object tend to have the highest affinity, reflecting object-centric structure in the representations. B) The full Gram matrix for the same image, showing pairwise feature similarities between all patches. The block-like structure visible along the diagonal reflects clusters of patches with high mutual similarity, corresponding to distinct objects in the scene.
  • Figure 2: A) Behavioral procedure. Participants maintain fixation on a center dot during the trial. A second dot appears following the center dot and remains visible for 1000 ms, after which the scene appears and the dots begin flickering to ensure their visibility. Subjects are instructed to indicate, as quickly as possible without sacrificing accuracy, whether the two dots fall on the same object or on two different objects. B) Sample trials from all four experimental conditions, coded by different colors. C) Placement of dots across all conditions and trials. D) Mean reaction time for correct trials by condition; error bars show the standard error of the mean (SEM).
  • Figure 3: Sample behavioral results for the four conditions in our experiment. The mean reaction time across subjects is displayed above.
  • Figure 4: A) Noise-normalized Spearman correlation between model-predicted and human reaction times across all models, ordered from lowest to highest. The dashed line indicates the human noise ceiling. Models trained with self-supervised DINO objectives consistently outperform supervised counterparts with the same architecture. B) Mean reaction times predicted by DINOv3 ViT-B for each experimental condition. The model reproduces the key signatures of human grouping behavior, including faster responses for same-object trials and a distance effect that is specific to the same-object condition.
  • Figure 5: A) A sample experimental trial with the central dot shown on the top left, alongside its affinity map (using DINOv3 features) showing normalized feature similarity between the central patch and all other patches (colorbar shown). For decreasing threshold values ($\theta$), patches with affinity above the threshold are shown in yellow. The TPR and FPR are displayed above each thresholded map. TPR increases substantially before FPR rises, indicating a strong object-centric signal in the affinity map. B) ROC curves averaged across all trials for each of the 12 layers of the DINOv3 ViT-B model. The legend is ordered by decreasing AUC, with deeper layers showing stronger object-centric structure. C) ROC curves for features from different attention heads of the last layer of DINOv3 ViT-B, showing broadly similar object-centricity across heads.
  • ...and 2 more figures
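
The thresholding procedure described in the Figure 5 caption can be sketched as follows: compute an affinity map for a reference patch, sweep a decreasing threshold $\theta$, and record the true-positive rate (same-object patches selected) and false-positive rate (other patches selected) at each step. This is a minimal illustration under assumptions, not the authors' implementation: the helper names, the synthetic features, the threshold grid, and the choice to exclude the reference patch from scoring are all hypothetical.

```python
import numpy as np


def affinity_map(features: np.ndarray, ref_idx: int) -> np.ndarray:
    """Cosine similarity between one reference patch and every patch."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)
    return unit @ unit[ref_idx]


def roc_points(affinity, same_object, thresholds):
    """TPR and FPR of the mask `affinity >= theta` for each threshold."""
    tpr, fpr = [], []
    for theta in thresholds:
        selected = affinity >= theta
        tpr.append((selected & same_object).sum() / same_object.sum())
        fpr.append((selected & ~same_object).sum() / (~same_object).sum())
    return np.array(fpr), np.array(tpr)


def area_under_curve(fpr: np.ndarray, tpr: np.ndarray) -> float:
    """Trapezoidal area; assumes fpr is non-decreasing."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))


# Synthetic stand-in for patch features: two objects, 8 patches each.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 8)
prototypes = np.array([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]])
features = prototypes[labels] + 0.05 * rng.standard_normal((16, 4))

ref_idx = 0
keep = np.arange(len(labels)) != ref_idx           # do not score the reference patch
affinity = affinity_map(features, ref_idx)[keep]
same_object = (labels == labels[ref_idx])[keep]

# Decreasing thresholds; the range slightly exceeds [-1, 1] so the curve
# starts at (0, 0) and ends at (1, 1).
thresholds = np.linspace(1.01, -1.01, 201)
fpr, tpr = roc_points(affinity, same_object, thresholds)
auc = area_under_curve(fpr, tpr)
print(round(auc, 3))
```

An AUC near 1 means same-object patches are selected well before any other patch as the threshold drops, which is the pattern the caption describes as TPR rising before FPR. Averaging such curves over trials, layers, or attention heads would follow the same recipe.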