
Finding Distributed Object-Centric Properties in Self-Supervised Transformers

Samyak Rawlekar, Amitabh Swain, Yujun Cai, Yiwei Wang, Ming-Hsuan Yang, Narendra Ahuja

Abstract

Self-supervised Vision Transformers (ViTs) like DINO show an emergent ability to discover objects, typically observed in the [CLS] token attention maps of the final layer. However, these maps often contain spurious activations, resulting in poor localization of objects. This is because the [CLS] token, trained on an image-level objective, summarizes the entire image instead of focusing on objects. This aggregation dilutes the object-centric information present in the local, patch-level interactions. We analyze this by computing inter-patch similarity using patch-level attention components (query, key, and value) across all layers. We find that: (1) Object-centric properties are encoded in the similarity maps derived from all three components ($q, k, v$), unlike prior work that uses only key features or the [CLS] token. (2) This object-centric information is distributed across the network, not just confined to the final layer. Based on these insights, we introduce Object-DINO, a training-free method that extracts this distributed object-centric information. Object-DINO clusters attention heads across all layers based on the similarities of their patches and automatically identifies the object-centric cluster corresponding to all objects. We demonstrate Object-DINO's effectiveness on two applications: enhancing unsupervised object discovery (+3.6 to +12.4 CorLoc gains) and mitigating object hallucination in Multimodal Large Language Models by providing visual grounding. Our results show that using this distributed object-centric information improves downstream tasks without additional training.
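
For concreteness, below is a minimal sketch (not the authors' released code) of the inter-patch similarity analysis described above. It assumes cosine similarity between a head's per-patch query, key, and value features and a simple mean for the ensemble; the feature shapes are illustrative.

```python
# Minimal sketch: inter-patch similarity maps from one attention head's Q, K, V
# patch features, assuming cosine similarity and a simple mean ensemble.
import torch
import torch.nn.functional as F

def patch_similarity(x: torch.Tensor) -> torch.Tensor:
    """x: (num_patches, dim) per-head patch features -> (num_patches, num_patches) map."""
    x = F.normalize(x, dim=-1)   # unit-normalize each patch feature
    return x @ x.T               # cosine similarity between every pair of patches

def ensemble_similarity(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Average the query-, key-, and value-derived similarity maps (A_q, A_k, A_v)."""
    return (patch_similarity(q) + patch_similarity(k) + patch_similarity(v)) / 3.0

# Toy example: random features standing in for one DINO head (196 patches, 64 dims).
q, k, v = (torch.randn(196, 64) for _ in range(3))
a_ens = ensemble_similarity(q, k, v)   # (196, 196) inter-patch similarity map
```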

Paper Structure

This paper contains 26 sections, 12 equations, 16 figures, 6 tables, 1 algorithm.

Figures (16)

  • Figure 1: Object-centric information is encoded in patch-level interactions. We visualize the inter-patch similarity maps ($A_q, A_k, A_v$) computed from the Query ($Q$), Key ($K$), and Value ($V$) representations of patch tokens. Each component captures a complementary view of object structure: Query Similarity reveals which patches seek similar information, Key Similarity shows which patches offer similar context, and Value Similarity identifies patches with similar content. The Ensemble aggregates all three components, producing foreground-background separation and precise object localization. For visualization, we invert the similarity maps so objects appear bright.
  • Figure 3: Object-DINO Overview. Our training-free algorithm identifies a distributed set of object-centric heads from a pre-trained model such as DINO. First, for every head across all layers, we compute the patch similarity maps from its query ($A_{q}$), key ($A_{k}$), and value ($A_{v}$) representations. Second, these three maps are ensembled and flattened to create a vector representing each head's localization pattern. Third, all heads are clustered based on these patterns. Fourth, guided by the established observation that object-centric information is most prevalent in the final layer [dino, tokencut, simeoni2021localizing], we automatically identify the object cluster ($c_{\text{obj}}$) using the criterion that it contains the highest prevalence of final-layer heads. Finally, aggregating the similarity maps from only the object-centric heads ($\mathcal{H}_{\text{obj}}$) produces a high-fidelity object localization, effectively filtering noise from non-object-centric heads. For visualization, we invert the maps so objects appear bright. (A code sketch of this head-selection step follows the figure list.)
  • Figure 4: Mitigating MLLM Hallucination with Object-DINO Visual Guidance. Our training-free decoding strategy computes two separate logit distributions. The standard branch uses the original image $u$ and a general prompt $T_u$ (e.g., "describe this image"). The guidance branch uses Object-DINO's object map $v$ and a prompt $T_v$ (e.g., "describe the highlighted regions"). Here, $R$ is the text generated so far. We then add these logits, $Logits(y|T_u, R, u) + \alpha \cdot Logits(y|T_v, R, v)$, to amplify tokens consistent with the visual evidence, thus correcting hallucinations (e.g., "Two" $\rightarrow$ "Three" dogs). (A sketch of this logit combination follows the figure list.)
  • Figure 5: Experimental results on the hallucination subset of the MME benchmark
  • Figure 6: Impact of Attention Components on Head Selection. CorLoc performance across three datasets (VOC2007, VOC2012, COCO20k) when using different attention components ($Q, K, V$) versus their ensemble to identify object-centric heads via Object-DINO. The performance ordering is $Q < V < K < \text{Ensemble}$ across all datasets. This demonstrates that the ensemble leads to the identification of robust object-centric heads.
  • ...and 11 more figures
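
The head-selection procedure summarized in the Figure 3 caption can be sketched as follows. This is a hypothetical illustration, not the released implementation: it assumes k-means clustering of the flattened per-head similarity maps, and the layer/head counts in the toy example are illustrative.

```python
# Hypothetical sketch of the Figure 3 head-selection step: cluster flattened
# per-head similarity maps and keep the cluster dominated by final-layer heads.
import numpy as np
from sklearn.cluster import KMeans

def select_object_heads(head_maps, head_layers, num_layers, n_clusters=3):
    """head_maps: list of (P, P) ensembled similarity maps, one per head across all layers.
    head_layers: layer index of each head. Returns indices of the object-centric heads."""
    feats = np.stack([m.reshape(-1) for m in head_maps])    # one vector per head
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(feats)
    is_final = np.asarray(head_layers) == (num_layers - 1)  # heads from the final layer
    # Object cluster = the cluster containing the most final-layer heads.
    c_obj = max(range(n_clusters), key=lambda c: int((is_final & (labels == c)).sum()))
    return np.where(labels == c_obj)[0]

# Toy example: 12 layers x 6 heads, 196 patches per map (random stand-ins).
maps = [np.random.rand(196, 196) for _ in range(12 * 6)]
layers = [i // 6 for i in range(12 * 6)]
obj_heads = select_object_heads(maps, layers, num_layers=12)
obj_map = np.mean([maps[i] for i in obj_heads], axis=0)      # aggregated localization map
```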
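
Likewise, the logit combination from the Figure 4 caption can be sketched as below. Here `next_token_logits` is a hypothetical stand-in for the MLLM's next-token logit computation; only the additive combination itself follows the caption's formula.

```python
# Minimal sketch of the guided decoding step: Logits(y|T_u, R, u) + alpha * Logits(y|T_v, R, v).
import torch

def guided_logits(next_token_logits, T_u, T_v, R, image, object_map, alpha=1.0):
    """Combine the standard branch (original image u) with the guidance branch (object map v)."""
    logits_u = next_token_logits(T_u, R, image)        # standard branch
    logits_v = next_token_logits(T_v, R, object_map)   # guidance branch
    return logits_u + alpha * logits_v

# Toy usage with a stand-in model that returns random next-token logits.
fake_model = lambda prompt, generated, img: torch.randn(32000)
combined = guided_logits(fake_model, T_u="describe this image",
                         T_v="describe the highlighted regions",
                         R="", image=None, object_map=None, alpha=1.0)
next_token_id = int(torch.argmax(combined))
```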