Improving Image Clustering with Artifacts Attenuation via Inference-Time Attention Engineering

Kazumoto Nakamura, Yuji Nozawa, Yu-Chieh Lin, Kengo Nakata, Youyang Ng

TL;DR

This paper aims to improve the performance of pretrained ViT models, particularly DINOv2, on image clustering without re-training or fine-tuning. It proposes Inference-Time Attention Engineering (ITAE), an approach that manipulates the attention function during inference.

Abstract

This paper aims to improve the performance of pretrained Vision Transformer (ViT) models, particularly DINOv2, on image clustering without re-training or fine-tuning. As model size increases, a high-norm artifact anomaly appears in the patches of multi-head attention. We observe that this anomaly reduces accuracy in zero-shot image clustering. These artifacts are characterized by disproportionately large values in the attention map compared to other patch tokens. To address them, we propose Inference-Time Attention Engineering (ITAE), which manipulates the attention function during inference. Specifically, we identify the artifacts by inspecting one of the Query-Key-Value (QKV) patches in the multi-head attention and attenuate the corresponding attention values inside the pretrained model. ITAE improves clustering accuracy on multiple datasets by producing more expressive features in latent space. Our findings highlight the potential of ITAE as a practical way to reduce artifacts in pretrained ViT models and improve clustering performance without re-training or fine-tuning.

Paper Structure

This paper contains 30 sections, 5 equations, 6 figures, and 8 tables.

Figures (6)

  • Figure 1: Overview of the ITAE approach: We manipulate the self-attention function of a pretrained vision model during inference. Specifically, we identify artifacts in the attention block of the model's final layer at inference time by computing the $L_2$ norms of the QKV patches, and attenuate the corresponding attention values. $N+1$ is the side length of the $QK^T$ matrix, including the CLS token.
  • Figure 2: The left plots show the $L_2$ norm distributions for a model pretrained without register tokens, while the right plots show the $L_2$ norm distributions, including register tokens, for a model pretrained with register tokens. The query-norm graphs show the $L_2$ norm distribution of QKV patches excluding the CLS token. The output-norm graphs show the $L_2$ norm distribution of the final output. Clear bimodality is observed in the $L_2$ norm distribution of QKV patches (model: ViT-B/14 distilled, dataset: CIFAR-100).
  • Figure 3: Histogram of attention values: The yellow distribution shows the identified artifact patches, while the blue distribution shows normal patches. The left and right plots are for a model pretrained without register tokens and a model pretrained with register tokens, respectively (model: ViT-B/14 distilled, dataset: CIFAR-100). Best viewed in color.
  • Figure 4: Attention maps: The left maps are from the original model and the right maps are from the model incorporating our proposed method. Each attention map is averaged across the attention heads in the multi-head attention module. Best viewed in color. Details and licenses for the images are provided in the supplementary material (model: ViT-B/14 distilled, dataset: MS COCO).
  • Figure 5: $\theta$-accuracy graph: Change in clustering accuracy (ACC) as $\theta$ is varied from 1.0 to 8.0 for each model (base: ViT-B/14 distilled, large: ViT-L/14 distilled, giant: ViT-g/14, dataset: CIFAR-100). Best viewed in color.
  • ...and 1 more figure
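
The clustering accuracy (ACC) reported in the captions above is conventionally computed by finding the one-to-one mapping between cluster IDs and ground-truth classes that maximizes the number of correctly assigned samples. The sketch below assumes the paper follows that convention, which this page does not confirm. The function name `clustering_accuracy` is hypothetical, and the brute-force search over permutations is only practical for a handful of classes; for datasets such as CIFAR-100 the usual choice is the Hungarian algorithm, for example `scipy.optimize.linear_sum_assignment`.

```python
from itertools import permutations

def clustering_accuracy(labels, clusters):
    """Best-mapping clustering accuracy, by brute force.

    labels:   ground-truth class per sample
    clusters: predicted cluster ID per sample
    Assumes there are no more clusters than classes.
    """
    if not labels or len(labels) != len(clusters):
        raise ValueError("labels and clusters must be non-empty and equal length")
    classes = sorted(set(labels))
    cluster_ids = sorted(set(clusters))
    if len(cluster_ids) > len(classes):
        raise ValueError("this sketch assumes no more clusters than classes")

    best_correct = 0
    # Try every injective assignment of cluster IDs to class labels.
    for assigned in permutations(classes, len(cluster_ids)):
        mapping = dict(zip(cluster_ids, assigned))
        correct = sum(1 for y, c in zip(labels, clusters) if mapping[c] == y)
        best_correct = max(best_correct, correct)
    return best_correct / len(labels)

# Toy example: cluster IDs are permuted relative to the labels,
# and the last sample is placed in the wrong cluster.
labels = [0, 0, 1, 1, 2, 2]
clusters = [1, 1, 0, 0, 2, 0]
acc = clustering_accuracy(labels, clusters)
print("ACC:", acc)
```

Because cluster IDs carry no inherent meaning, the mapping step is what makes ACC comparable across runs and models.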