Not All Attention Heads Are What You Need: Refining CLIP's Image Representation with Attention Ablation

Feng Lin, Marco Chen, Haokui Zhang, Xiaotian Yu, Guangming Lu, Rong Xiao

TL;DR

Attention Ablation Technique (AAT) reveals that CLIP's image encoder contains detrimental attention heads whose ablation can refine representations without retraining. AAT provides two inference-friendly strategies, GA and BP, to automatically identify and suppress these heads by adjusting attention weights, achieving notable improvements in cross-modal retrieval and zero-shot tasks with minimal overhead. Across MS COCO, Flickr30k, ReCoS, COCO-CN, Flickr30k-CNA, ImageNet-1k, and Cola, AAT delivers consistent gains and demonstrates strong parameter efficiency relative to PEFT baselines, while aligning with interpretability findings that some heads carry spurious or domain-biased cues. The method offers practical benefits for deploying large-scale vision-language systems on constrained hardware and data regimes, while also highlighting limitations in domain transfer and compositional reasoning. Overall, AAT advances both the efficiency and interpretability of VLM refinements by directly manipulating attention at the head level rather than updating model weights.

Abstract

This paper investigates the role of attention heads in CLIP's image encoder. Building on interpretability studies, we conduct an exhaustive analysis and find that certain heads, distributed across layers, are detrimental to the resulting representations. To mitigate their impact, we propose a simple yet effective Attention Ablation Technique (AAT) that suppresses selected heads by directly manipulating their attention weights. By incorporating two complementary strategies tailored to different application scenarios, AAT enables the systematic identification and ablation of harmful heads with minimal overhead. Experiments show that AAT consistently improves downstream performance across diverse domains, boosting recall by up to 11.1% on cross-modal retrieval benchmarks. These results highlight that AAT can effectively refine large-scale VLMs with virtually no extra inference cost, while yielding semantically meaningful patterns that align with existing interpretability findings.
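As a rough illustration of what "suppressing selected heads by directly manipulating their attention weights" could look like, the sketch below overwrites the attention matrix of each chosen head while leaving all model weights untouched. The replacement pattern used here (an identity matrix, so each token attends only to itself) is an assumption made for illustration; the abstract does not specify the exact manipulation. The names `softmax` and `ablate_heads`, and the toy 2-head, 5-token input, are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of attention logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def ablate_heads(attn, ablated):
    """Return a copy of per-head attention weights with selected heads suppressed.

    attn:    list over heads of n x n matrices whose rows sum to 1.
    ablated: set of head indices to suppress.

    ASSUMPTION: a suppressed head's weights are replaced by the identity
    matrix (each token attends only to itself). This is an illustrative
    choice, not the paper's confirmed manipulation.
    """
    out = []
    for h, weights in enumerate(attn):
        n = len(weights)
        if h in ablated:
            out.append([[1.0 if i == j else 0.0 for j in range(n)]
                        for i in range(n)])
        else:
            out.append([row[:] for row in weights])  # retained head, unchanged
    return out

# Toy input: 2 heads over 5 tokens (1 CLS + 4 IMG), built from made-up logits.
logits = [
    [[0.1 * (i + j) for j in range(5)] for i in range(5)],
    [[0.2 * (i - j) for j in range(5)] for i in range(5)],
]
attn = [[softmax(row) for row in head] for head in logits]

# Suppress head 1 and keep head 0.
refined = ablate_heads(attn, ablated={1})
```

Because only the attention weights are overwritten, a change of this kind adds no parameters and leaves every row a valid probability distribution. Which heads to place in `ablated` is the selection problem that the two strategies mentioned in the abstract are meant to solve.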

Paper Structure

This paper contains 61 sections, 3 equations, 8 figures, 19 tables.

Figures (8)

  • Figure 1: An illustration of AAT-improved CLIP for text-to-image retrieval. "A-head $i$" denotes the $i$-th head in MHA. With model weights frozen, AAT ablates selected heads of the image encoder. Cross marks denote head ablation, while check marks denote retention.
  • Figure 2: The left $5 \times 5$ matrix shows the original attention weights, while the right depicts them after manipulation. CLS and IMG denote the class token and an image token, respectively; the token sequence has a total length of 5. Darker blue indicates lower attention scores (approaching zero), while darker red indicates higher scores (approaching one).
  • Figure 3: Mean-R for text-to-image retrieval on the COCO-CN all set vs. $\beta$ values in AAT-GA, using the ViT-B-based model.
  • Figure 4: Number of ablated heads across layers in AAT for ViT-B, ViT-L, and ViT-H. ViT-B consists of 12 layers with 12 attention heads each; ViT-L has 24 layers with 16 heads per layer; and ViT-H includes 32 layers, also with 16 heads per layer.
  • Figure 5: Comparison of text-to-image retrieval among AAT-improved models, the SFT models, and the vanilla counterparts. For SFT models, mean-R is reported across increasing training epochs. Evaluation is conducted on the test sets of COCO-CN and Flickr30k-CNA for each model variant.
  • ...and 3 more figures