DeAR: Fine-Grained VLM Adaptation by Decomposing Attention Head Roles

Yiming Ma, Hongkun Yang, Lionel Z. Wang, Bin Chen, Weizhi Xian, Jianzhi Teng

TL;DR

DeAR is a framework for fine-grained VLM adaptation that decomposes attention head roles: it introduces specialized attribute tokens and a Role-Based Attention Mask mechanism to precisely control information flow, and incorporates a Task-Adaptive Fusion Strategy for inference.

Abstract

Prompt learning is a dominant paradigm for adapting pre-trained Vision-Language Models (VLMs) to downstream tasks. However, existing methods often rely on a simplistic, layer-centric view, assuming shallow layers capture general features while deep layers handle task-specific knowledge. This assumption results in uncontrolled interactions between learnable tokens and original tokens. Task-specific knowledge can then degrade the model's core generalization, creating a trade-off between task adaptation and the preservation of zero-shot generalization. To address this, we challenge the layer-centric view and propose DeAR, a framework that achieves fine-grained VLM adaptation by Decomposing Attention head Roles. We posit that the functional specialization within VLMs occurs not between layers, but at the finer-grained level of individual attention heads in the deeper layers. Based on this insight, we introduce a novel metric, Concept Entropy, to systematically classify attention heads into distinct functional roles: Attribute, Generalization, and Mixed. Guided by these roles, we introduce specialized attribute tokens and a Role-Based Attention Mask mechanism to precisely control information flow, ensuring generalization heads remain isolated from task-specific knowledge. We further incorporate a Task-Adaptive Fusion Strategy for inference. Extensive experiments on fifteen datasets show that DeAR achieves a strong balance between task adaptation and generalization, outperforming previous methods across various tasks.
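
The isolation of generalization heads described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `role_based_mask`, the token ordering (original tokens first, attribute tokens last), and the choice to leave Attribute and Mixed heads unrestricted are all assumptions made for illustration. Only the rule for generalization heads (original tokens do not attend to the inserted attribute tokens) follows the description in this summary.

```python
def role_based_mask(num_orig, num_attr, head_roles):
    """Build one boolean attention mask per head.

    mask[h][q][k] is True when query token q may attend to key token k.
    Assumed (hypothetical) token order: original tokens ([CLS] + patches)
    occupy indices [0, num_orig), attribute tokens come after them.
    """
    n = num_orig + num_attr
    masks = []
    for role in head_roles:
        # Start from a fully open mask for every head.
        mask = [[True] * n for _ in range(n)]
        if role == "generalization":
            # Block original tokens from reading the attribute tokens,
            # so this head never sees task-specific information.
            for q in range(num_orig):
                for k in range(num_orig, n):
                    mask[q][k] = False
        # Assumption: "attribute" and "mixed" heads stay unrestricted here.
        masks.append(mask)
    return masks


# Example: 3 original tokens, 2 attribute tokens, two heads.
masks = role_based_mask(3, 2, ["generalization", "attribute"])
```

In a real attention layer, the `False` entries would typically be turned into a large negative bias added to the attention scores before the softmax.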

Paper Structure (52 sections, 15 equations, 6 figures, 11 tables)

This paper contains 52 sections, 15 equations, 6 figures, 11 tables.

Figures (6)

  • Figure 1: For a Generalization Head (e.g., Layer 9 Head 0), our Role-Based Mask explicitly blocks the original CLS and Patch tokens from interacting with the inserted Attribute Tokens.
  • Figure 2: The overall framework of DeAR. At a deep layer $J$, we insert learnable attribute tokens into the frozen vision encoder and the text encoder. Information flow in these subsequent layers is precisely controlled by our Role-Based Attention Mask mechanism. For inference, the class feature (from the [CLS] token) and attribute feature first compute the logits with the text feature. These logits are then combined via learnable fusion weights to produce the final prediction.
  • Figure 3: Results of attribute-conditioned image retrieval on ImageNet. For each query image, we retrieve images using features from different learned attribute tokens.
  • Figure 4: Comparison of DeAR with previous methods on few-shot learning. Detailed results are provided in the Appendix.
  • Figure 5: Few-Shot Performance Comparison. Comparison of DeAR with previous methods across 11 datasets. DeAR consistently achieves superior performance in few-shot settings.
  • ...and 1 more figure
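
The inference step described in the Figure 2 caption (class and attribute features each produce logits against the text features, and the logits are then combined with learnable fusion weights) might look roughly like the sketch below. Everything beyond that one-sentence description is an assumption: the cosine-similarity logits with a fixed scale of 100, the softmax normalization of the fusion weights, and the helper names `branch_logits` and `fuse_logits` are hypothetical and are not taken from the paper.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _normalize(v):
    norm = math.sqrt(_dot(v, v))
    return [x / norm for x in v]


def branch_logits(image_feat, text_feats, scale=100.0):
    """Scaled cosine similarity between one image feature and each class text feature."""
    img = _normalize(image_feat)
    return [scale * _dot(img, _normalize(t)) for t in text_feats]


def fuse_logits(cls_feat, attr_feats, text_feats, fusion_weights):
    """Combine per-branch logits with softmax-normalized fusion weights (assumed form)."""
    # One branch for the [CLS] feature, one per attribute feature.
    branches = [branch_logits(cls_feat, text_feats)]
    branches += [branch_logits(a, text_feats) for a in attr_feats]

    # Softmax over the raw weights so the branch contributions sum to 1.
    m = max(fusion_weights)
    exps = [math.exp(w - m) for w in fusion_weights]
    total = sum(exps)
    weights = [e / total for e in exps]

    num_classes = len(text_feats)
    return [
        sum(w * b[c] for w, b in zip(weights, branches))
        for c in range(num_classes)
    ]


# Toy example: two classes, one attribute token, equal raw fusion weights.
fused = fuse_logits([1.0, 0.0], [[0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

In training, the raw fusion weights would be learnable parameters; here they are plain floats to keep the sketch dependency-free.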