ResCLIP: Residual Attention for Training-free Dense Vision-language Inference

Yuhang Yang, Jinhong Deng, Wen Li, Lixin Duan

TL;DR

ResCLIP combines two training-free modules: the Residual Cross-correlation Self-attention (RCS) module, which leverages the cross-correlation self-attention from intermediate layers to remold the attention in the final block of CLIP, and the Semantic Feedback Refinement (SFR) module, which utilizes semantic segmentation maps to further adjust the attention scores.

Abstract

While vision-language models like CLIP have shown remarkable success in open-vocabulary tasks, their application is currently confined to image-level tasks, and they still struggle with dense predictions. Recent works often attribute this deficiency in dense predictions to the self-attention layers in the final block, and have achieved commendable results by replacing the original query-key attention with self-correlation attention (e.g., query-query and key-key attention). However, these methods overlook the properties of cross-correlation (query-key) attention, which captures rich spatial correspondence. In this paper, we reveal that the cross-correlation of the self-attention in CLIP's non-final layers also exhibits localization properties. Therefore, we propose the Residual Cross-correlation Self-attention (RCS) module, which leverages the cross-correlation self-attention from intermediate layers to remold the attention in the final block. The RCS module effectively reorganizes spatial information, unleashing the localization potential within CLIP for dense vision-language inference. Furthermore, to enhance the focus on regions of the same categories and local consistency, we propose the Semantic Feedback Refinement (SFR) module, which utilizes semantic segmentation maps to further adjust the attention scores. By integrating these two strategies, our method, termed ResCLIP, can be easily incorporated into existing approaches as a plug-and-play module, significantly boosting their performance in dense vision-language inference. Extensive experiments across multiple standard benchmarks demonstrate that our method surpasses state-of-the-art training-free methods, validating the effectiveness of the proposed approach. Code is available at https://github.com/yvhangyang/ResCLIP.

Paper Structure

This paper contains 16 sections, 13 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: Attention visualization from different layers of the CLIP model (Radford et al., 2021). The images are sampled from the PASCAL VOC dataset (Everingham et al., 2015).
  • Figure 2: (a) Cross-correlation self-attention (C$^2$SA). The query and key are mapped by different projection matrices, and the attention is obtained by matrix multiplication between query and key. (b) Self-correlation self-attention (SCSA). The attention is calculated from a self-correlation, such as key-key or query-query. (c) Residual Cross-correlation Self-attention (RCS) and Semantic Feedback Refinement (SFR). (d) Performance comparison between our methods and baselines.
  • Figure 3: Comparison of attention maps across different versions of CLIP and ours.
  • Figure 4: Overview of our ResCLIP consisting of Residual Cross-correlation Self-attention (RCS) and Semantic Feedback Refinement (SFR). The RCS module enhances CLIP's attention mechanism by fusing C$^2$SA from non-last layers $\mathcal{A}_c$ with SCSA $\mathcal{A}_s$ to capture richer spatial information. The SFR module leverages an initial segmentation mask (black arrows) to refine attention scores. These refined attention scores $\hat{S}$ are combined with RCS to adjust the attention in the last layer of CLIP and produce the final prediction (blue arrows).
  • Figure 5: Qualitative comparison between different CLIP-based training-free segmentation methods.
  • ...and 5 more figures
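
The captions of Figures 2 and 4 distinguish query-key attention (C$^2$SA) from query-query or key-key attention (SCSA) and describe RCS as fusing the two. The NumPy sketch below illustrates that distinction on toy data. It is not the authors' implementation: the function names, the averaging over intermediate layers, the convex blend, and the weight `lam` are all assumptions made for illustration, since this summary does not state the exact fusion rule.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_correlation_attention(x, w_q, w_k):
    """C^2SA: query-key attention, softmax(Q K^T / sqrt(d))."""
    q, k = x @ w_q, x @ w_k
    return softmax(q @ k.T / np.sqrt(q.shape[-1]))

def self_correlation_attention(x, w_q, w_k):
    """SCSA: average of query-query and key-key attention."""
    q, k = x @ w_q, x @ w_k
    scale = np.sqrt(q.shape[-1])
    return 0.5 * (softmax(q @ q.T / scale) + softmax(k @ k.T / scale))

def residual_fusion(attn_c_layers, attn_s, lam=0.5):
    """Hypothetical fusion (an assumption, not the paper's formula):
    average the intermediate-layer C^2SA maps, then blend with SCSA."""
    attn_c = np.mean(attn_c_layers, axis=0)
    return (1.0 - lam) * attn_s + lam * attn_c

# Toy setup: random token features and projections stand in for CLIP layers.
rng = np.random.default_rng(0)
n_tokens, dim, n_mid_layers = 6, 8, 3

attn_c_layers = []
for _ in range(n_mid_layers):
    x_mid = rng.normal(size=(n_tokens, dim))
    w_q = rng.normal(size=(dim, dim))
    w_k = rng.normal(size=(dim, dim))
    attn_c_layers.append(cross_correlation_attention(x_mid, w_q, w_k))

x_last = rng.normal(size=(n_tokens, dim))
w_q_last = rng.normal(size=(dim, dim))
w_k_last = rng.normal(size=(dim, dim))
attn_s = self_correlation_attention(x_last, w_q_last, w_k_last)

fused = residual_fusion(attn_c_layers, attn_s, lam=0.5)
```

Because each input map is row-normalized and the blend is convex, every row of `fused` still sums to 1, so it can stand in for the final-block attention without renormalization. In this sketch the fused map would then weight the final block's value vectors in place of the original attention.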