FVG-PT: Adaptive Foreground View-Guided Prompt Tuning for Vision-Language Models

Haoyang Li, Liang Wang, Siyu Zhou, Jiacheng Sun, Jing Jiang, Chao Wang, Guodong Long, Yan Peng

TL;DR

FVG-PT introduces a learnable Foreground Reliability Gate to automatically enhance the foreground view quality, applies a Foreground Distillation Compensation module to guide visual attention toward the foreground, and introduces a Prior Calibration module to mitigate generalization degradation caused by excessive focus on the foreground.

Abstract

CLIP-based prompt tuning enables pretrained Vision-Language Models (VLMs) to efficiently adapt to downstream tasks. Although existing studies have made significant progress, they pay limited attention to changes in the internal attention representations of VLMs during the tuning process. In this paper, we attribute the failure modes of prompt tuning predictions to shifts in the foreground attention of the visual encoder, and propose Foreground View-Guided Prompt Tuning (FVG-PT), an adaptive plug-and-play foreground attention guidance module, to alleviate these shifts. Concretely, FVG-PT introduces a learnable Foreground Reliability Gate to automatically enhance the foreground view quality, applies a Foreground Distillation Compensation module to guide visual attention toward the foreground, and further introduces a Prior Calibration module to mitigate generalization degradation caused by excessive focus on the foreground. Experiments on multiple backbone models and datasets show the effectiveness and compatibility of FVG-PT. Code is available at: https://github.com/JREion/FVG-PT
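
To make the roles of the three components more concrete, the following is a minimal illustrative sketch in plain Python. It is not the authors' implementation. The function names, the single-layer sigmoid gate, the convex blend of logits, and the trust-weighted KL term are all assumptions chosen for illustration; the abstract states only that a gate is learned, that distillation guides attention toward the foreground, and that a calibration step balances the tuned model against the CLIP prior.

```python
import math


def sigmoid(z):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with full support in q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def reliability_gate(features, weights, bias):
    """Hypothetical gate: one linear layer plus a sigmoid.

    Returns a scalar score in (0, 1). Assumed form, not taken from the paper.
    """
    return sigmoid(sum(f * w for f, w in zip(features, weights)) + bias)


def trust_weighted_distillation(fg_logits, full_logits, trust):
    """Assumed distillation term: pull the full-image prediction toward
    the foreground-view prediction, scaled by the trust score."""
    return trust * kl_divergence(softmax(fg_logits), softmax(full_logits))


def gated_blend(tuned_logits, prior_logits, gate):
    """Assumed calibration: convex combination of tuned-model logits and
    frozen CLIP prior logits, controlled by a gate value in [0, 1]."""
    return [gate * t + (1.0 - gate) * p
            for t, p in zip(tuned_logits, prior_logits)]


# Toy usage with made-up numbers.
tuned = [2.0, 0.5, -1.0]   # logits from the prompt-tuned model
prior = [1.0, 1.5, 0.0]    # logits from the frozen CLIP prior
r = reliability_gate([0.2, -0.4], [1.0, 0.5], 0.0)
blended = gated_blend(tuned, prior, r)
distill = trust_weighted_distillation(tuned, prior, r)
```

In this sketch a gate value near 1 trusts the tuned model and a value near 0 falls back to the CLIP prior, which is one simple way to express the balance between adaptation and generalization that the abstract describes.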

Paper Structure

This paper contains 53 sections, 14 equations, 9 figures, 15 tables.

Figures (9)

  • Figure 1: Comparison of visual encoder attention maps generated by Grad-CAM [selvaraju2017gradcam] for the same image under (a) original CLIP [radford2021clip], (b) CoOp [zhou2022coop], and (c) our FVG-PT. In the failure cases of (a) and (b), attention deviates from the foreground view, whereas FVG-PT suppresses this shift and yields a correct prediction.
  • Figure 2: Framework of our proposed FVG-PT. As a plug-and-play method, in the (a) tuning stage, FVG-PT obtains the foreground view $x^{\mathrm{fg}}$ of image $x$ and fine-tunes the (b) Foreground Reliability Gate to learn a foreground trust score $r$. Meanwhile, the Foreground Distillation Compensation module inserts adapters after the image-text alignment of the frozen backbone model to guide visual attention toward the foreground. In parallel, an independent Prior Calibration module fine-tunes the (c) Backbone Reliability Gate on the new branch (indicated by dashed lines) to balance the tuned model and the CLIP prior.
  • Figure 3: Inference stage of FVG-PT on the (a) base branch and (b) new branch. The design of Prior Calibration (Sec. 3.4) enables full decoupling between the two branches during inference, addressing the BNT problem.
  • Figure 4: Base-to-new performance comparison across (a) visual attention-related prompt tuning methods and (b) different weights of the FDC distillation loss $\lambda_{d}$.
  • Figure 5: Trends of the foreground shift index and base-class accuracy for CLIP, fine-tuned CoOp, and our FVG-PT on the (a) Caltech101 and (b) Flowers102 datasets.
  • ...and 4 more figures