Rethinking Visual Content Refinement in Low-Shot CLIP Adaptation

Jinda Lu, Shuo Wang, Yanbin Hao, Haifeng Liu, Xiang Wang, Meng Wang

TL;DR

Visual Content Refinement (VCR) is a test-time step, applied before the adaptation calculation, that boosts the low-shot capability of Contrastive Vision-Language Pre-training (CLIP) by facilitating knowledge transfer.

Abstract

Recent adaptation methods can boost the low-shot capability of Contrastive Vision-Language Pre-training (CLIP) by effectively facilitating knowledge transfer. However, these methods usually operate on the global view of an input image, and thus suffer from a biased perception of partial local details of the image. To solve this problem, we propose Visual Content Refinement (VCR), which is applied before the adaptation calculation during the test stage. Specifically, we first decompose the test image into different scales to shift the feature extractor's attention to the details of the image. Then, we select the image view with the maximum prediction margin at each scale to filter out noisy image views, where the prediction margins are calculated from the pre-trained CLIP model. Finally, we merge the content of the selected image views based on their scales to construct a new robust representation. The merged content can thus be used directly to help the adapter focus on both global and local parts without any extra training parameters. We apply our method to 3 popular low-shot benchmark tasks with 13 datasets and achieve a significant improvement over state-of-the-art methods. For example, compared to the baseline (Tip-Adapter) on the few-shot classification task, our method achieves about 2% average improvement in both the training-free and training-need settings.
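
The three steps in the abstract (decompose, select by prediction margin, merge) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and several details are assumptions that the abstract does not specify: views are taken as non-overlapping grid crops, the prediction margin is taken as the gap between the top-1 and top-2 class probabilities, and the selected views are merged by a weighted average with hypothetical per-scale weights. The helper names (grid_views, prediction_margin, refine) are illustrative, and the view features are assumed to come from a CLIP image encoder that is not shown here.

```python
import numpy as np


def grid_views(image, n):
    """Split an (H, W, C) image into n x n non-overlapping crops.
    Assumption: the abstract does not say how views are generated."""
    h, w = image.shape[:2]
    hs, ws = h // n, w // n
    return [image[i * hs:(i + 1) * hs, j * ws:(j + 1) * ws]
            for i in range(n) for j in range(n)]


def softmax(logits, axis=-1):
    """Numerically stable softmax."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def prediction_margin(probs):
    """Assumed margin: top-1 minus top-2 class probability per view."""
    top2 = np.sort(probs, axis=-1)[..., -2:]
    return top2[..., 1] - top2[..., 0]


def refine(view_feats_per_scale, text_feats, scale_weights=None,
           temperature=100.0):
    """Select the max-margin view at each scale and merge the selections.

    view_feats_per_scale: list of (n_views, d) arrays, one per scale,
        holding L2-normalised image features of the views.
    text_feats: (n_classes, d) array of L2-normalised class text features.
    scale_weights: hypothetical per-scale merge weights (uniform if None).
    """
    selected = []
    for feats in view_feats_per_scale:
        # Zero-shot class probabilities for every view at this scale.
        probs = softmax(temperature * feats @ text_feats.T)
        # Keep only the most confident view; the rest are treated as noisy.
        best = int(np.argmax(prediction_margin(probs)))
        selected.append(feats[best])
    selected = np.stack(selected)
    if scale_weights is None:
        scale_weights = np.full(len(selected), 1.0 / len(selected))
    merged = (np.asarray(scale_weights)[:, None] * selected).sum(axis=0)
    # Re-normalise so the result can be used like any other image feature.
    return merged / np.linalg.norm(merged)
```

Under these assumptions, the refined feature returned by refine would replace the global image feature as the input to an adapter such as Tip-Adapter, which is why no additional parameters need to be trained.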

Paper Structure (23 sections, 10 equations, 5 figures, 9 tables)

Figures (5)

  • Figure 1: We use the responsive regions of images to visualize different perceived bias issues. Specifically, the responsive regions of the pre-trained CLIP model [radford2021learning] are visualized with Grad-CAM [selvaraju2017grad] on several samples from the validation set of ImageNet [deng2009imagenet]. Component Bias and Environmental Bias are two perceived bias issues; Hand-Crafted Prompt [zhang2022tip] and Learnable Prompt [zhou2022learning] are two different prompt strategies.
  • Figure 2: An overview of our visual content refinement. Given an image, we first decompose it into multiple scales, where each scale contains sufficient local views. We then refine the content at each scale, and finally construct the refined representation to boost subsequent adaptation methods.
  • Figure 3: Few-shot performance of training-free methods on 11 datasets. The average results are shown first, followed by the individual datasets ordered by name. Zoom in for detail.
  • Figure 4: Few-shot performance of training-need methods on 11 datasets. The average results are shown first, followed by the individual datasets ordered by name. Zoom in for detail.
  • Figure 5: The image regions selected from different scales by our VCR. The responsive regions of the pre-trained CLIP model [radford2021learning] are visualized with Grad-CAM [selvaraju2017grad] on samples from the validation set of ImageNet [deng2009imagenet]. "Hand-Crafted" [zhang2022tip] and "Learnable" [zhou2022learning] are two different prompt strategies, and "Component Bias" and "Environmental Bias" are two perceived bias issues.