CapS-Adapter: Caption-based MultiModal Adapter in Zero-Shot Classification

Qijie Wang, Guandu Liu, Bin Wang

TL;DR

CapS-Adapter is a training-free method that builds a caption-based support set and uses both image and caption features. It outperforms existing state-of-the-art training-free techniques and shows robust generalization.

Abstract

Recent advances in vision-language foundational models, such as CLIP, have demonstrated significant strides in zero-shot classification. However, the extensive parameterization of models like CLIP necessitates a resource-intensive fine-tuning process. In response, TIP-Adapter and SuS-X have introduced training-free methods aimed at bolstering the efficacy of downstream tasks. While these approaches incorporate support sets to maintain data distribution consistency between knowledge cache and test sets, they often fall short in terms of generalization on the test set, particularly when faced with test data exhibiting substantial distributional variations. In this work, we present CapS-Adapter, an innovative method that employs a caption-based support set, effectively harnessing both image and caption features to exceed existing state-of-the-art techniques in training-free scenarios. CapS-Adapter adeptly constructs support sets that closely mirror target distributions, utilizing instance-level distribution features extracted from multimodal large models. By leveraging CLIP's single and cross-modal strengths, CapS-Adapter enhances predictive accuracy through the use of multimodal support sets. Our method achieves outstanding zero-shot classification results across 19 benchmark datasets, improving accuracy by 2.19% over the previous leading method. Our contributions are substantiated through extensive validation on multiple benchmark datasets, demonstrating superior performance and robust generalization capabilities. Our code is made publicly available at https://github.com/WLuLi/CapS-Adapter.
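
To make the idea of a training-free adapter over a multimodal support set concrete, the following is a minimal sketch in plain Python. It is not the paper's M-Adapter: the cache term uses the exp(-beta * (1 - similarity)) affinity familiar from cache-based adapters such as TIP-Adapter, and the linear blend of image and caption similarities (`cap_weight`), the function names, and the default values of `alpha` and `beta` are all assumptions made for illustration. Features are toy 2-D vectors standing in for CLIP embeddings.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm so dot products are cosine similarities."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def adapter_logits(query, class_text_feats, support_img_feats,
                   support_cap_feats, support_labels,
                   alpha=1.0, beta=5.5, cap_weight=0.5):
    """Zero-shot logits plus a cache term built from a support set.

    alpha, beta, and cap_weight are hypothetical hyperparameters,
    not values taken from the paper.
    """
    # Zero-shot term: similarity of the test image to each class text.
    logits = [cosine(query, t) for t in class_text_feats]
    # Cache term: each support item votes for its own label, weighted by
    # how close the test image is to the item's image and caption features.
    for img, cap, label in zip(support_img_feats, support_cap_feats,
                               support_labels):
        sim = (1.0 - cap_weight) * cosine(query, img) \
            + cap_weight * cosine(query, cap)
        logits[label] += alpha * math.exp(-beta * (1.0 - sim))
    return logits

# Toy data: two classes, one support item per class.
class_text = [[1.0, 0.0], [0.0, 1.0]]
support_img = [[0.9, 0.1], [0.1, 0.9]]
support_cap = [[1.0, 0.0], [0.0, 1.0]]
support_lbl = [0, 1]
query = [0.8, 0.2]

logits = adapter_logits(query, class_text, support_img, support_cap, support_lbl)
predicted = max(range(len(logits)), key=lambda i: logits[i])
zero_shot_only = adapter_logits(query, class_text, support_img, support_cap,
                                support_lbl, alpha=0.0)
```

Setting `alpha=0.0` switches the cache term off and leaves only the zero-shot logits, while `cap_weight=0.0` reduces the sketch to an image-only cache.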

Paper Structure

This paper contains 34 sections, 13 equations, 6 figures, 16 tables.

Figures (6)

  • Figure 1: Radar chart. One line represents our method, CapS-Adapter, which demonstrates superior performance on 19 datasets.
  • Figure 2: CapS-Adapter workflow. (a) CapS construction: image captions and category text serve as prompts for a text-to-image model, which generates diverse images; these images and their captions together form CapS. (b) Zero-shot inference with M-Adapter, which uses the image and caption features from CapS to generate predictions. (c) Details of M-Adapter, which integrates caption, category-text, and image features to compute the similarity between test images and categories.
  • Figure 3: Data sampled from the target distribution and support-set images from SuS-SD-CuPL, SuS-SD-Photo, and CapS. Image samples from CapS are more diverse and closer to the target distribution, showing a variety of apple pie shapes and both dynamic and static images of arctic terns.
  • Figure 4: Data distribution comparison. Visualized image features of samples from the target distribution, the support sets generated by SuS-SD-CuPL and SuS-SD-Photo, and the image part of CapS. Features from CapS are notably closer to the target distribution and more diverse.
  • Figure 5: Accuracy changes as the number of images in the support set increases.
  • ...and 1 more figure
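
The support-set construction described for Figure 2(a) can be sketched as a small data pipeline: captions and category text are combined into prompts, a text-to-image model turns each prompt into an image, and each image is stored with its caption. Everything below is a hypothetical illustration. The `SupportItem` record, the `build_support_set` function, and the prompt template are assumptions, and the generator is a stub callable rather than a real text-to-image model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class SupportItem:
    label: int      # class index of the category
    caption: str    # caption used to build the prompt
    prompt: str     # text sent to the text-to-image model
    image: object   # whatever the generator returns

def build_support_set(captions_by_class: Dict[str, List[str]],
                      generate: Callable[[str], object]) -> List[SupportItem]:
    """Pair each caption with an image generated from a caption-based prompt."""
    items = []
    for label, (category, captions) in enumerate(captions_by_class.items()):
        for caption in captions:
            # Hypothetical prompt template: category text followed by caption.
            prompt = f"{category}: {caption}"
            items.append(SupportItem(label, caption, prompt, generate(prompt)))
    return items

# Stub generator standing in for a text-to-image model.
def fake_generator(prompt: str) -> str:
    return f"<image for '{prompt}'>"

captions = {
    "apple pie": ["a slice on a white plate", "a whole pie with a lattice crust"],
    "arctic tern": ["a bird in flight over the sea"],
}
support_set = build_support_set(captions, fake_generator)
```

Passing the generator as a callable keeps the sketch independent of any particular text-to-image library; a real pipeline would replace `fake_generator` with a call into its chosen model.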