Cross-Modal Adapter: Parameter-Efficient Transfer Learning Approach for Vision-Language Models

Juncheng Yang, Zuchao Li, Shuai Xie, Weiping Zhu, Wei Yu, Shijun Li

TL;DR

This work tackles the challenge of efficiently adapting large vision-language models to downstream tasks under limited data. It introduces XMAdapter, a cross-modal cache-based, parameter-efficient transfer learning framework that builds separate image and text caches and leverages cross-modal retrieval signals for inference. Key innovations include adaptive fusion of image and text affinities, a cross-modal cache construction via MetaNet and Img2TxtNet, and online hard-example mining to emphasize difficult samples. Empirical results across 15 benchmarks demonstrate improved accuracy, generalization, and efficiency compared with prior adapters, highlighting the practicality of retrieval-based cross-modal adaptation for VLMs.

Abstract

Adapter-based parameter-efficient transfer learning has achieved exciting results in vision-language models. Traditional adapter methods often require training or fine-tuning and therefore struggle when samples or compute resources are limited. While some methods avoid training by building a cache over the image modality and retrieving from it, they overlook the importance of the text modality and of cross-modal cues for parameter-efficient adaptation of vision-language models. This work introduces a cross-modal parameter-efficient approach named XMAdapter. XMAdapter establishes cache models for both the text and image modalities. It then retrieves over this bimodal vision-language information to gather clues for inference. By dynamically adjusting the affinity ratio, it achieves cross-modal fusion and decouples the similarities of the two modalities to assess their respective contributions. Additionally, it mines hard samples based on differences in cross-modal affinity and improves model performance by adaptively adjusting how strongly each sample is learned. Extensive experiments on benchmark datasets demonstrate that XMAdapter significantly outperforms previous adapter-based methods in accuracy, generalization, and efficiency.
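The abstract describes the pipeline only in words, so the following is a minimal NumPy sketch of how a two-cache lookup with a fusion ratio might look. It is not the authors' implementation. The affinity form exp(-beta * (1 - cosine similarity)) is borrowed from Tip-Adapter-style caches, and the convex combination with `gamma`, the hard-sample weighting rule, the residual weight `alpha`, and every function and parameter name here are assumptions made for illustration. In particular, `txt_query` stands for a text-side query feature that is assumed to be already available; how XMAdapter actually produces it is not specified in this section.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def cache_affinity(query, keys, beta=5.5):
    """Tip-Adapter-style affinity between queries and cached keys.

    query: (n, d) features, keys: (m, d) cached features.
    Returns an (n, m) matrix with values in (0, 1]; beta is a sharpness
    hyperparameter (the default here is an arbitrary illustrative value).
    """
    sim = l2_normalize(query) @ l2_normalize(keys).T
    return np.exp(-beta * (1.0 - sim))


def fused_cache_logits(img_query, txt_query, img_keys, txt_keys, values, gamma=0.5):
    """Fuse image-cache and text-cache affinities, then read out class scores.

    values: (m, c) one-hot labels of the cached samples.
    gamma:  assumed fusion ratio in [0, 1]; 1.0 falls back to an image-only cache.
    """
    a_img = cache_affinity(img_query, img_keys)
    a_txt = cache_affinity(txt_query, txt_keys)
    fused = gamma * a_img + (1.0 - gamma) * a_txt  # assumed convex combination
    return fused @ values, a_img, a_txt


def hard_sample_weights(a_img, a_txt):
    """Hypothetical per-sample weights in [1, 2].

    Samples whose image and text affinities disagree more are treated as
    harder and receive a larger weight.
    """
    gap = np.abs(a_img - a_txt).mean(axis=1)
    return 1.0 + gap / (gap.max() + 1e-12)


def combine_with_zero_shot(cache_logits, zero_shot_logits, alpha=1.0):
    """Add cache evidence to the original VLM's zero-shot logits (residual form)."""
    return alpha * cache_logits + zero_shot_logits
```

In this sketch, setting `gamma=1.0` recovers an image-only cache, which makes it easy to compare the fused lookup against a single-modality baseline on the same features.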

Paper Structure (20 sections, 6 equations, 3 figures, 5 tables)

Figures (3)

  • Figure 1: XMAdapter and competing methods: CLIP-Adapter (Gao et al., 2023) and Tip-Adapter (Zhang et al., 2022). XMAdapter learns.
  • Figure 2: Illustration of the proposed XMAdapter. The reddish-orange line depicts the flow of image features, while the pea-green line represents the flow of text features. The model first establishes a cache model with key-value pairs. It then improves robustness by adaptively adjusting the fusion ratio between image and text affinities and by applying a hard-sample learning strategy. Finally, the model incorporates the knowledge of the original VLM to improve prediction accuracy.
  • Figure 3: Performance comparison of XMAdapter with the SOTA method on Cross Label Generalization, in the 1-, 2-, 4-, 8-, and 16-shot settings on 11 benchmark datasets.