Meta-Adapter: An Online Few-shot Learner for Vision-Language Model

Cheng Cheng, Lin Song, Ruoyi Xue, Hang Wang, Hongbin Sun, Yixiao Ge, Ying Shan

TL;DR

This work addresses efficient online few-shot adaptation for vision-language models by introducing Meta-Adapter, a lightweight residual-style module that refines CLIP category embeddings using a small set of few-shot examples without offline fine-tuning. It replaces hand-crafted modulation in prior online methods with a learnable gated multi-head attention mechanism to fuse few-shot knowledge into textual features, enabling rapid Web-scale open-vocabulary perception. Empirical results show improved generalization across cross-category, cross-dataset, and cross-task settings, outperforming Tip-Adapter and, in some cases, offline methods, while maintaining high inference speed. The approach is plug-and-play for downstream CLIP-based tasks, including open-vocabulary object detection, underscoring its practical impact for robust, scalable vision-language reasoning.

Abstract

Contrastive vision-language pre-training, known as CLIP, demonstrates remarkable potential in perceiving open-world visual concepts, enabling effective zero-shot image recognition. Nevertheless, few-shot learning methods based on CLIP typically require offline fine-tuning of the parameters on few-shot samples, resulting in longer inference time and the risk of over-fitting in certain domains. To tackle these challenges, we propose Meta-Adapter, a lightweight residual-style adapter that refines the CLIP features guided by the few-shot samples in an online manner. With a few training samples, our method can enable effective few-shot learning capabilities and generalize to unseen data or tasks without additional fine-tuning, achieving competitive performance and high efficiency. Without bells and whistles, our approach outperforms the state-of-the-art online few-shot learning method by an average of 3.6% on eight image classification datasets with higher inference speed. Furthermore, our model is simple and flexible, serving as a plug-and-play module directly applicable to downstream tasks. Without further fine-tuning, Meta-Adapter obtains notable performance improvements in open-vocabulary object detection and segmentation tasks.
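
To make the idea of a gated, residual-style refinement more concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes that each category's text embedding acts as the attention query, that the few-shot image embeddings of the same category act as keys and values, and that a gate scales the attended result before it is added back to the text embedding. The function names, the projection matrices `w_q` and `w_k`, the use of unprojected support features as values, and the scalar form of `gate` are all hypothetical choices made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def gated_residual_refine(text_emb, support_emb, w_q, w_k, gate, num_heads=4):
    """Refine category embeddings with few-shot image features (illustrative).

    text_emb:    (C, D)    one text embedding per category
    support_emb: (C, K, D) K few-shot image embeddings per category
    w_q, w_k:    (D, D)    hypothetical query/key projections
    gate:        scalar or (C, 1) array scaling the residual (assumed form)
    """
    C, D = text_emb.shape
    K = support_emb.shape[1]
    hd = D // num_heads  # per-head width; D must be divisible by num_heads

    q = (text_emb @ w_q).reshape(C, num_heads, hd)
    k = (support_emb @ w_k).reshape(C, K, num_heads, hd)
    v = support_emb.reshape(C, K, num_heads, hd)  # assumption: values left unprojected

    # Each category attends only over its own K support samples.
    scores = np.einsum("chd,ckhd->chk", q, k) / np.sqrt(hd)
    attn = softmax(scores, axis=-1)
    fused = np.einsum("chk,ckhd->chd", attn, v).reshape(C, D)

    # Residual connection: gate = 0 leaves the zero-shot text embedding unchanged.
    return l2_normalize(text_emb + gate * fused)


def classify(image_emb, class_emb):
    # Cosine-similarity logits between image features (N, D) and class embeddings (C, D).
    return l2_normalize(image_emb) @ class_emb.T
```

Under these assumptions, setting `gate` to zero reduces the refined classifier to the plain zero-shot CLIP classifier, which is one way to read the "residual-style" description: the few-shot samples only add a correction on top of the text embedding, and no per-task fine-tuning step is needed at inference time.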

Paper Structure (17 sections, 6 equations, 3 figures, 8 tables)

Figures (3)

  • Figure 1: Comparison of different few-shot learning techniques. The models are trained on the set of base classes and evaluated on the novel classes. The time is measured on a Tesla V100 GPU.
  • Figure 2: Diagram of the proposed Meta-Adapter, which employs a learnable network to refine the category embeddings guided by few-shot images.
  • Figure 3: Relative accuracy improvements of Tip-Adapter and Meta-Adapter in cross-dataset generalization experiments.