AMU-Tuning: Effective Logit Bias for CLIP-based Few-shot Learning

Yuwei Tang, Zhenyi Lin, Qilong Wang, Pengfei Zhu, Qinghua Hu

TL;DR

This work tackles the variability and inefficiency of CLIP-based few-shot learning by reframing it through logit bias. It introduces AMU-Tuning, which learns an effective logit bias using complementary auxiliary features, a feature-initialized linear-probing (LP) predictor trained with multiple branches, and an uncertainty-driven fusion that adaptively merges zero-shot CLIP with the learned bias. Through an analysis of the logit bias components and experiments across eleven downstream tasks and multiple OOD benchmarks, AMU-Tuning achieves state-of-the-art performance with improved efficiency. The approach provides a practical pathway to harness large vision-language models for few-shot tasks, with implications for rapid, scalable adaptation in vision-language systems.

Abstract

Recently, pre-trained vision-language models (e.g., CLIP) have shown great potential in few-shot learning and attracted considerable research interest. Although efforts have been made to improve the few-shot ability of CLIP, the key factors behind the effectiveness of existing methods have not been well studied, limiting further exploration of CLIP's potential in few-shot learning. In this paper, we first introduce a unified formulation to analyze CLIP-based few-shot learning methods from the perspective of logit bias, which encourages us to learn an effective logit bias to further improve the performance of CLIP-based few-shot learning methods. To this end, we disassemble three key components involved in the computation of logit bias (i.e., logit features, logit predictor, and logit fusion) and empirically analyze their effect on the performance of few-shot classification. Based on this analysis of the key components, this paper proposes a novel AMU-Tuning method to learn an effective logit bias for CLIP-based few-shot classification. Specifically, our AMU-Tuning predicts logit bias by exploiting appropriate $\underline{\textbf{A}}$uxiliary features, which are fed into an efficient feature-initialized linear classifier with $\underline{\textbf{M}}$ulti-branch training. Finally, an $\underline{\textbf{U}}$ncertainty-based fusion is developed to incorporate logit bias into CLIP for few-shot classification. Experiments are conducted on several widely used benchmarks, and the results show that AMU-Tuning clearly outperforms its counterparts while achieving state-of-the-art performance in CLIP-based few-shot learning without bells and whistles.
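
The logit-bias view described in the abstract can be illustrated with a small sketch: zero-shot CLIP logits are kept fixed, a linear predictor on auxiliary features produces a bias, and the bias is added with a weight that grows when zero-shot CLIP is uncertain. Everything specific here is an assumption made for illustration and is not taken from the paper: the function names, the scaling factor `beta` as used below, and in particular the uncertainty score (one minus the top-2 softmax margin), which stands in for whatever confidence measure the authors actually use.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear_bias(aux_feature, weights):
    """Logit predictor: a linear map from auxiliary features to class logits.

    `weights` holds one row per class (hypothetical layout).
    """
    return [sum(w * f for w, f in zip(row, aux_feature)) for row in weights]


def zero_shot_uncertainty(zs_logits):
    """Illustrative uncertainty score in [0, 1] (an assumption, not the paper's formula).

    It is 1 minus the gap between the two largest softmax probabilities,
    so it is close to 0 when zero-shot CLIP is confident and 1 when the
    top two classes are tied.
    """
    probs = sorted(softmax(zs_logits), reverse=True)
    return 1.0 - (probs[0] - probs[1])


def fuse_logits(zs_logits, bias_logits, beta=1.0):
    """Logit fusion: zero-shot logits plus an uncertainty-weighted logit bias."""
    kappa = zero_shot_uncertainty(zs_logits)
    return [z + beta * kappa * b for z, b in zip(zs_logits, bias_logits)]


if __name__ == "__main__":
    # Toy 3-class example with a 2-dimensional auxiliary feature.
    weights = [[0.0, 0.0], [1.0, 1.0], [0.0, 0.0]]
    bias = linear_bias([1.0, 1.0], weights)

    confident = fuse_logits([10.0, 0.0, 0.0], bias)  # bias is almost ignored
    uncertain = fuse_logits([1.0, 1.0, 0.0], bias)   # bias breaks the tie
    print(confident, uncertain)
```

In this toy setup the bias barely changes the prediction when the zero-shot logits are sharply peaked, and it decides the outcome when the top two zero-shot classes are tied, which is the qualitative behavior an uncertainty-based fusion is meant to provide.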

Paper Structure (26 sections, 14 equations, 10 figures, 9 tables)

Figures (10)

  • Figure 1: Comparison of the existing CLIP-based few-shot learning methods in terms of architecture design.
  • Figure 2: Overview of our proposed AMU-Tuning method for CLIP-based few-shot classification. Specifically, our AMU-Tuning exploits complementary Auxiliary features to compute the logit bias. Then, an efficient feature-initialized LP with Multi-branch training is presented to improve the performance of the logit predictor by better exploring the auxiliary features. Finally, we develop an Uncertainty-based fusion that considers the prediction confidence of zero-shot CLIP and adaptively incorporates the logit bias into CLIP for few-shot classification.
  • Figure 3: Results of different logit predictors on ImageNet-1K.
  • Figure 4: (a) Results of Tip-Adapter-F and CaFo with various $\beta$ on ImageNet-1K and OxfordPets. (b) Visualization of the distribution of max logits for zero-shot CLIP on ImageNet-1K.
  • Figure 5: Comparison (in %) of different SOTA methods under various few-shot settings on ten downstream tasks.
  • ...and 5 more figures