Decoupling Template Bias in CLIP: Harnessing Empty Prompts for Enhanced Few-Shot Learning

Zhenyu Zhang, Guangyao Chen, Yixiong Zou, Zhimeng Huang, Yuhua Li

TL;DR

This work identifies template-sample similarity (TSS) as a source of bias in CLIP that can mislead few-shot classification. It introduces empty prompts to capture unbiased template features and proposes a two-stage framework: a pre-training stage in which empty prompts reveal and reduce template-induced bias in the CLIP encoder, followed by few-shot fine-tuning with a bias calibration loss. Across 11 datasets, the approach reduces template-induced performance fluctuations and improves accuracy, demonstrating stronger robustness than existing prompt- and adapter-based methods. The results highlight the importance of debiasing template structure in vision-language models for reliable few-shot learning.

Abstract

The Contrastive Language-Image Pre-Training (CLIP) model excels in few-shot learning by aligning visual and textual representations. Our study shows that template-sample similarity (TSS), defined as the resemblance between a text template and an image sample, introduces bias. This bias leads the model to rely on template proximity rather than true sample-to-category alignment, reducing both accuracy and robustness in classification. We present a framework that uses empty prompts, textual inputs that convey the idea of "emptiness" without category information. These prompts capture unbiased template features and offset TSS bias. The framework employs two stages. During pre-training, empty prompts reveal and reduce template-induced bias within the CLIP encoder. During few-shot fine-tuning, a bias calibration loss enforces correct alignment between images and their categories, ensuring the model focuses on relevant visual cues. Experiments across multiple benchmarks demonstrate that our template correction method significantly reduces performance fluctuations caused by TSS, yielding higher classification accuracy and stronger robustness. The repository of this project is available at https://github.com/zhenyuZ-HUST/Decoupling-Template-Bias-in-CLIP.
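To make the TSS definition above concrete, here is a minimal sketch that measures it as a cosine similarity between an averaged empty-prompt embedding and an image embedding. Everything in it is an assumption for illustration: the toy 3-dimensional vectors stand in for real CLIP encoder outputs, the class names and the helper names (`cosine`, `mean_vector`) are hypothetical, and the final "margin" is only a diagnostic comparison, not the paper's calibration loss, which this section does not specify.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# Toy embeddings (made up for illustration, NOT real CLIP features).
image_emb = [0.9, 0.1, 0.4]

# Assumed: embeddings of several empty prompts, i.e. template text
# that carries no category information.
empty_prompt_embs = [[0.2, 0.1, 0.9], [0.1, 0.2, 1.0]]

# Assumed: embeddings of the same template filled with a class name.
class_prompt_embs = {
    "forest": [0.8, 0.1, 0.6],
    "river": [0.1, 0.9, 0.6],
}

# Template-sample similarity: how close the image is to the template alone.
template_emb = mean_vector(empty_prompt_embs)
tss = cosine(template_emb, image_emb)

# Ordinary CLIP-style class scores for the same image.
class_scores = {
    name: cosine(emb, image_emb) for name, emb in class_prompt_embs.items()
}

# Diagnostic only: how far each class score sits above the template-only score.
margins = {name: score - tss for name, score in class_scores.items()}

print("TSS:", round(tss, 3))
print("class scores:", {k: round(v, 3) for k, v in class_scores.items()})
print("margins over template:", {k: round(v, 3) for k, v in margins.items()})
```

Subtracting one shared TSS value from every class score leaves the ranking unchanged, so this sketch only exposes how large the template-only similarity is relative to the class scores. Reducing the bias itself is the job of the two training stages described in the abstract.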

Paper Structure

This paper contains 26 sections, 7 equations, 6 figures, 5 tables, 1 algorithm.

Figures (6)

  • Figure 1: Impact of template bias on CLIP classification. (a) A biased prompt (“a photo of a ”) introduces extra template–sample similarity that skews attention toward irrelevant regions and inflates incorrect class scores, even when the class name matches the object. (b) An unbiased prompt removes template bias, so the model focuses on genuinely discriminative features and its prediction relies solely on class–sample similarity and correct prompt–sample similarity.
  • Figure 2: (Left) Correlation between template-sample similarity and classification accuracy on EuroSAT (correlation coefficients are included in the legend). (Middle) The evolution of the absolute value of the correlation coefficient between classification accuracy and template-sample similarity over the course of 1-shot training on EuroSAT. (Right) The effect of different templates on overall performance, shown both before and after applying template correction on EuroSAT.
  • Figure 3: The overall framework for template correction involves three main stages. (a) Empty Prompts Generation: A diverse set of empty prompts is manually curated to help identify potential template-induced biases. (b) Pretraining Initialization: These empty prompts are used to detect and correct biases within the CLIP model during the pretraining phase. (c) Few-shot Fine-tuning Calibration: Finally, the model undergoes fine-tuning with few-shot samples to calibrate its performance and improve classification accuracy.
  • Figure 4: The relationship between template-sample similarity and classification accuracy under different training methods. Result for 4-shot finetuning on EuroSAT, seed 1.
  • Figure 5: Impact of empty prompt count on performance on EuroSAT.
  • ...and 1 more figure
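
The correlation analysis described in the Figure 2 caption can be sketched as a Pearson coefficient between per-template TSS values and per-template accuracies. This is a hedged illustration only: the function name `pearson` is hypothetical, the paper may use a different correlation measure, and the numbers below are placeholders chosen to exercise the function, not measurements from EuroSAT or any other dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


# Placeholder values, one entry per template (NOT results from the paper).
tss_per_template = [0.10, 0.20, 0.30, 0.40]
accuracy_per_template = [0.70, 0.65, 0.60, 0.55]

r = pearson(tss_per_template, accuracy_per_template)
# The Figure 2 caption tracks the absolute value of the coefficient.
print("abs(correlation):", round(abs(r), 3))
```

Under this reading, an absolute coefficient near 1 would indicate that accuracy moves closely with TSS across templates, while a value near 0 would indicate that accuracy is largely insensitive to the choice of template.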