What Makes Good Few-shot Examples for Vision-Language Models?

Zhaojun Guo; Jinghui Lu; Xuejing Liu; Rui Zhao; ZhenXing Qian; Fei Tan

TL;DR

This paper shows that few-shot learning outcomes for vision-language (VL) models depend heavily on which training examples are chosen, often more than on the prompting strategy. It evaluates standard Active Learning (AL) methods (Entropy, Margin of Confidence) and finds them largely ineffective in VL few-shot settings, then proposes two data-selection strategies: Gaussian Monte Carlo (Montecarlo) and Representativeness (REPRE). Across CoOp, MaPLe, and Linear Probe on 11 diverse datasets, these selectors consistently improve on random sampling and the AL baselines, with REPRE performing best in several configurations. The work also finds that dataset characteristics, such as generality, influence how effective Montecarlo is, and it offers practical guidance for sample-efficient VL fine-tuning and robust prompt-learning designs.

Abstract

Despite the notable advances achieved by few-shot tuning of pre-trained vision-language (VL) models for downstream tasks, our detailed empirical study shows that few-shot learning outcomes depend significantly on the selection of training examples, a facet that previous research has overlooked. In this study, we investigate more effective strategies for selecting few-shot training examples, as opposed to relying on random sampling, to enhance existing few-shot prompt learning methods. To this end, we assess the effectiveness of various Active Learning (AL) techniques for instance selection, such as Entropy and Margin of Confidence, in the few-shot training setting. Furthermore, we introduce two selection methods, Representativeness (REPRE) and Gaussian Monte Carlo (Montecarlo), designed to proactively identify informative examples for labeling with respect to pre-trained VL models. Our findings demonstrate that both REPRE and Montecarlo significantly surpass random selection and AL-based strategies in few-shot training scenarios. The research also underscores that these instance selection methods are model-agnostic, offering a versatile enhancement to a wide range of few-shot training methods.
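For readers unfamiliar with the two AL baselines named above, the following is a minimal sketch of the textbook definitions of Entropy and Margin of Confidence scoring. It is not the authors' implementation: the function names, the `pool_logits` input (one list of class logits per unlabeled image, for example from a zero-shot VL model), and the top-`k` selection rule are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Convert raw class logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_score(probs):
    """Shannon entropy of the predictive distribution; higher means more uncertain."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def margin_score(probs):
    """Gap between the two most likely classes; smaller means more uncertain."""
    top = sorted(probs, reverse=True)
    return top[0] - top[1]


def select_most_uncertain(pool_logits, k, method="entropy"):
    """Return indices of the k most uncertain pool items (hypothetical helper)."""
    scored = []
    for idx, logits in enumerate(pool_logits):
        probs = softmax(logits)
        if method == "entropy":
            uncertainty = entropy_score(probs)
        elif method == "margin":
            # Negate so that a small margin ranks as high uncertainty.
            uncertainty = -margin_score(probs)
        else:
            raise ValueError(f"unknown method: {method}")
        scored.append((uncertainty, idx))
    scored.sort(key=lambda t: t[0], reverse=True)
    return [idx for _, idx in scored[:k]]


# Toy pool of three images with three candidate classes each (made-up logits).
pool = [[5.0, 0.0, 0.0], [1.0, 1.0, 1.0], [2.0, 1.9, -3.0]]
print(select_most_uncertain(pool, 1, method="entropy"))
print(select_most_uncertain(pool, 2, method="margin"))
```

Both scores rank examples only by the model's own uncertainty, which is the kind of criterion the abstract reports as being outperformed by REPRE and Montecarlo in few-shot settings.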

Paper Structure (28 sections, 4 equations, 4 figures, 2 tables)

Introduction
Related work
Vision-Language Models
Instance Selection
Natural Language Processing
Computer Vision
Methods
Problem Formulation
Entropy and Margin of Confidence
Gaussian Monte Carlo (Montecarlo)
Representativeness
Experiment
Datasets
Setting
Few-shot setting
...and 13 more sections

Figures (4)

Figure 1: Illustration of how the Montecarlo score is computed; the image with the larger score is selected.
Figure 2: Main results of few-shot learning on the 11 datasets with CoOp. The results showcase performance variability across different domains and illustrate the method's generality and adaptability.
Figure 3: Main results of few-shot learning on the 11 datasets with MaPLe. The visualizations depict performance across various domains, highlighting the effectiveness and adaptability of the MaPLe method.
Figure 4: Main results of few-shot learning on the 11 datasets using the Linear Probe approach. The results display the variability in performance across different datasets, illustrating the challenges and potential of Linear Probe in few-shot settings.