Making Large Vision Language Models to be Good Few-shot Learners

Fan Liu; Wenwen Cai; Jian Huo; Chuanyi Zhang; Delong Chen; Jun Zhou

TL;DR

The paper tackles few-shot classification (FSC) with large vision-language models (LVLMs), identifying key limitations such as insufficient learning from support data and positional bias. It introduces a meta-learning-based instruction fine-tuning framework with label augmentation via character perturbation, plus an adaptive attribute description generator that drives candidate selection at inference, yielding a pipeline that emphasizes support information. In experiments on eight FSC benchmarks, the approach achieves state-of-the-art performance in both general and fine-grained settings, and the candidate selection strategy also improves training-free LVLMs. The findings suggest that LVLMs can be adapted to FSC tasks with a single round of fine-tuning plus semantic augmentation, enabling practical, scalable few-shot vision-language reasoning.

Abstract

Few-shot classification (FSC) is a fundamental yet challenging task in computer vision that involves recognizing novel classes from limited data. While previous methods have focused on enhancing visual features or incorporating additional modalities, Large Vision Language Models (LVLMs) offer a promising alternative due to their rich knowledge and strong visual perception. However, LVLMs risk learning specific response formats rather than effectively extracting useful information from support data in FSC tasks. In this paper, we investigate LVLMs' performance in FSC and identify key issues such as insufficient learning and severe positional biases. To tackle these challenges, we adopt a meta-learning strategy that teaches models to "learn to learn". By constructing a rich set of meta-tasks for instruction fine-tuning, we improve the LVLM's ability to extract information from few-shot support data for classification. We further boost the LVLM's few-shot learning capabilities through label augmentation in the fine-tuning stage and candidate selection in the inference stage. Label augmentation is implemented via a character perturbation strategy to ensure the model focuses on support information. Candidate selection leverages attribute descriptions to filter out unreliable candidates and simplify the task. Extensive experiments demonstrate that our approach achieves superior performance on both general and fine-grained datasets. Furthermore, our candidate selection strategy also proves beneficial for training-free LVLMs.
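The meta-task construction and label augmentation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the three edit operations (swap, drop, substitute), and the episode layout are all assumptions made for illustration. The idea it shows is that when class names are perturbed at the character level, the model cannot rely on prior knowledge of the name and has to read the mapping off the support examples.

```python
import random
from collections import defaultdict


def perturb_label(label, rng, n_edits=1):
    """Apply random character-level edits to a label (hypothetical scheme)."""
    chars = list(label)
    for _ in range(n_edits):
        if len(chars) < 2:
            break  # nothing sensible to perturb
        i = rng.randrange(len(chars) - 1)
        op = rng.choice(["swap", "drop", "substitute"])
        if op == "swap":
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
        elif op == "drop":
            del chars[i]
        else:
            chars[i] = rng.choice("abcdefghijklmnopqrstuvwxyz")
    return "".join(chars)


def sample_meta_task(dataset, n_way, k_shot, rng, perturb=True):
    """Sample one N-way K-shot episode from (image_id, label) pairs."""
    by_class = defaultdict(list)
    for image_id, label in dataset:
        by_class[label].append(image_id)
    classes = rng.sample(sorted(by_class), n_way)
    # The same perturbed name is shown in the support set and used as the
    # answer, so it can only be recovered from the support examples.
    shown = {c: perturb_label(c, rng) if perturb else c for c in classes}
    support, queries = [], []
    for c in classes:
        picked = rng.sample(by_class[c], k_shot + 1)
        support += [(img, shown[c]) for img in picked[:k_shot]]
        queries.append((picked[k_shot], shown[c]))
    rng.shuffle(support)
    query_image, answer = rng.choice(queries)
    options = [shown[c] for c in classes]
    rng.shuffle(options)  # vary the position of the gold answer
    return {"support": support, "query": query_image,
            "options": options, "answer": answer}
```

Each returned dictionary would then be rendered into an instruction prompt for fine-tuning. Shuffling the option order per episode is one simple way to keep the gold answer from sitting in a fixed position, which relates to the positional bias the paper reports.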

Paper Structure (22 sections, 4 equations, 6 figures, 4 tables)

Introduction
Related work
Few-Shot Learning
LVLM Instruction Tuning
Method
Problem Definition
Instruction Tuning with Label Augmentation
Attribute Description Generation
Attribute-Based Candidate Selection
Experiment
Implementation Details
Datasets
Architecture and Training Details
Evaluation Protocol
Comparison with the State-of-the-Art
...and 7 more sections

Figures (6)

Figure 1: The challenges of FSL. Typical models often suffer from poor generalization, leading to incorrect classifications. Directly applying LVLMs to FSL also runs into positional bias: models favor the first option they encounter.
Figure 2: Overview of our approach. We construct meta-task instructions for the dataset and fine-tune the LVLM using character perturbation as label augmentation. In the inference phase, we first generate attribute descriptions for each image in the meta-task instructions through an adaptive framework. Then, we leverage these descriptions to select candidate classes. If the model's initial inference does not match any of the candidates, we reorganize the meta-task instructions and query the model again for the final inference.
Figure 3: Illustration of position bias on CUB and Flowers under the 5-way setting: Gold Balanced means the gold answers are evenly distributed across all five candidate positions. LVLM’s Answer Position shows the actual distribution of answers provided by the LVLM. Gold Fixed indicates that gold answers are fixed in a specific position. Gold Concentrated Position 1st indicates that all gold answers are fixed in the first candidate position.
Figure 4: Comparison of answer position distributions between our method and Untuned LVLM. NSD values indicate the normalized standard deviation between the model's actual output positions and the uniformly sampled gold positions.
Figure 5: Left: Accuracy of each individual attribute description. Right: Top-k accuracy of aggregated attributes.
...and 1 more figure
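The inference flow in the Figure 2 caption (select candidates from attribute descriptions, then re-query if the first answer falls outside them) can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's pipeline: token-overlap (Jaccard) similarity stands in for whatever description matching the authors use, `ask_model` is a hypothetical callable wrapping the LVLM, and "reorganizing" the instructions is simplified to restricting the options to the candidates.

```python
from typing import Callable, Dict, List, Sequence


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity between two descriptions (stand-in metric)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def select_candidates(query_desc: str,
                      support_descs: Dict[str, List[str]],
                      top_k: int) -> List[str]:
    """Rank classes by their best-matching support description."""
    scores = {
        label: max((jaccard(query_desc, d) for d in descs), default=0.0)
        for label, descs in support_descs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


def classify(query_desc: str,
             support_descs: Dict[str, List[str]],
             ask_model: Callable[[Sequence[str]], str],
             top_k: int = 2) -> str:
    """Query once with all options; re-query on the candidates if needed."""
    options = list(support_descs)
    candidates = select_candidates(query_desc, support_descs, top_k)
    first = ask_model(options)
    if first in candidates:
        return first
    # Assumed form of "reorganizing": ask again with only the candidates.
    return ask_model(candidates)
```

A toy usage with a stub model that always picks the first option, mimicking the positional bias shown in Figure 1:

```python
support = {
    "sparrow": ["small brown bird short beak"],
    "heron": ["tall grey bird long legs long beak"],
    "robin": ["small bird red breast"],
}
first_option_model = lambda options: options[0]
print(classify("tall bird with long legs", support, first_option_model))
```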
