Unlocking Few-Shot Capabilities in LVLMs via Prompt Conditioning and Head Selection

Adhemar de Senneville, Xavier Bou, Jérémy Anger, Rafael Grompone, Gabriele Facciolo

Abstract

Current Large Vision Language Models (LVLMs) excel at many zero-shot tasks such as image captioning, visual question answering, and OCR. However, these same models perform poorly at image classification, underperforming CLIP-based methods. This gap is surprising because many LVLMs use CLIP-pretrained vision encoders. Yet LVLMs are not inherently limited by CLIP's architecture of independent vision and text encoders. In CLIP, this separation biases classification toward class-name matching rather than joint visual-text reasoning. In this paper we show that, despite their poor raw performance, LVLMs can improve the class separability of visual features at inference time using prompt conditioning, and that LVLMs' internal representations, especially attention heads, can outperform the model itself at zero-shot and few-shot classification. We introduce Head Ensemble Classifiers (HEC) to bridge the performance gap between CLIP-based and LVLM-based classification methods. Inspired by Gaussian Discriminant Analysis, HEC ranks the most discriminative vision and text heads and combines them into a training-free classifier. We show that HEC achieves state-of-the-art performance in few-shot and zero-shot classification across 12 datasets.

Paper Structure (40 sections, 17 equations, 11 figures, 8 tables)

Figures (11)

  • Figure 1: CLIP-based methods vs. HEC (ours). CLIP-based methods encode class names and support-set images independently to construct a zero-shot and a few-shot classifier, respectively. Our method keeps the same two-classifier structure, but both distributions pass through a shared LLM decoder, which can be conditioned by a text prompt to include guidance on the domain or the classes of the support set. The few-shot classifier (HEC-V) builds on the distribution of a sparse set of heads from the LLM decoder. This subset, which we refer to as vision-heads, is selected using Gaussian Discriminant Analysis [bishop2006prml] (\ref{fig:main}). The zero-shot classifier (HEC-T) builds on the distribution of another sparse set of heads, which we refer to as text-heads. As in CLIP-based methods, the two classifiers can be combined into a single one (HEC-VT) by adding their output probabilities.
  • Figure 2: Overview of HEC-V. Given a prompt, we first encode all the images from the support set with the LVLM. We then extract the distribution of attention vectors (\ref{eq:head_extract}) for the last token across all heads in every layer. Next, using Gaussian Discriminant Analysis [bishop2006prml], we rank each head by its class separability. Lastly, given a query image, we ensemble the predictions of the top $k$ heads for that task by averaging their class probabilities.
  • Figure 3: Experiments to identify where the best classification representation lies in LVLMs. At different locations of Qwen2-VL, we compute linear probing accuracy averaged over a thousand 10-way 4-shot tasks across 10 datasets using Class conditioning. (a) Although vision tokens yield strong accuracy early on, inherited from the CLIP vision transformer, the last token builds better representations by integrating multimodal features from vision and text prompt tokens. (b) For each few-shot setup, a small number of top heads called vision-heads yield better performance than the summary token. (c) Similarly, for each zero-shot setup, a small number of top heads called text-heads yield better performance than the summary token.
  • Figure 4: Top head attention map. We concatenate images from the bird [MD_BIRDS] and aircraft [CD_aircraft] datasets horizontally in one support set. We then select the top vision-head for bird classification using the prompt "What type of bird is this?" and do the same for planes using the prompt "What type of plane is this?". The attention maps of the top vision-heads for birds (left) and planes (right) are overlaid on the image.
  • Figure 5: Ablation studies. Prompt Conditioning (left) and Head Ranking (right).
  • ...and 6 more figures
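
The HEC-V steps described in the Figure 2 caption (collect per-head features for the support set, rank heads with Gaussian Discriminant Analysis, then average the class probabilities of the top $k$ heads) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the per-head features are synthetic arrays rather than attention vectors from an LVLM, the GDA uses a shared ridge-regularised covariance with equal class priors, and heads are scored by the mean log-posterior of the true class on the support set. The actual ranking criterion is not given in this excerpt, and every function name and parameter below is hypothetical.

```python
import numpy as np

def fit_gda(feats, labels, n_classes, reg=0.1):
    """Fit class means and a shared, ridge-regularised covariance (assumed GDA variant)."""
    means = np.stack([feats[labels == c].mean(axis=0) for c in range(n_classes)])
    centered = feats - means[labels]
    cov = centered.T @ centered / len(feats) + reg * np.eye(feats.shape[1])
    return means, np.linalg.inv(cov)

def gda_probs(x, means, prec):
    """Class posteriors under equal priors: softmax of -0.5 * squared Mahalanobis distance."""
    diff = x[:, None, :] - means[None, :, :]               # (n, n_classes, d)
    maha = np.einsum("ncd,de,nce->nc", diff, prec, diff)   # (n, n_classes)
    logits = -0.5 * maha
    logits -= logits.max(axis=1, keepdims=True)            # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def rank_heads(support, labels, n_classes):
    """support: (n_heads, n_support, d). Returns head indices best-first and fitted models.

    Assumed score: mean log-posterior of the true class on the support set.
    """
    models, scores = [], []
    for h in range(support.shape[0]):
        means, prec = fit_gda(support[h], labels, n_classes)
        p = gda_probs(support[h], means, prec)
        scores.append(np.log(p[np.arange(len(labels)), labels] + 1e-12).mean())
        models.append((means, prec))
    return np.argsort(scores)[::-1], models

def hec_v_predict(query, order, models, k):
    """query: (n_heads, n_query, d). Average class probabilities over the top-k heads."""
    return np.mean([gda_probs(query[h], *models[h]) for h in order[:k]], axis=0)

def toy_heads(rng, n_per_class, informative, n_heads=6, n_classes=3, d=4, sep=8.0):
    """Synthetic stand-in for per-head features: only `informative` heads carry class signal."""
    labels = np.repeat(np.arange(n_classes), n_per_class)
    feats = rng.normal(size=(n_heads, len(labels), d))
    for h in informative:
        feats[h] += sep * np.eye(n_classes, d)[labels]     # class-dependent mean shift
    return feats, labels

rng = np.random.default_rng(0)
support, s_labels = toy_heads(rng, n_per_class=5, informative=(1, 4))
query, q_labels = toy_heads(rng, n_per_class=10, informative=(1, 4))

order, models = rank_heads(support, s_labels, n_classes=3)
probs = hec_v_predict(query, order, models, k=2)
print("top-2 heads:", order[:2])
print("query accuracy:", (probs.argmax(axis=1) == q_labels).mean())
```

In this toy setup only two of the six heads carry class information, so ranking should surface them and the ensemble should ignore the noise heads. With real LVLM features, the regularisation strength `reg` and the number of heads `k` would need tuning for the task.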