The Power of One: A Single Example is All it Takes for Segmentation in VLMs

Mir Rayat Imtiaz Hossain, Mennatullah Siam, Leonid Sigal, James J. Little

TL;DR

This work shows that a single visual example per category, coupled with an entropy-based InfoScore ranking of text-to-image attention layers and image-text scoring, can substantially boost open-vocabulary segmentation with vision-language models. It introduces two practical modes: a training-free pipeline that selects top layers and re-weights heatmaps, and a one-shot fine-tuning regime that updates a compact parameter subset to form an ensemble across layers/prompts. The approach achieves state-of-the-art open-vocabulary performance on multiple benchmarks and demonstrates strong generalization across BLIP, ALBEF, and LLaVA, with scalability to additional VLMs. By reducing reliance on extensive prompts and labeled segmentation data, it offers a flexible, scalable path toward robust open-vocabulary segmentation in real-world settings.

Abstract

Large-scale vision-language models (VLMs), trained on extensive datasets of image-text pairs, exhibit strong multimodal understanding capabilities by implicitly learning associations between textual descriptions and image regions. This emergent ability enables zero-shot object detection and segmentation, using techniques that rely on text-image attention maps, without necessarily training on abundant labeled segmentation datasets. However, performance of such methods depends heavily on prompt engineering and manually selected layers or head choices for the attention layers. In this work, we demonstrate that, rather than relying solely on textual prompts, providing a single visual example for each category and fine-tuning the text-to-image attention layers and embeddings significantly improves the performance. Additionally, we propose learning an ensemble through few-shot fine-tuning across multiple layers and/or prompts. An entropy-based ranking and selection mechanism for text-to-image attention layers is proposed to identify the top-performing layers without the need for segmentation labels. This eliminates the need for hyper-parameter selection of text-to-image attention layers, providing a more flexible and scalable solution for open-vocabulary segmentation. We show that this approach yields strong zero-shot performance, further enhanced through fine-tuning with a single visual example. Moreover, we demonstrate that our method and findings are general and can be applied across various vision-language models (VLMs).
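
To make the idea of label-free layer ranking concrete, here is a minimal sketch. It is not the paper's InfoScore, whose exact definition is not given in this summary. As a stand-in, it assumes that a layer whose text-to-image attention maps are more spatially peaked (lower Shannon entropy) is more useful for localization. The function names `spatial_entropy` and `rank_layers_by_entropy`, the dictionary layout, and the toy maps are all hypothetical.

```python
import math

def spatial_entropy(attn_map):
    """Shannon entropy (in nats) of a non-negative 2D map, normalized to sum to 1."""
    flat = [v for row in attn_map for v in row]
    total = sum(flat)
    if total <= 0:
        # Degenerate map: treat as uniform, i.e. maximum entropy.
        return math.log(len(flat))
    return -sum((v / total) * math.log(v / total) for v in flat if v > 0)

def rank_layers_by_entropy(maps_per_layer):
    """Rank layers from lowest to highest mean entropy of their attention maps.

    maps_per_layer: dict mapping a layer name to a list of 2D attention maps.
    ASSUMPTION: lower entropy is treated as "better"; the paper's InfoScore
    may combine entropy terms differently.
    """
    mean_entropy = {
        layer: sum(spatial_entropy(m) for m in maps) / len(maps)
        for layer, maps in maps_per_layer.items()
    }
    return sorted(mean_entropy, key=mean_entropy.get)

# Toy input: one peaked map and one uniform map (no segmentation labels used).
toy_maps = {
    "layer_a": [[[0.0, 0.0], [0.0, 1.0]]],
    "layer_b": [[[0.25, 0.25], [0.25, 0.25]]],
}
top_k_layers = rank_layers_by_entropy(toy_maps)[:1]
```

The point of the sketch is only that such a ranking needs attention maps and no ground-truth masks, which is what allows layer selection without segmentation labels.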

Paper Structure

This paper contains 22 sections, 6 equations, 6 figures, 13 tables.

Figures (6)

  • Figure 1: Segmentation using vision foundational models (VFMs) can be broadly categorized based on their pre-training: the right half includes models pretrained specifically for segmentation with dense annotations, while the left half comprises models not pretrained for segmentation tasks. Each category is further divided into four distinct approaches (clockwise): the top right contains models trained on extensive data for segmentation; the bottom right includes models pretrained for segmentation then evaluated on novel categories with few-shot data; the bottom left represents training-free segmentation models, typically vision-language models (VLMs) trained solely on image-text pairs; and the top left (ours) features a hybrid approach allowing both training-free inference and one-shot fine-tuning.
  • Figure 2: Model Overview. Our segmentation framework leverages VLMs trained on image-text pairs, supporting training-free inference and one-shot fine-tuning. For training-free inference, given class names and a query image, we extract text-to-image attention maps from top-$K$ layers (e.g., Layer 2 and Layer L), selected via InfoScore (see Sec. \ref{subsec:infoscore}). These maps are re-weighted with class VLM scores to filter irrelevant categories (see Sec. \ref{subsec:ca_map_description}) and used for prediction. In one-shot fine-tuning, we adjust text embeddings and top-$K$ attention layer parameters (see Sec. \ref{subsec:fine-tune}) to further improve the performance.
  • Figure 3: Illustration of the InfoScore Metric on BLIP. The mIoU Rank reflects the descending order of mIoU values (in red) derived from cross-attention maps for each standalone layer (labeled Layer$N$, top) on the PASCAL VOC 2012 validation set (1449 images), compared to the predicted InfoS Rank (bottom) based on our InfoScore metric (in blue) requiring no annotations. Most InfoScore rankings align with mIoU Rankings, with minor displacements of $\pm2$ positions highlighted in bold, except for four layers. Empirically, the top-1 and top-2 layers are correctly identified and consistently deliver better performance across four datasets and three VLMs.
  • Figure 4: Qualitative Results on PASCAL-21: Shown are results from zero-shot model w/o image-text scoring (3rd column), zero-shot model w/ image-text scoring (4th column), and one-shot fine-tuning (5th column). The final row shows an example where the zero-shot prediction outperformed the fine-tuned one-shot model. For all variants, we ensemble the top-2 layers ranked by InfoScore.
  • Figure 5: Ablation on second pair. Ablation study on the mIoU pairing the Top-1 layer (Layer3) with all other layers using BLIP and evaluated on PASCAL-21.
  • ...and 1 more figure
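
The training-free inference path described in the Figure 2 caption (attention maps from the selected layers, re-weighted by per-class image-text scores, then used for prediction) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: averaging across layers, multiplying by the class score, and taking a per-pixel argmax are assumed choices, background handling is omitted, and `ensemble_and_reweight` and its input layout are hypothetical.

```python
def ensemble_and_reweight(layer_maps, class_scores):
    """Combine per-layer class attention maps and re-weight them by class scores.

    layer_maps: list with one entry per selected layer; each entry is a dict
        mapping a class name to a 2D attention map (all maps share one shape).
    class_scores: dict mapping a class name to an image-text score in [0, 1].
    Returns a 2D grid holding the highest-scoring class name per pixel.
    ASSUMPTION: mean over layers, then multiplication by the class score.
    """
    classes = list(class_scores)
    first_map = layer_maps[0][classes[0]]
    height, width = len(first_map), len(first_map[0])

    prediction = []
    for y in range(height):
        row = []
        for x in range(width):
            def weighted(cls):
                mean_attn = sum(lm[cls][y][x] for lm in layer_maps) / len(layer_maps)
                return class_scores[cls] * mean_attn
            row.append(max(classes, key=weighted))
        prediction.append(row)
    return prediction

# Toy example: two selected layers, a 1x2 image, two candidate classes.
toy_layers = [
    {"cat": [[0.6, 0.2]], "dog": [[0.4, 0.8]]},
    {"cat": [[0.6, 0.2]], "dog": [[0.4, 0.8]]},
]
unweighted = ensemble_and_reweight(toy_layers, {"cat": 1.0, "dog": 1.0})
reweighted = ensemble_and_reweight(toy_layers, {"cat": 0.9, "dog": 0.1})
```

In the toy example, equal class scores let "dog" win the second pixel, while a low image-text score for "dog" suppresses it everywhere. That is the sense in which re-weighting filters categories that are unlikely to be present in the image.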