Few-Shot Image Classification and Segmentation as Visual Question Answering Using Vision-Language Models

Tian Meng, Yang Tao, Ruilin Lyu, Wuliang Yin

TL;DR

The paper reframes few-shot image classification and segmentation (FS-CS) as Visual Question Answering (VQA) with Vision-Language Models (VLMs), which allows training-free classification and segmentation from image-level labels alone. The proposed VISE framework pairs a VLM with off-the-shelf vision tools (YOLO for detection and SAM for segmentation) and converts FS-CS into a VQA task guided by in-context prompts and chain-of-thought reasoning. Results on Pascal-5i and COCO-20i show state-of-the-art segmentation performance (mIoU) and competitive classification accuracy, which points to the benefit of combining VLMs with specialized vision tools. Because the framework is modular, individual tools can be replaced without extensive retraining, and supplementary analyses of error and success cases indicate where future refinements are needed.

Abstract

The task of few-shot image classification and segmentation (FS-CS) involves classifying and segmenting target objects in a query image, given only a few examples of the target classes. We introduce the Vision-Instructed Segmentation and Evaluation (VISE) method that transforms the FS-CS problem into the Visual Question Answering (VQA) problem, utilising Vision-Language Models (VLMs), and addresses it in a training-free manner. By enabling a VLM to interact with off-the-shelf vision models as tools, the proposed method is capable of classifying and segmenting target objects using only image-level labels. Specifically, chain-of-thought prompting and in-context learning guide the VLM to answer multiple-choice questions like a human; vision models such as YOLO and Segment Anything Model (SAM) assist the VLM in completing the task. The modular framework of the proposed method makes it easily extendable. Our approach achieves state-of-the-art performance on the Pascal-5i and COCO-20i datasets.
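As a rough illustration of the multiple-choice formulation described in the abstract, the sketch below turns the class names of an N-way task into a lettered question and maps a reply back to class names. The function names, the question wording, the "none of the above" option, and the comma-separated reply format are all assumptions made for this example; the paper's actual prompts may differ.

```python
import string


def build_question(class_names):
    """Format an N-way classification step as a multiple-choice question.

    Hypothetical helper: the wording and the extra "none of the above"
    option are illustrative assumptions, not the paper's prompt.
    """
    options = list(class_names) + ["none of the above"]
    letters = string.ascii_uppercase[:len(options)]
    lines = ["Which of the following objects appear in the image? "
             "Select every option that applies."]
    lines += [f"({letter}) {name}" for letter, name in zip(letters, options)]
    return "\n".join(lines), dict(zip(letters, options))


def parse_answer(reply, letter_to_name):
    """Map a reply such as 'A, C' back to class names.

    Assumes the model was asked to answer with comma-separated letters.
    Unknown tokens are ignored and duplicates are dropped.
    """
    picked = []
    for token in reply.split(","):
        letter = token.strip().strip("()").upper()
        name = letter_to_name.get(letter)
        if name is not None and name not in picked:
            picked.append(name)
    return picked


question, mapping = build_question(["cat", "dog"])
predicted = parse_answer("A, b", mapping)
```

In a full pipeline the question text would be sent to the VLM together with the support and query images; that call is omitted here because it depends on the specific model API.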

Paper Structure

This paper contains 21 sections, 5 equations, 7 figures, and 3 tables.

Figures (7)

  • Figure 1: Few-Shot Classification & Segmentation Task Solved by Vision-Language Models. When a VLM such as GPT-4Vision is given vision tools, it can solve the few-shot image classification and segmentation task in a training-free manner using only image-level labels.
  • Figure 2: VISE framework, in which a VLM uses visual tools to solve the FS-CS task. First, an N-way K-shot FS-CS task is sampled from the database. Query images are passed to an object detection tool to obtain bounding boxes. Then, based on the support set, the original FS-CS task is transformed into a multiple-choice VQA task. Finally, an image segmentation tool produces the final segmentation mask for the query set.
  • Figure 3: An example of a 2-way 1-shot FS-CS task in the COCO-20i dataset.
  • Figure 4: The VQA formulation for the VLM.
  • Figure 5: Classification mistakes. Errors are marked in red, ambiguous conclusions in yellow, and correct results in green.
  • ...and 2 more figures
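
The first step in the Figure 2 caption, sampling an N-way K-shot task, can be sketched as follows. This is a minimal example under stated assumptions: `images_by_class` is a hypothetical mapping from class name to image identifiers, and the query is drawn from the images not used as support, which simplifies the FS-CS setting in which a query may contain several target classes or none.

```python
import random


def sample_episode(images_by_class, n_way, k_shot, seed=None):
    """Sample one N-way K-shot episode from a class -> image-id mapping.

    Hypothetical helper for illustration; the data layout and the rule
    for choosing the query image are assumptions, not the paper's code.
    """
    rng = random.Random(seed)
    # Sort class names so that a fixed seed gives a reproducible episode.
    classes = rng.sample(sorted(images_by_class), n_way)
    support = {}
    leftovers = []
    for name in classes:
        shuffled = list(images_by_class[name])
        rng.shuffle(shuffled)
        if len(shuffled) < k_shot:
            raise ValueError(f"class {name!r} has fewer than {k_shot} images")
        support[name] = shuffled[:k_shot]   # K support images per class
        leftovers.extend(shuffled[k_shot:])  # candidates for the query
    if not leftovers:
        raise ValueError("no images left to use as a query")
    return {"classes": classes, "support": support, "query": rng.choice(leftovers)}


data = {
    "cat": ["c1", "c2", "c3"],
    "dog": ["d1", "d2", "d3"],
    "bus": ["b1", "b2", "b3"],
}
episode = sample_episode(data, n_way=2, k_shot=1, seed=0)
```

The later steps in the caption (detection, the VQA conversion, and segmentation) would consume `episode["query"]` and `episode["support"]`; they are left out because they depend on the chosen detector, VLM, and segmentation tool.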