UniFGVC: Universal Training-Free Few-Shot Fine-Grained Vision Classification via Attribute-Aware Multimodal Retrieval

Hongyu Guo, Xiangzhao Hao, Jiarui Guo, Haiyun Guo, Jinqiao Wang, Tat-Seng Chua

TL;DR

UniFGVC addresses few-shot fine-grained visual classification by replacing parameter fine-tuning with a training-free, multimodal retrieval framework. It introduces the Category-Discriminative Visual Captioner (CDV-Captioner), which generates attribute-rich, structured descriptions guided by reference exemplars; these descriptions are used to build a multimodal category template gallery. Inference relies on nearest-neighbor retrieval in a fused visual-textual space, enabling robust discrimination without costly task-specific training. Experiments across 12 FGVC datasets show consistent gains over strong CLIP-based methods and competitive performance against fully supervised MLLM-based approaches, with the added benefit that new categories can be added without any retraining.

Abstract

Few-shot fine-grained visual classification (FGVC) aims to leverage limited data to enable models to discriminate between subtly distinct categories. Most recent work fine-tunes pre-trained vision-language models to improve performance, but this approach suffers from overfitting and weak generalization. To address this, we introduce UniFGVC, a universal training-free framework that reformulates few-shot FGVC as multimodal retrieval. First, we propose the Category-Discriminative Visual Captioner (CDV-Captioner), which exploits the open-world knowledge of multimodal large language models (MLLMs) to generate a structured text description capturing the fine-grained attribute features that distinguish closely related classes. CDV-Captioner uses chain-of-thought prompting and visually similar reference images to reduce hallucination and make the generated captions more discriminative. With it, we convert each image into an image-description pair, which enables a more comprehensive feature representation, and we construct the multimodal category templates from the few-shot samples for the subsequent retrieval pipeline. Then, off-the-shelf vision and text encoders embed the query and template pairs, and FGVC is accomplished by retrieving the nearest template in the joint space. UniFGVC is compatible with a broad range of MLLMs and encoders, offering reliable generalization and adaptability across few-shot FGVC scenarios. Extensive experiments on 12 FGVC benchmarks demonstrate its consistent superiority over prior few-shot CLIP-based methods and even over several fully supervised MLLM-based approaches.
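
The retrieval step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how the visual and textual embeddings are fused, so the weighted concatenation, the `alpha` parameter, the cosine-similarity scoring, and all function names below are assumptions made for illustration. The embeddings are toy vectors standing in for the outputs of off-the-shelf vision and text encoders, and the descriptions they would encode are assumed to come from a captioner such as CDV-Captioner.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    x = np.asarray(x, dtype=np.float64)
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def fuse(image_emb, text_emb, alpha=0.5):
    """Hypothetical fusion: weighted concatenation of unit-norm embeddings.

    The actual fusion used by UniFGVC is not given in the abstract;
    this choice is an assumption for illustration only.
    """
    v = l2_normalize(image_emb)
    t = l2_normalize(text_emb)
    fused = np.concatenate([alpha * v, (1.0 - alpha) * t], axis=-1)
    return l2_normalize(fused)


def classify(query_image_emb, query_text_emb,
             gallery_image_embs, gallery_text_embs, gallery_labels,
             alpha=0.5):
    """Return the label and cosine similarity of the nearest template."""
    q = fuse(query_image_emb, query_text_emb, alpha)        # shape (d,)
    g = fuse(gallery_image_embs, gallery_text_embs, alpha)  # shape (n, d)
    sims = g @ q                                            # cosine similarities
    best = int(np.argmax(sims))
    return gallery_labels[best], float(sims[best])


# Toy gallery: one template per class (1-shot), with made-up embeddings.
gallery_img = np.array([[1.0, 0.0], [0.0, 1.0]])
gallery_txt = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
labels = ["class_a", "class_b"]

label, score = classify([0.9, 0.1], [0.8, 0.2, 0.0],
                        gallery_img, gallery_txt, labels)
print(label, round(score, 3))
```

Under this formulation, supporting a new category only requires appending its template embeddings and label to the gallery arrays; no parameters are updated, which is what makes the approach training-free.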

Paper Structure

This paper contains 18 sections, 6 equations, 5 figures, 11 tables.

Figures (5)

  • Figure 1: Overview of different few-shot FGVC paradigms. (a) CLIP-based methods rely on fine-tuning to achieve fine-grained discrimination but show limited cross-domain generalization. (b) MLLM-based methods leverage generated image captions in model training to enhance fine-grained recognition, but the captions are often generic or hallucinated. (c) Our proposed UniFGVC shifts the paradigm from parameter fine-tuning to training-free retrieval-based inference, reframing the task as a multimodal retrieval problem with predefined category templates. Central to UniFGVC is the CDV-Captioner, which leverages chain-of-thought reasoning and visual references to mitigate hallucination and produce structured attribute-enriched textual descriptions.
  • Figure 2: An overview of the proposed UniFGVC. UniFGVC is a universal, training-free framework for few-shot fine-grained visual classification, which reformulates the task as a multimodal retrieval problem using structured attribute-aware representations. The CDV-Captioner progressively prompts the MLLM to output the structured fine-grained attribute-aware feature description of the target image, by integrating the category-related linguistic priors inherent in the MLLM and visual priors derived from reference images.
  • Figure 3: Visualization examples of structured attribute descriptions generated by Raw-Desc, Sum-Desc and Similar-Ref approaches.
  • Figure 4: Visualization of UniFGVC across five datasets: OxfordPets, CUBbirds, Food101, StanfordCars, and Flowers102. (a) Visual features: features extracted directly from the image encoder. (b) Multimodal features: features obtained by generating structured textual descriptions with CDV-Captioner and fusing them with the image features.
  • Figure 5: Qualitative retrieval results under the 1-shot setting on three representative datasets: OxfordPets, CUBbirds and Flowers102. We compare three configurations: Image-Only, Raw-Desc and UniFGVC (Similar-Ref).