Zero-Shot Fine-Grained Image Classification Using Large Vision-Language Models
Md. Atabuzzaman, Andrew Zhang, Chris Thomas
TL;DR
This work tackles zero-shot fine-grained image classification by reframing the task as visual question answering with Large Vision-Language Models (LVLMs). It introduces an iterative multiple-choice question-answering (MCQA) pipeline and a lightweight attention intervention to better align early visual cues with deep semantic grounding, plus curated, attribute-rich class descriptions to bridge the gap between visuals and discriminative attributes. The authors demonstrate substantial, consistent improvements over prior zero-shot methods across five fine-grained benchmarks, achieving state-of-the-art performance in many settings, and analyze trade-offs between iterative MCQA and all-at-once inference, model scale, and description quality. The results highlight the practical potential of LVLM-driven zero-shot fine-grained classification, especially when combined with precise descriptions and targeted attention guidance, with implications for scalable, training-free recognition in visually similar categories.
Abstract
Large Vision-Language Models (LVLMs) have demonstrated impressive performance on vision-language reasoning tasks. However, their potential for zero-shot fine-grained image classification, a challenging task requiring precise differentiation between visually similar categories, remains underexplored. We present a novel method that transforms zero-shot fine-grained image classification into a visual question-answering framework, leveraging LVLMs' comprehensive understanding capabilities rather than relying on direct class name generation. We enhance model performance through a novel attention intervention technique. We also address a key limitation in existing datasets by developing more comprehensive and precise class description benchmarks. We validate the effectiveness of our method through extensive experimentation across multiple fine-grained image classification benchmarks. Our proposed method consistently outperforms the current state-of-the-art (SOTA) approach, demonstrating both the effectiveness of our method and the broader potential of LVLMs for zero-shot fine-grained classification tasks.
Code and Datasets: https://github.com/Atabuzzaman/Fine-grained-classification
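As a rough illustration of the iterative MCQA idea mentioned in the TL;DR, the following Python sketch narrows a candidate label set over several rounds of multiple-choice questions. It is not the authors' implementation: the tournament-style grouping, the prompt wording, the `options_per_question` parameter, and the `ask_model` callable (a stand-in for an LVLM that has already been given the image) are all assumptions made for illustration.

```python
import string
from typing import Callable, Sequence


def build_mcq_prompt(options: Sequence[str]) -> str:
    """Format candidate class names as a lettered multiple-choice question.

    The wording is hypothetical; it is not taken from the paper.
    """
    lines = ["Which category best describes the object in the image?"]
    lines += [
        f"{string.ascii_uppercase[i]}. {name}" for i, name in enumerate(options)
    ]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)


def iterative_mcqa(
    classes: Sequence[str],
    ask_model: Callable[[str], str],
    options_per_question: int = 4,
) -> str:
    """Reduce `classes` to one label by repeated multiple-choice rounds.

    `ask_model` is a hypothetical callable that takes a prompt and returns
    the model's reply (expected to start with an option letter).
    """
    if not classes:
        raise ValueError("classes must not be empty")
    if not 2 <= options_per_question <= len(string.ascii_uppercase):
        raise ValueError("options_per_question must be between 2 and 26")

    candidates = list(classes)
    while len(candidates) > 1:
        winners = []
        for start in range(0, len(candidates), options_per_question):
            group = candidates[start:start + options_per_question]
            if len(group) == 1:
                # A lone leftover candidate advances without a question.
                winners.append(group[0])
                continue
            reply = ask_model(build_mcq_prompt(group)).strip().upper()
            index = string.ascii_uppercase.find(reply[0]) if reply else -1
            if not 0 <= index < len(group):
                # Assumption: fall back to the first option on an
                # unparseable reply instead of failing.
                index = 0
            winners.append(group[index])
        # Each round keeps one winner per group, so the list shrinks.
        candidates = winners
    return candidates[0]
```

With a real LVLM, `ask_model` would wrap a single model call for one image. Smaller groups mean more questions per image but fewer options per question, which is one way to think about the trade-off against all-at-once inference that the TL;DR mentions.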
