Enhancing Fine-Grained Image Classifications via Cascaded Vision Language Models
Canshi Wei
TL;DR
The paper addresses the challenge of zero-/few-shot fine-grained image classification, where CLIP-based vision-language models struggle to distinguish semantically similar sub-classes. It introduces CascadeVLM, a cascaded framework that first uses CLIP to filter candidate classes and then leverages large vision-language models (LVLMs) with zero-shot or few-shot prompts to refine predictions, guided by an adaptive entropy threshold for efficiency. Across six fine-grained datasets, CascadeVLM achieves superior zero-shot performance and notable gains in few-shot settings, with key insights into candidate ordering, efficiency trade-offs, and explainable reasoning. The approach demonstrates a practical pathway to integrate CLIP-like backbones with LVLMs for accurate, efficient, and interpretable fine-grained classification in real-world scenarios.
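The cascade summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `lvlm_choose` callable (a stand-in for an LVLM prompted with the candidate list), the default `top_k`, and the default `entropy_threshold` are all assumptions made for illustration. The sketch assumes the entropy is computed over CLIP's full softmax distribution and that candidates are passed to the LVLM in order of decreasing CLIP confidence.

```python
import math
from typing import Callable, List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def cascade_predict(
    clip_logits: Sequence[float],
    class_names: Sequence[str],
    lvlm_choose: Callable[[List[str]], str],  # hypothetical LVLM wrapper
    top_k: int = 3,                  # assumed value, not from the paper
    entropy_threshold: float = 1.0,  # assumed value, not from the paper
) -> str:
    """Return CLIP's top-1 class when CLIP is confident; otherwise ask the LVLM."""
    probs = softmax(clip_logits)
    # Rank class indices by CLIP probability, most likely first.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Low entropy means CLIP is confident, so the LVLM call is skipped.
    if entropy(probs) <= entropy_threshold:
        return class_names[ranked[0]]

    # Otherwise keep only the top-k candidates and let the LVLM refine.
    candidates = [class_names[i] for i in ranked[:top_k]]
    choice = lvlm_choose(candidates)
    # Fall back to CLIP's top-1 if the LVLM answers outside the candidate set.
    return choice if choice in candidates else candidates[0]
```

In practice `lvlm_choose` would build a zero-shot or few-shot prompt from the image and the candidate names; here it is left abstract so that the control flow of the cascade stays visible.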
Abstract
Fine-grained image classification, particularly in zero- and few-shot scenarios, presents a significant challenge for vision-language models (VLMs) such as CLIP. These models often struggle to distinguish between semantically similar classes because their pre-training recipe lacks supervision signals for fine-grained categorization. This paper introduces CascadeVLM, a framework that overcomes the constraints of previous CLIP-based methods by leveraging the granular knowledge encapsulated within large vision-language models (LVLMs). Experiments across various fine-grained image datasets demonstrate that CascadeVLM significantly outperforms existing models; on the Stanford Cars dataset, for example, it achieves 85.6% zero-shot accuracy. Performance gain analysis validates that LVLMs produce more accurate predictions for the challenging images that CLIP is uncertain about, which accounts for the overall accuracy boost. Our framework sheds light on a holistic integration of VLMs and LVLMs for effective and efficient fine-grained image classification.
