Thinking Beyond Labels: Vocabulary-Free Fine-Grained Recognition using Reasoning-Augmented LMMs
Dmitry Demidov, Zaigham Zaheer, Zongyan Han, Omkar Thawakar, Rao Anwer
TL;DR
FiNDR is a fully automated, vocabulary-free fine-grained visual recognition (FGVR) system powered by reasoning-augmented LMMs. It generates descriptive candidate class names via reasoning, refines them with a vision-language model, and couples textual and visual prototypes into a lightweight classifier, all without predefined vocabularies. The approach achieves state-of-the-art results on multiple fine-grained benchmarks and even surpasses zero-shot baselines that use ground-truth names, challenging the assumption that human-curated vocabularies are optimal. Through extensive ablations and prompt analyses, the authors show that open-source LMMs with careful prompting can match proprietary models, offering scalable, interpretable open-world recognition. The work highlights prompting strategy as a practical bridge to high-performance vocabulary-free FGVR and provides actionable guidelines for leveraging public LMMs.
Abstract
Vocabulary-free fine-grained image recognition aims to distinguish visually similar categories within a meta-class without a fixed, human-defined label set. Existing solutions are limited either by their reliance on large, rigid vocabulary lists or by their dependence on complex pipelines with fragile heuristics, where errors propagate across stages. Meanwhile, recent large multi-modal models (LMMs) equipped with explicit or implicit reasoning can comprehend visual-language data, decompose problems, retrieve latent knowledge, and self-correct, which suggests a more principled and effective alternative. Building on these capabilities, we propose FiNDR (Fine-grained Name Discovery via Reasoning), the first reasoning-augmented LMM-based framework for vocabulary-free fine-grained recognition. The system operates in three automated steps: (i) a reasoning-enabled LMM generates descriptive candidate labels for each image; (ii) a vision-language model filters and ranks these candidates to form a coherent class set; and (iii) the verified names instantiate a lightweight multi-modal classifier used at inference time. Extensive experiments on popular fine-grained classification benchmarks demonstrate state-of-the-art performance under the vocabulary-free setting, with a significant relative margin of up to 18.8% over previous approaches. Remarkably, the proposed method surpasses zero-shot baselines that exploit pre-defined ground-truth names, challenging the assumption that human-curated vocabularies define an upper bound. Additionally, we show that carefully curated prompts enable open-source LMMs to match proprietary counterparts. These findings establish reasoning-augmented LMMs as an effective foundation for scalable, fully automated, open-world fine-grained visual recognition. The source code is available at github.com/demidovd98/FiNDR.
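
To make the three steps above concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that step (i) has already produced candidate names and that a vision-language model has embedded both the names and the images into a shared space. The toy 2-D embeddings, the function names (filter_names, build_prototypes, classify), the vote threshold min_votes, and the mixing weight alpha are all hypothetical choices made for illustration only.

```python
import math
from collections import Counter


def normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def nearest_name(img, text_embs, names):
    """Name whose text embedding is most similar to the image embedding."""
    return max(names, key=lambda n: cosine(img, text_embs[n]))


def filter_names(image_embs, text_embs, min_votes=1):
    """Stand-in for step (ii): each image votes for its best-matching
    candidate name; names with fewer than min_votes votes are dropped."""
    votes = Counter(nearest_name(img, text_embs, list(text_embs))
                    for img in image_embs)
    return sorted(n for n, c in votes.items() if c >= min_votes)


def build_prototypes(image_embs, text_embs, names, alpha=0.5):
    """Stand-in for step (iii): blend each name's text embedding with the
    mean embedding of the images assigned to it (alpha is an assumption)."""
    protos = {}
    for name in names:
        members = [img for img in image_embs
                   if nearest_name(img, text_embs, names) == name]
        text = normalize(text_embs[name])
        if members:
            mean = [sum(col) / len(members) for col in zip(*members)]
            visual = normalize(mean)
        else:
            visual = text  # no assigned images: fall back to the text prototype
        protos[name] = normalize(
            [alpha * t + (1 - alpha) * v for t, v in zip(text, visual)])
    return protos


def classify(img, protos):
    """Inference: pick the class whose prototype is closest to the image."""
    return max(protos, key=lambda n: cosine(img, protos[n]))


# Hypothetical outputs of step (i) and of a VLM encoder (toy 2-D vectors).
TEXT_EMBS = {
    "painted bunting": [1.0, 0.0],
    "indigo bunting": [0.0, 1.0],
    "blue bird": [0.7, 0.7],  # a generic candidate that should be filtered out
}
IMAGE_EMBS = [[0.9, 0.1], [0.95, 0.05], [0.1, 0.9], [0.2, 0.8]]

names = filter_names(IMAGE_EMBS, TEXT_EMBS)
protos = build_prototypes(IMAGE_EMBS, TEXT_EMBS, names)
print(names)
print(classify([0.8, 0.2], protos))
```

In this toy setup the generic candidate receives no votes and is discarded, while the two specific names become classes with coupled text and image prototypes. A real system would replace the hand-written vectors with encoder outputs and the voting rule with whatever filtering and ranking criterion the released code uses.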
