Thinking Beyond Labels: Vocabulary-Free Fine-Grained Recognition using Reasoning-Augmented LMMs

Dmitry Demidov, Zaigham Zaheer, Zongyan Han, Omkar Thawakar, Rao Anwer

TL;DR

FiNDR is a fully automated, vocabulary-free fine-grained visual recognition (FGVR) system powered by reasoning-augmented LMMs. It generates descriptive candidate class names via reasoning, refines them with a vision-language model, and couples textual and visual prototypes into a lightweight classifier, all without a predefined vocabulary. The approach achieves state-of-the-art results on multiple fine-grained benchmarks and even surpasses zero-shot baselines that use ground-truth names, challenging the assumption that human-curated vocabularies are optimal. Through ablations and prompt analyses, the authors show that open-source LMMs with carefully designed prompts can match proprietary models, offering scalable, interpretable open-world recognition. The work presents prompting strategy as a practical route to high-performance vocabulary-free FGVR and provides guidelines for using public LMMs.

Abstract

Vocabulary-free fine-grained image recognition aims to distinguish visually similar categories within a meta-class without a fixed, human-defined label set. Existing solutions are limited either by reliance on a large, rigid vocabulary list or by dependence on complex pipelines with fragile heuristics, where errors propagate across stages. Meanwhile, the ability of recent large multi-modal models (LMMs) equipped with explicit or implicit reasoning to comprehend visual-language data, decompose problems, retrieve latent knowledge, and self-correct suggests a more principled and effective alternative. Building on these capabilities, we propose FiNDR (Fine-grained Name Discovery via Reasoning), the first reasoning-augmented LMM-based framework for vocabulary-free fine-grained recognition. The system operates in three automated steps: (i) a reasoning-enabled LMM generates descriptive candidate labels for each image; (ii) a vision-language model filters and ranks these candidates to form a coherent class set; and (iii) the verified names instantiate a lightweight multi-modal classifier used at inference time. Extensive experiments on popular fine-grained classification benchmarks demonstrate state-of-the-art performance under the vocabulary-free setting, with a relative margin of up to 18.8% over previous approaches. Remarkably, the proposed method surpasses zero-shot baselines that exploit pre-defined ground-truth names, challenging the assumption that human-curated vocabularies define an upper bound. Additionally, we show that carefully curated prompts enable open-source LMMs to match proprietary counterparts. These findings establish reasoning-augmented LMMs as an effective foundation for scalable, fully automated, open-world fine-grained visual recognition. The source code is available at github.com/demidovd98/FiNDR.
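Step (ii) of the abstract, where a vision-language model filters and ranks candidate names, can be illustrated with a minimal sketch. This is not the authors' implementation: the voting rule, the function names `cosine` and `refine_vocabulary`, the `min_votes` threshold, and the toy 2-D embeddings are all assumptions made for illustration. In practice the embeddings would come from a vision-language model's image and text encoders.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def refine_vocabulary(image_embs, name_embs, min_votes=1):
    """Rank candidate names by how many images select them as the best match.

    Hypothetical rule (an assumption, not the paper's exact procedure):
    every image votes for its most similar candidate name, and names
    with fewer than `min_votes` votes are filtered out.
    """
    votes = Counter()
    for img in image_embs:
        best = max(name_embs, key=lambda name: cosine(img, name_embs[name]))
        votes[best] += 1
    return [(name, count) for name, count in votes.most_common() if count >= min_votes]


# Toy embeddings standing in for VLM image/text features (illustrative only).
images = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
candidates = {
    "Abyssinian cat": [1.0, 0.0],
    "Bengal cat": [0.0, 1.0],
    "animal": [-1.0, -1.0],  # overly generic name that no image selects
}
refined = refine_vocabulary(images, candidates)
```

Here the generic candidate receives no votes and drops out, while the two specific names are kept in order of support. Raising `min_votes` makes the filter stricter.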

Paper Structure

This paper contains 36 sections, 6 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Qualitative comparison of the predicted label vocabulary between the previous state-of-the-art method (FineR) and our approach (FiNDR) on the Oxford Pets dataset. The predicted class names are obtained for the unlabelled discovery set. All predictions are sorted by similarity score, with the top-3 and bottom-3 labels shown for each method. Legend: green marks correct predictions, orange partially correct predictions, red incorrect predictions, and pink failed predictions.
  • Figure 2: Overview of our FiNDR framework for vocabulary-free fine-grained image classification. The pipeline operates in two primary phases: vocabulary discovery and classifier preparation. In the vocabulary discovery phase, a large multi-modal model (LMM) generates dataset-level meta information from an unlabelled discovery set ($D_{\text{disc}}$), followed by initial class name predictions ($\tilde{c}$). During classifier preparation, a vision-language model (VLM) ranks and filters these candidates to produce a refined vocabulary set ($\tilde{c}^{*}$). Then, a modality-coupling step combines visual and textual embeddings to create a unified vision-language classifier ($W_{\text{VL}}$). At inference time, this classifier assigns interpretable, fine-grained semantic labels to unseen test images ($D_{\text{test}}$) without relying on any predefined vocabulary.
  • Figure 3: Analysis and comparison of the class names generated by our FiNDR and the ground-truth labels from the Oxford Flowers and Stanford Cars datasets. Our framework predicts correct label names, but these names do not always fully match the biased, human-provided ground-truth labels. This explains why clustering accuracy improves substantially while the semantic accuracy metric improves only modestly (see also the main results table).
  • Figure 4: Example of the meta prompt input and output for the CUB-200 dataset.
  • Figure 5: Example of the main prompt input and output for the CUB-200 dataset.
  • ...and 4 more figures
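
The modality-coupling step described in the Figure 2 caption, which combines textual and visual embeddings into a classifier $W_{\text{VL}}$, might look roughly like the sketch below. The source does not specify the coupling rule, so the convex mix with weight `alpha`, the helper names, and the toy embeddings are assumptions for illustration only.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def build_coupled_classifier(text_protos, visual_members, alpha=0.5):
    """Build one unit-length weight vector per class name.

    Assumed coupling rule (not taken from the paper): a convex mix of the
    text embedding and the mean visual embedding of the images assigned
    to that name. Classes without assigned images fall back to text only.
    """
    classifier = {}
    for name, text_emb in text_protos.items():
        text_unit = l2_normalize(text_emb)
        members = visual_members.get(name, [])
        if members:
            visual_unit = l2_normalize(mean_vector([l2_normalize(m) for m in members]))
            mixed = [alpha * t + (1 - alpha) * v for t, v in zip(text_unit, visual_unit)]
        else:
            mixed = text_unit
        classifier[name] = l2_normalize(mixed)
    return classifier


def classify(image_emb, classifier):
    """Return the class name whose weight vector is most similar to the image."""
    x = l2_normalize(image_emb)
    return max(classifier, key=lambda name: sum(a * b for a, b in zip(x, classifier[name])))


# Toy embeddings standing in for VLM features (illustrative only).
text_protos = {"Abyssinian cat": [1.0, 0.0], "Bengal cat": [0.0, 1.0]}
visual_members = {
    "Abyssinian cat": [[0.9, 0.1], [1.0, 0.0]],
    "Bengal cat": [[0.1, 0.9]],
}
w_vl = build_coupled_classifier(text_protos, visual_members)
label = classify([0.8, 0.2], w_vl)
```

Setting `alpha=1.0` reduces this to a text-only zero-shot classifier over the discovered names, and `alpha=0.0` to a purely visual prototype classifier.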