DiVE-k: Differential Visual Reasoning for Fine-grained Image Recognition

Raja Kumar, Arka Sadhu, Ram Nevatia

TL;DR

DiVE-k tackles zero-shot fine-grained image recognition with LVLMs by turning the model’s own top-$K$ predictions into a verifiable MCQ task and training with RL to enforce differential, attribute-grounded reasoning. The method comprises offline top-$k$ option mining followed by GRPO-based MCQ training, and uses a two-step inference pipeline to select the correct option. Empirical results on five standard fine-grained datasets show substantial improvements in base-to-novel generalization, mixed-domain transfer, and few-shot classification, with especially strong gains on CUB and Flowers. The work highlights the importance of sampling options from the model’s distribution, joint vision–text fine-tuning, and controlled inference cost via the hyperparameter $K$, offering a promising direction for robust fine-grained discrimination in LVLMs.

Abstract

Large Vision Language Models (LVLMs) possess extensive text knowledge but struggle to utilize this knowledge for fine-grained image recognition, often failing to differentiate between visually similar categories. Existing fine-tuning methods that use Reinforcement Learning (RL) with exact-match reward signals are often brittle, encourage memorization of training categories, and fail to elicit the differential reasoning needed for generalization to unseen classes. To address this, we propose $\textbf{DiVE-k}$, $\textbf{Di}$fferential $\textbf{V}$isual r$\textbf{E}$asoning using top-$\textbf{k}$ generations, a framework that leverages the model's own top-k predictions as a training signal. For each training image, DiVE-k creates a multiple-choice question from the model's top-k outputs and uses RL to train the model to select the correct answer. This approach requires the model to perform fine-grained differential reasoning among plausible options and provides a simple, verifiable reward signal that mitigates memorization and improves generalization. Experiments on five standard fine-grained datasets show that our method significantly outperforms existing approaches. In the standard base-to-novel generalization setting, DiVE-k surpasses QWEN2.5-VL-7B and ViRFT by 10.04% and 6.16% on the Harmonic Mean metric, respectively. Further experiments show similar gains in mixed-domain and few-shot scenarios.
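
For readers unfamiliar with the Harmonic Mean metric mentioned above: assuming the usual convention for base-to-novel evaluation (the abstract itself does not spell it out), it combines accuracy on base (seen) classes and novel (unseen) classes so that a model cannot score well by excelling on only one of the two. The symbols $\mathrm{Acc}_{\text{base}}$ and $\mathrm{Acc}_{\text{novel}}$ are notation introduced here for illustration.

$$
H = \frac{2 \cdot \mathrm{Acc}_{\text{base}} \cdot \mathrm{Acc}_{\text{novel}}}{\mathrm{Acc}_{\text{base}} + \mathrm{Acc}_{\text{novel}}}
$$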

Paper Structure

This paper contains 26 sections, 4 equations, 9 figures, 11 tables.

Figures (9)

  • Figure 1: In fine-grained image recognition, the most salient visual attributes are often insufficient to identify the correct category because they are common among similar categories. (a) This leads to a significant gap between the model's Pass@1 and Pass@20 accuracy. (b) Differential reasoning can point out the key visual attributes that distinguish similar categories. The base model fails to use such discriminative features, relying only on prominent visual features. We address this by using the top-k predictions as options (the categories the base model most likely confuses the image with) and using the model's text knowledge to resolve the confusion through differential reasoning (highlighted in green).
  • Figure 2: An overview of the DiVE-k framework. First, we perform offline option mining (red box): for each training image, we sample $K$ rollouts from a pretrained LVLM and select the top-k options by frequency, ensuring the ground truth appears among them. Next, we perform RL training using GRPO on MCQ prompts (green box): the model receives an image, a natural-language prompt, and k options as input, produces a reasoning chain and a final choice, and is optimized with a simple, verifiable reward that combines MCQ correctness and format compliance.
  • Figure 3: An example illustrating our inference pipeline (red arrows) compared with the existing method (blue arrows). As in the training phase, we perform inference in two steps (right of the dotted line): we first generate options by choosing the top-k responses from $K$ rollouts, and the model then picks the correct answer among those options, unlike the open-ended one-step inference of existing methods (left of the dotted line).
  • Figure 4: Qualitative comparison on fine-grained flower recognition (top: ViRFT; bottom: Ours). Top: ViRFT predicts "global thistle," which is incorrect and reflects a coarse judgment. Bottom: Our method enumerates close candidates and uses attribute-grounded, differential reasoning over attributes such as capitulum/head shape, floret density and arrangement, and bract patterning to select the correct fine-grained label, with a justification aligned to the final choice.
  • Figure 5: Change in classification accuracy for different values of $K$ across datasets. Our approach consistently outperforms the baselines for nearly all $K$.
  • ...and 4 more figures
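
The option mining and reward described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the `<think>`/`<answer>` response format, the choice to overwrite the least frequent option when the ground truth is missing, and the equal weighting of the two reward terms are all assumptions made for the example.

```python
import re
from collections import Counter

def mine_options(rollouts, ground_truth, k):
    """Select the k most frequent labels among sampled rollouts.

    Assumption: if the ground truth is not among them, it overwrites the
    least frequent selected option (the caption only says it must appear).
    """
    counts = Counter(r.strip().lower() for r in rollouts)
    options = [label for label, _ in counts.most_common(k)]
    gt = ground_truth.strip().lower()
    if gt not in options:
        if len(options) >= k:
            options[-1] = gt   # replace the least frequent option
        else:
            options.append(gt)  # fewer than k distinct predictions
    return options

def build_mcq_prompt(options):
    """Format the options as a lettered multiple-choice question (hypothetical wording)."""
    lines = [f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(options)]
    return "Which category does the image show?\n" + "\n".join(lines)

# Assumed response format: a reasoning chain followed by a single option letter.
ANSWER_RE = re.compile(
    r"<think>.*?</think>\s*<answer>\s*([A-Z])\s*</answer>", re.DOTALL
)

def mcq_reward(response, correct_letter):
    """Verifiable reward: format compliance plus MCQ correctness (equal weights assumed)."""
    match = ANSWER_RE.fullmatch(response.strip())
    format_reward = 1.0 if match else 0.0
    correct_reward = 1.0 if match and match.group(1) == correct_letter else 0.0
    return format_reward + correct_reward

rollouts = ["Globe Thistle", "artichoke", "globe thistle",
            "spear thistle", "artichoke", "globe thistle"]
options = mine_options(rollouts, ground_truth="spear thistle", k=2)
prompt = build_mcq_prompt(options)
reward = mcq_reward("<think>The bracts are spiny.</think><answer>B</answer>", "B")
```

In practice the option order would likely be shuffled before building the prompt so that the ground truth does not sit in a predictable position; that step is left out here to keep the sketch deterministic.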