SARE: Sample-wise Adaptive Reasoning for Training-free Fine-grained Visual Recognition

Jingxiao Yang, DaLin He, Miao Pan, Ge Su, Wenqi Zhang, Yifeng Hu, Tangwei Li, Yuke Li, Xuhong Zhang

Abstract

Recent advances in Large Vision-Language Models (LVLMs) have enabled training-free Fine-Grained Visual Recognition (FGVR). However, effectively exploiting LVLMs for FGVR remains challenging due to the inherent visual ambiguity of subordinate-level categories. Existing methods predominantly adopt either retrieval-oriented or reasoning-oriented paradigms to tackle this challenge, but both are constrained by two fundamental limitations: (1) they apply the same inference pipeline to all samples without accounting for uneven recognition difficulty, leading to suboptimal accuracy and efficiency; (2) they lack mechanisms to consolidate and reuse error-specific experience, which causes repeated failures on similar challenging cases. To address these limitations, we propose SARE, a Sample-wise Adaptive REasoning framework for training-free FGVR. Specifically, SARE adopts a cascaded design that combines fast candidate retrieval with fine-grained reasoning, invoking the latter only when necessary. In the reasoning process, SARE incorporates a self-reflective experience mechanism that leverages past failures to provide transferable discriminative guidance during inference, without any parameter updates. Extensive experiments across 14 datasets show that SARE achieves state-of-the-art performance while substantially reducing computational overhead.

Paper Structure (25 sections, 6 equations, 11 figures, 7 tables)

Figures (11)

  • Figure 1: (a) Fine-Grained Visual Recognition challenge: sub-categories (e.g., Siberian Husky vs. Alaskan Malamute) exhibit subtle visual differences, requiring attention to localized discriminative features such as forehead patterns and tail shape. (b) For a fixed overall retrieval accuracy (70%), different sub-categories require substantially different confidence thresholds, indicating that identical confidence scores imply varying reliability across categories.
  • Figure 2: Overview of SARE. The framework performs fast prototype-based retrieval to generate candidate categories for a query image, followed by a class-conditional trigger that adaptively invokes fine-grained reasoning when retrieval confidence is insufficient. Experience distilled from past errors is injected as contextual guidance, enabling accurate and efficient training-free FGVR.
  • Figure 3: Comparison of SARE against baselines on the Stanford Cars dataset. SARE achieves the best balance between the two, significantly outperforming baselines in accuracy with lower inference overhead.
  • Figure 4: The proportion of samples triggering System 2 across datasets with varying recognition difficulty. The x-axis measures dataset-level difficulty as $100\% - \mathrm{CLIP}_{\text{Top-1}}$, where $\mathrm{CLIP}_{\text{Top-1}}$ is CLIP's Top-1 accuracy.
  • Figure 5: Performance of SARE across different backbone architectures. SARE consistently enhances performance across all backbones, with larger relative gains on lower-performing models.
  • ...and 6 more figures
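
The cascade summarized in Figures 1(b) and 2 (fast prototype retrieval, then a class-conditional trigger that escalates to reasoning only when confidence is insufficient) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names (`retrieve`, `classify`), the use of cosine similarity as the confidence score, the per-class threshold table, and the `reason_fn` callback standing in for the LVLM reasoning stage are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve(query, prototypes, k=3):
    """Fast stage: rank classes by similarity to their prototype, keep the top k."""
    scored = [(label, cosine(query, proto)) for label, proto in prototypes.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def classify(query, prototypes, thresholds, reason_fn, k=3):
    """Return (label, used_reasoning).

    The top candidate is accepted only if its score reaches the threshold
    of *its own class* (hypothetical per-class table). Classes without an
    entry always escalate to the slow stage.
    """
    candidates = retrieve(query, prototypes, k)
    top_label, top_score = candidates[0]
    if top_score >= thresholds.get(top_label, float("inf")):
        return top_label, False
    # Slow stage: hand the candidate labels to the reasoning callback.
    labels = [label for label, _ in candidates]
    return reason_fn(query, labels), True


# Toy data (hypothetical): 2-D "embeddings" and per-class thresholds.
prototypes = {"husky": [1.0, 0.0], "malamute": [0.0, 1.0]}
thresholds = {"husky": 0.90, "malamute": 0.99}


def stub_reasoner(query, labels):
    """Placeholder for fine-grained reasoning; simply keeps the first candidate."""
    return labels[0]


# Two queries with roughly the same top score (about 0.95) but different top classes.
print(classify([0.95, 0.31225], prototypes, thresholds, stub_reasoner))
print(classify([0.31225, 0.95], prototypes, thresholds, stub_reasoner))
```

In this toy setup the two queries have nearly identical top scores, yet only the second one is escalated, because its top class has a stricter threshold. That mirrors the observation in Figure 1(b) that identical confidence scores imply different reliability across categories. How SARE actually calibrates its thresholds and injects experience into the reasoning stage is not specified in this excerpt and is not modeled here.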