SARE: Sample-wise Adaptive Reasoning for Training-free Fine-grained Visual Recognition

Jingxiao Yang; DaLin He; Miao Pan; Ge Su; Wenqi Zhang; Yifeng Hu; Tangwei Li; Yuke Li; Xuhong Zhang

Abstract

Recent advances in Large Vision-Language Models (LVLMs) have enabled training-free Fine-Grained Visual Recognition (FGVR). However, effectively exploiting LVLMs for FGVR remains challenging due to the inherent visual ambiguity of subordinate-level categories. Existing methods predominantly adopt either retrieval-oriented or reasoning-oriented paradigms to tackle this challenge, but both are constrained by two fundamental limitations: (1) they apply the same inference pipeline to all samples without accounting for uneven recognition difficulty, which leads to suboptimal accuracy and efficiency; (2) they lack mechanisms to consolidate and reuse error-specific experience, which causes repeated failures on similar challenging cases. To address these limitations, we propose SARE, a Sample-wise Adaptive REasoning framework for training-free FGVR. Specifically, SARE adopts a cascaded design that combines fast candidate retrieval with fine-grained reasoning, invoking the latter only when necessary. In the reasoning process, SARE incorporates a self-reflective experience mechanism that leverages past failures to provide transferable discriminative guidance during inference, without any parameter updates. Extensive experiments across 14 datasets substantiate that SARE achieves state-of-the-art performance while substantially reducing computational overhead.
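The cascaded design described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`cosine`, `cascaded_predict`), the use of cosine similarity over class prototypes, the `reason` callback standing in for an LVLM call, and the per-class `thresholds` dictionary are all assumptions made for illustration.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cascaded_predict(
    query: Sequence[float],
    prototypes: Dict[str, Sequence[float]],      # hypothetical class prototypes
    thresholds: Dict[str, float],                # hypothetical per-class thresholds
    reason: Callable[[Sequence[float], List[str]], str],  # stand-in for LVLM reasoning
    top_k: int = 3,
    default_threshold: float = 0.5,
) -> Tuple[str, bool]:
    """Return (label, used_reasoning)."""
    # Fast path: rank all classes by similarity to the query.
    scored = sorted(
        ((cosine(query, proto), name) for name, proto in prototypes.items()),
        reverse=True,
    )
    candidates = scored[:top_k]
    best_score, best_name = candidates[0]

    # Class-conditional trigger: accept the retrieval result only if its
    # confidence clears the threshold of the predicted class.
    if best_score >= thresholds.get(best_name, default_threshold):
        return best_name, False

    # Slow path: hand the shortlisted candidates to the reasoning step.
    return reason(query, [name for _, name in candidates]), True
```

In this sketch the expensive `reason` callback runs only for samples whose retrieval confidence falls below the threshold of the top-ranked class, which is the sense in which inference is sample-wise adaptive.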

Paper Structure (25 sections, 6 equations, 11 figures, 7 tables)

Introduction
Methodology
Knowledge Base Construction
Sample-wise Adaptive Inference
Statistics-based Dynamic Trigger
Experiments
Setup
Main Results
Analysis
Effectiveness of Dynamic Trigger
Ablation Studies
Performance on Different Backbones
Transferability Analysis
Behavioral Analysis
Qualitative Analysis and Case Studies
...and 10 more sections

Figures (11)

Figure 1: (a) Fine-Grained Visual Recognition challenge: sub-categories (e.g., Siberian Husky vs. Alaskan Malamute) exhibit subtle visual differences, requiring attention to localized discriminative features such as forehead patterns and tail shape. (b) For a fixed overall retrieval accuracy (70%), different sub-categories require substantially different confidence thresholds, indicating that identical confidence scores imply varying reliability across categories.
Figure 2: Overview of SARE. The framework performs fast prototype-based retrieval to generate candidate categories for a query image, followed by a class-conditional trigger that adaptively invokes fine-grained reasoning when retrieval confidence is insufficient. Experience distilled from past errors is injected as contextual guidance, enabling accurate and efficient training-free FGVR.
Figure 3: Comparison of SARE against baselines on the Stanford Cars dataset. SARE achieves the best balance between accuracy and cost, significantly outperforming the baselines in accuracy at lower inference overhead.
Figure 4: The proportion of samples triggering System 2 across datasets with varying recognition difficulty. The x-axis measures dataset-level difficulty using $100\% - \mathrm{CLIP}_{\text{Top-1}}$ accuracy.
Figure 5: Performance of SARE across different backbone architectures. SARE consistently enhances performance across all backbones, with larger relative gains on lower-performing models.
...and 6 more figures
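The observation in Figure 1(b), that a fixed target accuracy implies different confidence thresholds for different sub-categories, can be made concrete with a small calibration sketch. This is an illustrative assumption, not the paper's statistics-based trigger: the function name `calibrate_thresholds`, the record format `(predicted_class, confidence, is_correct)`, and the rule of picking the lowest threshold that still meets the target are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def calibrate_thresholds(
    records: Iterable[Tuple[str, float, bool]],
    target_acc: float = 0.7,
) -> Dict[str, float]:
    """For each predicted class, find the lowest confidence threshold such that
    accepted predictions (confidence >= threshold) reach `target_acc` accuracy
    on held-out records. Classes that never reach the target get +inf."""
    by_class = defaultdict(list)
    for cls, conf, ok in records:
        by_class[cls].append((conf, ok))

    thresholds = {}
    for cls, items in by_class.items():
        items.sort(reverse=True)  # highest confidence first
        correct = 0
        best = None
        for i, (conf, ok) in enumerate(items, start=1):
            correct += ok
            # Only evaluate at the end of a group of tied confidences, since a
            # threshold at `conf` accepts every sample with that confidence.
            at_group_end = i == len(items) or items[i][0] != conf
            if at_group_end and correct / i >= target_acc:
                best = conf
        thresholds[cls] = best if best is not None else float("inf")
    return thresholds
```

With such thresholds, a class whose retrieval predictions are often wrong at moderate confidence ends up with a stricter threshold than a class whose predictions are reliable, so the same confidence score is treated differently across categories.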
