Seeing as Experts Do: A Knowledge-Augmented Agent for Open-Set Fine-Grained Visual Understanding

Junhan Chen, Zilu Zhou, Yujun Tong, Dongliang Chang, Yitao Luo, Zhanyu Ma

TL;DR

This paper presents the Knowledge-Augmented Fine-Grained Reasoning Agent (KFRA), a unified framework that transforms fine-grained perception into evidence-driven reasoning, enabling factual, interpretable, and task-agnostic reasoning across diverse fine-grained scenarios.

Abstract

Fine-grained visual understanding is shifting from static classification to knowledge-augmented reasoning, where models must justify as well as recognise. Existing approaches remain limited by closed-set taxonomies and single-label prediction, leading to significant degradation under open-set or context-dependent conditions. We present the Knowledge-Augmented Fine-Grained Reasoning Agent (KFRA), a unified framework that transforms fine-grained perception into evidence-driven reasoning. KFRA operates through a three-stage closed reasoning loop that emulates expert analysis. It first performs open-vocabulary detection and web-scale retrieval to generate category hypotheses. It then conducts discriminative regions localisation by aligning textual knowledge with visual evidence through a global-to-local focusing mechanism. Finally, it integrates all multimodal evidence within a large multimodal model to perform interpretable reasoning. Unlike existing agents that treat retrieval and reasoning as independent processes, KFRA establishes a retrieval-grounding coupling that converts retrieved knowledge into spatially grounded evidence for verification. This design enables factual, interpretable, and task-agnostic reasoning across diverse fine-grained scenarios. To evaluate this capability, we construct FGExpertBench, a benchmark designed to assess reasoning depth and cross-task generalisation across six knowledge dimensions. Extensive experiments demonstrate that KFRA consistently surpasses both standalone large multimodal models and current agent frameworks, achieving up to 19 percent improvement in reasoning accuracy and delivering evidence-grounded interpretability in open-set fine-grained visual understanding.
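
The three-stage loop described in the abstract can be outlined in code. The following is a minimal sketch under stated assumptions, not the authors' implementation: the names `Evidence`, `run_loop`, `propose`, `retrieve`, `ground`, and `reason` are hypothetical, each stage is passed in as a plain callable, and the toy stand-ins at the bottom use invented values only to make the control flow runnable. In particular, the toy stage 3 simply picks the hypothesis with the most grounded cues, whereas the paper uses a large multimodal model for that step.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # hypothetical (x0, y0, x1, y1) pixel box


@dataclass
class Evidence:
    """One category hypothesis with its retrieved cues and grounded regions."""
    category: str
    knowledge: List[str]                                   # textual cues for this hypothesis
    regions: Dict[str, Box] = field(default_factory=dict)  # cue -> localised region


def run_loop(
    image: object,
    question: str,
    propose: Callable[[object], List[str]],           # stage 1: candidate list generation
    retrieve: Callable[[str], List[str]],             # stage 2a: knowledge per hypothesis
    ground: Callable[[object, str], Optional[Box]],   # stage 2b: cue -> image region
    reason: Callable[[str, List[Evidence]], str],     # stage 3: evidence-guided inference
) -> Tuple[str, List[Evidence]]:
    """Couple retrieval and grounding, then reason over the collected evidence."""
    evidence: List[Evidence] = []
    for category in propose(image):
        item = Evidence(category, retrieve(category))
        for cue in item.knowledge:
            box = ground(image, cue)
            if box is not None:          # keep only cues that were found in the image
                item.regions[cue] = box
        evidence.append(item)
    return reason(question, evidence), evidence


# --- Toy stand-ins (assumptions for illustration only) ---
def toy_propose(image: object) -> List[str]:
    return ["indigo bunting", "blue grosbeak"]


def toy_retrieve(category: str) -> List[str]:
    cues = {
        "indigo bunting": ["small conical bill"],
        "blue grosbeak": ["rusty wing bars", "heavy bill"],
    }
    return cues[category]


def toy_ground(image: object, cue: str) -> Optional[Box]:
    found = {"small conical bill": (40, 22, 58, 34)}  # invented coordinates
    return found.get(cue)


def toy_reason(question: str, evidence: List[Evidence]) -> str:
    # Stand-in for the multimodal model: prefer the best-grounded hypothesis.
    return max(evidence, key=lambda item: len(item.regions)).category


answer, evidence = run_loop(
    image=None,
    question="Which species is shown?",
    propose=toy_propose,
    retrieve=toy_retrieve,
    ground=toy_ground,
    reason=toy_reason,
)
```

Passing the stages as callables keeps the sketch independent of any particular detector, search backend, or model; the returned `evidence` list is what would make the final answer inspectable, since each retained cue is tied to a region.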

Paper Structure

This paper contains 11 sections, 10 equations, 6 figures, and 4 tables.

Figures (6)

  • Figure 1: Overview of the Knowledge-Augmented Fine-Grained Reasoning Agent (KFRA). The agent operates through a three-stage closed reasoning loop that emulates expert analysis. (1) Candidate List Generation: KFRA performs open-vocabulary detection and web-scale retrieval to propose category hypotheses. (2) Discriminative Regions Localisation: For each hypothesis, it retrieves relevant textual knowledge and aligns distinctive cues such as colour, structure, or behaviour with corresponding visual regions through a global-to-local focusing mechanism. (3) Knowledge and Region Guided Inference: The agent integrates all multimodal evidence to perform factual and interpretable reasoning across objects and tasks.
  • Figure 2: Comparison of Fine-Grained Paradigms. KFRA advances fine-grained understanding from static classification to evidence-driven reasoning, leveraging dynamic knowledge and retrieval–grounding coupling to achieve factual, interpretable, and task-agnostic understanding.
  • Figure 3: Overview of FGExpertBench. The benchmark introduces a structured taxonomy of fine-grained visual understanding tasks that span multiple perception and reasoning dimensions. Each category includes representative examples illustrating its task type, while the overall design enables systematic evaluation of large multimodal models in terms of reasoning depth, cross-task generalisation, and evidence grounding.
  • Figure 4: The construction pipeline of FGExpertBench. We first collect diverse fine-grained categories and then retrieve corresponding images through web search. GPT-4o is then used to identify professional communities potentially relevant to the image content, and domain experts verify these professions and specify the knowledge scopes each one focuses on. Based on the curated knowledge, GPT-4o generates domain-oriented question-answer pairs. After data filtering and quality control, the finalised benchmark is produced for evaluating large multimodal models on fine-grained visual understanding.
  • Figure 5: Qualitative results of all compared methods on FGExpertBench. KFRA produces the most accurate answers across diverse fine-grained scenarios, highlighting its strong capability in visual reasoning and multimodal understanding.
  • ...and 1 more figure