Connecting the Dots: Training-Free Visual Grounding via Agentic Reasoning

Liqin Luo, Guangyao Chen, Xiawu Zheng, Yongxing Dai, Yixiong Zou, Yonghong Tian

TL;DR

GroundingAgent introduces a fully training-free visual grounding framework that fuses open-vocabulary detectors, multimodal LLMs, and an LLM-based, step-by-step reasoning process to perform zero-shot referring expression grounding. By generating semantically rich candidate regions from a global caption and the query, enriching each candidate with regional descriptions, and applying a Chain-of-Thought driven selection, the method achieves state-of-the-art zero-shot accuracy on RefCOCO, RefCOCO+, and RefCOCOg. Crucially, replacing MLLM-generated captions with the original query elevates selection accuracy to roughly 90%, underscoring the pivotal role of robust semantic reasoning. The approach is highly interpretable, modular, and demonstrates robustness across detectors and LLMs, offering a practical baseline for training-free grounding and potential extension to segmentation via lightweight refinement.

Abstract

Visual grounding, the task of linking textual queries to specific regions within images, plays a pivotal role in vision-language integration. Existing methods typically rely on extensive task-specific annotations and fine-tuning, limiting their ability to generalize effectively to novel or out-of-distribution scenarios. To address these limitations, we introduce GroundingAgent, a novel agentic visual grounding framework that operates without any task-specific fine-tuning. GroundingAgent employs a structured, iterative reasoning mechanism that integrates pretrained open-vocabulary object detectors, multimodal large language models (MLLMs), and large language models (LLMs) to progressively refine candidate regions through joint semantic and spatial analyses. Remarkably, GroundingAgent achieves an average zero-shot grounding accuracy of 65.1% on widely used benchmarks (RefCOCO, RefCOCO+, RefCOCOg), entirely without fine-tuning. Furthermore, by substituting MLLM-generated captions with the original query texts, the accuracy at the selection stage alone reaches approximately 90%, closely matching supervised performance and underscoring the critical role of LLM reasoning capabilities. GroundingAgent also offers strong interpretability, transparently illustrating each reasoning step and providing clear insights into its decision-making process.
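The pipeline described in the abstract (global caption, candidate boxes, per-region descriptions, LLM-based selection) can be sketched as a short Python function. This is a minimal illustration only, not the authors' implementation: `ground`, `Candidate`, and the four callables (`caption_fn`, `detect_fn`, `describe_fn`, `select_fn`) are hypothetical names, and the toy stand-ins at the bottom replace the real MLLM, detector, and LLM with trivial functions so that the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed pixel coordinates


@dataclass
class Candidate:
    box: Box
    description: str  # region-level description (an MLLM would produce this)


def ground(
    image,
    query: str,
    caption_fn: Callable[[object], str],
    detect_fn: Callable[[object, str, str], List[Box]],
    describe_fn: Callable[[object, Box], str],
    select_fn: Callable[[str, str, List[Candidate]], int],
) -> Box:
    """Hypothetical training-free grounding loop; all four callables are assumptions."""
    global_caption = caption_fn(image)               # scene-level description
    boxes = detect_fn(image, global_caption, query)  # open-vocabulary candidates
    if not boxes:
        raise ValueError("detector returned no candidate boxes")
    candidates = [Candidate(b, describe_fn(image, b)) for b in boxes]
    index = select_fn(query, global_caption, candidates)  # reasoning-based choice
    return candidates[index].box


# Toy stand-ins so the sketch runs without any model (purely illustrative).
toy_regions = {
    (10, 90, 110, 200): "a brown sofa",
    (120, 80, 200, 220): "a white chair next to a fireplace",
}


def toy_select(query: str, caption: str, candidates: List[Candidate]) -> int:
    # Word overlap is a crude placeholder for the LLM's step-by-step reasoning.
    words = set(query.lower().split())
    overlaps = [len(words & set(c.description.lower().split())) for c in candidates]
    return overlaps.index(max(overlaps))


result = ground(
    image=None,
    query="the white chair by the fireplace",
    caption_fn=lambda img: "a living room with a sofa, a chair and a fireplace",
    detect_fn=lambda img, caption, query: list(toy_regions),
    describe_fn=lambda img, box: toy_regions[box],
    select_fn=toy_select,
)
print(result)
```

Passing the models in as callables mirrors the modularity the summary emphasizes: a detector or LLM can be swapped without touching the control flow.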

Paper Structure

This paper contains 35 sections, 9 equations, 9 figures, 4 tables, 1 algorithm.

Figures (9)

  • Figure 1: Qualitative comparison and reasoning steps for the visual grounding task. Given the same image and query, the baseline GPT-4o prediction (red box) incorrectly selects the pitcher (left). Our method performs several iterative instance proposals to find the correct object through visual reasoning.
  • Figure 2: Illustration of our step-by-step reasoning framework for zero-shot referring expression comprehension. Given an input image and a textual query (e.g., "the white chair by the fireplace"), the system first extracts a global description $\mathbf{T}_{\text{global}}$ of the scene and generates candidate bounding boxes ($\{\mathbf{b}_i\}$) through an object detector. For each candidate region $\mathbf{b}_i$, an MLLM is employed to generate a fine-grained semantic description $\mathbf{d}_i$, capturing detailed visual attributes and contextual cues. These descriptions, along with the global context and the original query, are passed to an LLM, which performs step-by-step reasoning to refine its understanding of each candidate. In this example, four reasoning steps guide the LLM to identify and confirm the correct bounding box for the white chair, ensuring consistency with the spatial layout and visual attributes described in the query. The final prediction $\mathbf{b}_{\text{pred}}$ is chosen from the candidate set as the best match for the referring expression.
  • Figure 3: The recall of candidate generation on RefCOCO.
  • Figure 4: Representative failure cases. (a) Occlusion hallucination: confusion caused by partial occlusion. (b) Merged instances: separate objects grouped incorrectly. (c) Incomplete box: partial coverage of the target. (d) Ambiguous query: unclear description prevents unique identification.
  • Figure 5: Qualitative example illustrating how our GroundingAgent method effectively handles spatial relationships and fine-grained visual descriptions. Given the query "oranges closest to banana middle", our framework first generates multiple candidate bounding boxes for oranges and bananas, each enriched with detailed semantic descriptions provided by the MLLM. Subsequently, the LLM systematically analyzes spatial coordinates and descriptive semantics, explicitly reasoning through the proximity of each orange to the middle banana to identify the closest instance correctly.
  • ...and 4 more figures
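
The spatial reasoning described for Figure 5 ("oranges closest to banana middle") amounts to comparing box positions. In the paper the LLM reasons over coordinates and descriptions in text; the sketch below only makes the underlying geometry explicit under stated assumptions. The helper names (`center`, `middle_box`, `closest_to`), the choice of box centers and Euclidean distance as the proximity measure, the reading of "middle" as the median horizontal position, and all coordinates are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def center(box: Box) -> Tuple[float, float]:
    """Geometric center of a box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)


def middle_box(boxes: List[Box]) -> Box:
    """Box with the median horizontal center (one reading of 'middle')."""
    ordered = sorted(boxes, key=lambda b: center(b)[0])
    return ordered[len(ordered) // 2]


def closest_to(anchor: Box, boxes: List[Box]) -> Box:
    """Box whose center is nearest to the anchor's center (Euclidean distance)."""
    ax, ay = center(anchor)
    return min(
        boxes,
        key=lambda b: math.hypot(center(b)[0] - ax, center(b)[1] - ay),
    )


# Made-up candidate boxes for illustration only.
bananas = [(10, 50, 60, 90), (100, 50, 150, 90), (200, 50, 250, 90)]
oranges = [(20, 120, 50, 150), (110, 110, 140, 140), (240, 10, 270, 40)]

anchor = middle_box(bananas)
prediction = closest_to(anchor, oranges)
print(anchor, prediction)
```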