VG-Refiner: Towards Tool-Refined Referring Grounded Reasoning via Agentic Reinforcement Learning

Yuji Wang, Wenlong Liu, Jingxuan Niu, Haoji Zhang, Yansong Tang

TL;DR

This work tackles hallucinations in tool-assisted visual grounding by introducing VG-Refiner, a two-stage think-rethink framework that explicitly analyzes and refines external tool outputs. Trained with agentic RL via GRPO and guided by a refinement reward, the model learns to accept reliable tool feedback or correct faulty predictions in tool-refined referring grounded reasoning (TrRGR), improving referring grounding under noisy tools. The PiTER protocol and two refinement metrics (CCR and NSRI) provide a fair, standardized evaluation of refinement ability across tool conditions. Under this evaluation, VG-Refiner shows state-of-the-art performance on the RefCOCO series while preserving general QA capabilities, using only limited task-specific data. Overall, VG-Refiner offers a robust foundation for future multi-round tool-calling systems in vision-language reasoning.

Abstract

Tool-integrated visual reasoning (TiVR) has demonstrated great potential in enhancing multimodal problem-solving. However, existing TiVR paradigms mainly focus on integrating various visual tools through reinforcement learning, while neglecting to design effective response mechanisms for handling unreliable or erroneous tool outputs. This limitation is particularly pronounced in referring and grounding tasks, where inaccurate predictions from detection tools often mislead TiVR models into generating hallucinated reasoning. To address this issue, we propose VG-Refiner, the first framework aimed at tool-refined referring grounded reasoning. Technically, we introduce a two-stage think-rethink mechanism that enables the model to explicitly analyze and respond to tool feedback, along with a refinement reward that encourages effective correction in response to poor tool results. In addition, we propose two new metrics and establish fair evaluation protocols to systematically measure the refinement ability of current models. We use a small amount of task-specific data to enhance the refinement capability of VG-Refiner, achieving a significant improvement in accuracy and correction ability on referring and reasoning grounding benchmarks while preserving the general capabilities of the pretrained model.
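
The two-stage think-rethink mechanism described above can be pictured as a simple control flow: the model first reasons on its own, then queries the tool, then re-thinks with the tool output in view. The following is a minimal sketch under stated assumptions, not the authors' implementation. The names `think`, `call_tool`, `rethink`, and `RefinedResult` are hypothetical, as is the assumption that each stage emits a single bounding box.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2); format is an assumption


@dataclass
class RefinedResult:
    think_box: Box       # the model's own first-pass prediction
    tool_box: Box        # the box returned by the external tool
    final_box: Box       # the prediction after re-thinking
    accepted_tool: bool  # True when the final box equals the tool's box


def think_rethink(
    image: Any,
    expression: str,
    think: Callable[[Any, str], Box],               # hypothetical stage-1 model call
    call_tool: Callable[[Any, str], Box],           # hypothetical grounding tool call
    rethink: Callable[[Any, str, Box, Box], Box],   # hypothetical stage-2 model call
) -> RefinedResult:
    # Stage 1: the model reasons about the referring expression by itself.
    think_box = think(image, expression)
    # Tool call: an external grounding tool supplies a reference prediction.
    tool_box = call_tool(image, expression)
    # Stage 2: the model examines the tool output and either accepts or corrects it.
    final_box = rethink(image, expression, think_box, tool_box)
    return RefinedResult(think_box, tool_box, final_box, final_box == tool_box)
```

Passing the three stages in as callables keeps the sketch independent of any particular model or tool; in practice all model calls would belong to one multi-turn conversation.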

Paper Structure

This paper contains 17 sections, 7 equations, 8 figures, 11 tables.

Figures (8)

  • Figure 1: In the left case, VG-Refiner performs explicit reasoning over the tool outputs via the CoT process, whereas REVPT merely confirms the tool feedback without any analytical examination and therefore fails to detect tool-induced errors. Qwen2.5-VL-7B, the base model of both REVPT and VG-Refiner, can locate the true object on its own without CoT. The right part shows that VG-Refiner achieves grounding accuracy comparable to the 32B model, averaged over five test splits of the RefCOCO series, under different tool conditions in the PiTER evaluation protocol.
  • Figure 2: The overall framework of VG-Refiner. The reward design takes the quality of the tool feedback, $\text{IoU}_t$, into account: depending on that quality, different reward levels encourage the model either to refine the tool's incorrect results or to accept its reliable ones. We use GRPO to optimize the policy model, which produces $G$ rollouts during training. After the think process, the model queries a referring visual toolkit for additional reference outputs. In GRPO, a KL-divergence term constrains the policy's deviation from the frozen reference model to ensure stable optimization.
  • Figure 3: User prompt for the PiTER evaluation process. This prompt is shared across all model types, requiring the model to produce grounding results in a JSON format through a single-stage conversation, without any CoT reasoning or tool interaction. The placeholder $\{\text{Question}\}$ is replaced with the referring expression, while $\{\text{tool results}\}$ is substituted with the feedback from either a strong or weak tool corresponding to the given question.
  • Figure 4: Visualization of VG-Refiner handling three representative types of tool-induced errors in TrRGR. The first two grounding error categories often occur with a strong tool, EVF-SAM [evf-sam], whereas the third occurs with Grounding DINO T [grounding-dino] that has not been fine-tuned.
  • Figure 5: Visualization of the overall reasoning paradigm, first performing self-thinking and then re-thinking based on the tool outputs.
  • ...and 3 more figures
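
The Figure 2 caption describes a reward that depends on the tool quality $\text{IoU}_t$ and a GRPO update over $G$ rollouts. The sketch below illustrates that idea only; it is not the paper's reward. The 0.5 threshold, the reward values 1.0 and 2.0, and the function names are hypothetical. The advantage computation is the usual group-relative normalization associated with GRPO, with the population standard deviation chosen here as an assumption.

```python
from statistics import mean, pstdev
from typing import List, Sequence

Box = Sequence[float]  # (x1, y1, x2, y2), with x2 >= x1 and y2 >= y1


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def refinement_reward(
    pred: Box,
    tool: Box,
    gt: Box,
    thresh: float = 0.5,          # hypothetical "correct" threshold
    accept_reward: float = 1.0,   # hypothetical: correct answer, tool was reliable
    correct_reward: float = 2.0,  # hypothetical: correct answer, tool was wrong
) -> float:
    """Illustrative tiered reward keyed on the tool quality IoU_t."""
    iou_t = iou(tool, gt)
    if iou(pred, gt) < thresh:
        return 0.0  # the final prediction is wrong, whatever the tool said
    # Fixing a poor tool result earns more than keeping a good one.
    return accept_reward if iou_t >= thresh else correct_reward


def group_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize each rollout's reward against its group of G rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

For example, with rewards collected from the $G$ rollouts of one prompt, `group_advantages` returns positive values for rollouts that beat the group mean and negative values for those below it; a group whose rewards are all equal yields zero advantage for every rollout.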