ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning

Shengyuan Ding, Xinyu Fang, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiangyu Zhao, Haodong Duan, Xiaoyi Dong, Jianze Liang, Bin Wang, Conghui He, Dahua Lin, Jiaqi Wang

TL;DR

ARM-Thinker introduces an agentic multimodal reward model that actively uses external tools to ground judgments in verifiable evidence through a think–act–verify loop. It combines data-efficient SFT and a two-stage GRPO training regime to learn when and how to invoke tools for accurate, evidence-backed judgments, and it introduces ARMBench-VL to evaluate such agentic rewards across fine-grained perception, long-document retrieval, and instruction-following tasks. Empirical results show substantial gains on reward-modeling benchmarks, tool-use tasks, and general multimodal reasoning, outperforming strong LVLM baselines and specialized tool-use models. The work demonstrates that agentic verification improves accuracy, interpretability, and generalization, enabling more reliable multimodal alignment and reasoning systems.

Abstract

Reward models are critical for aligning vision-language systems with human preferences, yet current approaches suffer from hallucination, weak visual grounding, and an inability to use tools for verification, limiting their reliability on complex multimodal reasoning tasks. We present ARM-Thinker, an Agentic multimodal Reward Model that autonomously invokes external tools (e.g., image cropping, document page retrieval) to ground judgments in verifiable evidence, replacing static, non-interactive reward scoring. This enables the model to verify fine-grained visual details, cross-reference multi-page evidence, and validate reasoning claims, capabilities that are absent in existing reward models. We train ARM-Thinker with multi-stage reinforcement learning, jointly optimizing tool-calling decisions and judgment accuracy. To evaluate agentic reward modeling, we introduce ARMBench-VL, comprising three benchmarks that assess fine-grained visual grounding (image-level tools), multi-page document understanding (retrieval tools), and instruction following (text-level verification). ARM-Thinker achieves +16.2% average improvement on reward modeling benchmarks, +9.6% on tool-use tasks, and outperforms baselines on multimodal math and logical reasoning benchmarks. Our results demonstrate that agentic capabilities significantly enhance both accuracy and interpretability of reward models.
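The tool-grounded judging described in the abstract can be illustrated with a minimal sketch of a think, act, observe loop. Everything named here (`Action`, `Context`, `run_agent_loop`, `toy_policy`, the `retrieve_page` tool) is a hypothetical stand-in chosen for illustration and is not taken from the paper; a real system would replace `toy_policy` with the reward model itself.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Action:
    """Hypothetical policy output: a tool call or a final judgment."""
    kind: str           # "tool" or "final"
    name: str = ""      # tool name when kind == "tool"
    argument: str = ""  # tool argument, or the verdict when kind == "final"


@dataclass
class Context:
    """Indexed record of the question and every tool observation."""
    entries: List[Tuple[int, str]] = field(default_factory=list)

    def add(self, text: str) -> int:
        index = len(self.entries)
        self.entries.append((index, text))
        return index


def run_agent_loop(
    policy: Callable[[Context], Action],
    tools: Dict[str, Callable[[str], str]],
    question: str,
    max_steps: int = 4,
) -> Tuple[str, int]:
    """Run the loop until a final verdict or until the step budget runs out."""
    context = Context()
    context.add(question)
    tool_calls = 0
    for _ in range(max_steps):
        action = policy(context)  # think: choose the next step
        if action.kind == "final":
            return action.argument, tool_calls
        tool = tools.get(action.name)
        # act: call the tool, or report that it does not exist
        observation = tool(action.argument) if tool else f"unknown tool: {action.name}"
        tool_calls += 1
        context.add(observation)  # observe: store the evidence under a new index
    return "undecided", tool_calls


def toy_policy(context: Context) -> Action:
    """Assumed toy policy: retrieve one page, then judge from that evidence."""
    if len(context.entries) == 1:
        return Action("tool", "retrieve_page", "3")
    _, evidence = context.entries[-1]
    return Action("final", argument="A" if "42" in evidence else "B")


toy_tools = {"retrieve_page": lambda page: f"page {page}: the reported total is 42"}
verdict, calls = run_agent_loop(toy_policy, toy_tools, "Which response states the correct total?")
print(verdict, calls)  # prints "A 1"
```

The step budget (`max_steps`) is an assumption added so the sketch always terminates; the source does not state how the loop is bounded.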

Paper Structure

This paper contains 30 sections, 5 equations, 11 figures, 6 tables.

Figures (11)

  • Figure 1: Overview of ARM-Thinker. (a) Case Comparison: Given a complex document QA task, ARM-Thinker correctly identifies the answer by autonomously invoking the retrieval tool, while the baseline model provides an incorrect response. (b) ARMBench-VL: It evaluates reward models across three task types, each requiring specialized tool use (image manipulation, document retrieval, instruction verification). (c) Performance of ARM-Thinker: The agentic capability enables substantial gains across multiple benchmarks.
  • Figure 2: Overview of ARM-Thinker's architecture and training pipeline. (a) Agent Loop: ARM-Thinker follows a think-act-observe paradigm, maintaining indexed context for texts and images while iteratively invoking tools from the toolkit (image zoom-in, document retrieval, instruction validators) until producing the final answer. (b) Pipeline: the pipeline starts with (1) SFT & Cold Start using difficulty-filtered data, followed by (2) two-stage Group Relative Policy Optimization (GRPO) that first encourages correct tool calls (Stage 1) and then refines for accuracy with verifiable rewards that balance correctness and tool efficiency (Stage 2).
  • Figure 3: Representative examples from ARMBench-VL. Each block shows the multimodal context, candidate responses, and available tools for one of the three tracks in ARMBench-VL: Fine-grained Perception (image crop/zoom tools for local visual details), Multimodal Long Document QA (page-retrieval tools), and Multimodal Instruction Following (instruction-checking tools).
  • Figure 4: Ablation study comparing three reward function designs during GRPO training. Left: Evaluation accuracy over training steps. Right: Average tool-call frequency over training steps. Our ARM-Thinker reward (blue) achieves the highest accuracy while maintaining stable tool usage, avoiding both the under-use pitfall of accuracy-only rewards (orange) and the over-use pitfall of fixed tool rewards (green).
  • Figure 5: Single judge for Instruction Following Task in ARMBench-VL
  • ...and 6 more figures
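
The GRPO stage mentioned in the Figure 2 and Figure 4 captions scores each sampled rollout relative to the other rollouts for the same prompt. The sketch below shows only that standard group-relative normalization, (reward minus group mean) divided by group standard deviation. The `toy_reward` function and its `tool_cost` parameter are hypothetical placeholders for "balancing correctness and tool efficiency" and are not the reward used in the paper, whose exact form is not given in this summary.

```python
from statistics import mean, pstdev
from typing import List


def toy_reward(correct: bool, tool_calls: int, tool_cost: float = 0.1) -> float:
    """Hypothetical reward: 1 for a correct judgment, minus a small cost per tool call."""
    return (1.0 if correct else 0.0) - tool_cost * tool_calls


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against its own sampling group, as in GRPO.

    Population standard deviation is used here; implementations differ on
    population versus sample deviation, so treat that choice as an assumption.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four assumed rollouts for one prompt: (judgment correct?, number of tool calls)
rollouts = [(True, 1), (True, 3), (False, 0), (False, 2)]
rewards = [toy_reward(correct, calls) for correct, calls in rollouts]
advantages = group_relative_advantages(rewards)
for rollout, advantage in zip(rollouts, advantages):
    print(rollout, round(advantage, 3))
```

Under this toy reward, a correct rollout with fewer tool calls receives a larger advantage than a correct rollout with more calls, which is one simple way to discourage the tool over-use that the Figure 4 caption attributes to fixed tool rewards.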