Multimodal Reinforcement Learning with Agentic Verifier for AI Agents

Reuben Tan, Baolin Peng, Zhengyuan Yang, Hao Cheng, Oier Mees, Theodore Zhao, Andrea Tupini, Isar Meijier, Qianhui Wu, Yuncong Yang, Lars Liden, Yu Gu, Sheng Zhang, Xiaodong Liu, Lijuan Wang, Marc Pollefeys, Yong Jae Lee, Jianfeng Gao

TL;DR

This work introduces Argos, a principled agentic verifier for multimodal reinforcement learning that adaptively selects teacher-derived and rule-based scoring functions to produce dense, verifiable rewards across spatial grounding, temporal grounding, reasoning quality, and final outcomes. By reframing MMRL as a multi-objective optimization problem and training with GRPO, Argos yields state-of-the-art performance across spatial reasoning, visual hallucination reduction, and embodied robotics benchmarks, while mitigating reward hacking observed with outcome-only signals. The authors provide theoretical support via Pareto-optimality analysis and demonstrate a robust data-curation pipeline that generates visually grounded reasoning traces for SFT and RL. Overall, Argos offers a modular, scalable approach to richer reward signals in multimodal agents, with strong empirical results and clear avenues for extension to additional modalities and tasks.

Abstract

Agentic reasoning models trained with multimodal reinforcement learning (MMRL) have become increasingly capable, yet they are almost universally optimized using sparse, outcome-based rewards computed from the final answers. Richer rewards computed from the reasoning tokens can improve learning significantly by providing more fine-grained guidance. However, computing more informative rewards in MMRL beyond outcome-based ones is challenging, since different samples may require different scoring functions and teacher models may themselves provide noisy reward signals. In this paper, we introduce Argos (Agentic Reward for Grounded & Objective Scoring), a principled reward agent for training multimodal reasoning models for agentic tasks. For each sample, Argos selects from a pool of teacher-model-derived and rule-based scoring functions to simultaneously evaluate: (i) final response accuracy, (ii) spatiotemporal localization of referred entities and actions, and (iii) the quality of the reasoning process. We find that by leveraging our agentic verifier across both SFT data curation and RL training, our model achieves state-of-the-art results across multiple agentic tasks, including spatial reasoning, visual hallucination, and robotics and embodied AI benchmarks. Critically, we demonstrate that relying solely on SFT post-training on highly curated reasoning data is insufficient, as agents invariably collapse to ungrounded solutions during RL without our online verification. We also show that our agentic verifier helps reduce reward hacking in MMRL. Finally, we provide a theoretical justification for the effectiveness of Argos through the concept of Pareto optimality.

Paper Structure

This paper contains 44 sections, 5 theorems, 34 equations, 18 figures, 8 tables.

Key Result

Theorem 1

Let $\pi$ be the sampling policy and denote $\beta = \pi(P_\delta)$ as the probability coverage on Pareto optimal solutions. Sample $n$ i.i.d. actions from $\pi$ to form a group $\mathbb{G}$ with $\hat{a} = \arg\max_{a \in \mathbb{G}} \hat{R}(a)$. Then where $C := \frac{\delta^2}{4\sigma^2} \cdot \frac{w_{\min}^2}{w_{\max}^2} > 0$ is a constant.
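
To make the quantities in Theorem 1 concrete, the following is a minimal sketch that evaluates the constant $C$ and mimics the group-selection step $\hat{a} = \arg\max_{a \in \mathbb{G}} \hat{R}(a)$. It rests on assumptions not stated in this excerpt: $\sigma$ is read as the scale of the reward noise, and $w_{\min}$, $w_{\max}$ as the smallest and largest weights of the individual objectives. The function names `pareto_constant` and `select_from_group`, and the toy policy in the demo, are hypothetical and are not taken from the paper.

```python
import random
from typing import Callable, Sequence


def pareto_constant(delta: float, sigma: float, weights: Sequence[float]) -> float:
    """Evaluate C = delta^2 / (4 sigma^2) * (w_min / w_max)^2 from Theorem 1.

    Assumption: sigma is a noise scale and weights are per-objective weights.
    """
    if sigma <= 0 or not weights or min(weights) <= 0:
        raise ValueError("sigma and all weights must be positive")
    w_min, w_max = min(weights), max(weights)
    return (delta ** 2) / (4 * sigma ** 2) * (w_min ** 2) / (w_max ** 2)


def select_from_group(
    sample_action: Callable[[random.Random], float],
    noisy_reward: Callable[[float, random.Random], float],
    n: int,
    rng: random.Random,
) -> float:
    """Draw n i.i.d. actions and return the one with the highest estimated reward."""
    group = [sample_action(rng) for _ in range(n)]
    # max() evaluates the key once per action, so each action gets one noisy score.
    return max(group, key=lambda a: noisy_reward(a, rng))


if __name__ == "__main__":
    # Toy setting (hypothetical): actions are scalars in [0, 1], the true reward
    # is the action itself, and the estimate adds Gaussian noise of scale sigma.
    sigma = 0.1
    rng = random.Random(0)
    chosen = select_from_group(
        sample_action=lambda r: r.random(),
        noisy_reward=lambda a, r: a + r.gauss(0.0, sigma),
        n=8,
        rng=rng,
    )
    print("C =", pareto_constant(delta=0.2, sigma=sigma, weights=[1.0, 0.5]))
    print("selected action:", chosen)
```

Under this reading, $C$ grows when the margin $\delta$ is large relative to the noise $\sigma$ and shrinks when the objective weights are highly unbalanced.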

Figures (18)

  • Figure 1: Multimodal RL with our Argos agentic verifier. We propose to train agentic foundation models using an agentic verifier Argos that adaptively selects different scoring tools based on the training sample during the RL stage. Then, we evaluate the resulting model on multiple agentic benchmarks including embodied task planning and completion as well as spatial reasoning.
  • Figure 2: Verification process. We use the same set of scoring functions for both images and videos. Each response is first parsed to extract information about generated 2D points, temporal segments, reasoning text and answer. Then, the agentic verifier adaptively decides what scoring functions to call based on the extracted information. Finally, we aggregate the scores using a gated aggregation function.
  • Figure 3: Grounded reasoning generation pipeline. (a) Stage I: We extract object, action and event proposals such as 2D boxes for images and video frames as well as temporal segments for videos. (b) Stage II: We use the overlaid images and video frames to prompt a pretrained LMM to generate grounded reasoning traces that explicitly refer to these points. For videos, we also include the frame numbers and their timestamps in the query. (c) Stage III: Our agentic verifier adaptively scores each trace using multi-objective rewards (e.g., visual grounding and answer accuracy) and filters out samples with low-quality generations. In the image example with the bears, the sample is filtered out due to low visual grounding accuracy despite predicting the correct answer.
  • Figure 4: We run a small-scale comparison to ablate the effectiveness of Argos (agentic) compared to using only outcome rewards (non-agentic). We evaluate on a separate dataset for both variants.
  • Figure 5: Instruction template for extracting entities and interactions from generated image-level rollouts.
  • ...and 13 more figures
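
The verification process described for Figure 2 (parse the response, choose scoring functions based on what was extracted, then aggregate with a gate) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation. The response tags (`<point>`, `<segment>`, `<think>`, `<answer>`), the rule-based scorers, the optional `reasoning_scorer` hook standing in for a teacher model, and the specific gating rule are all assumptions introduced here.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Optional

NUM = r"(\d+(?:\.\d+)?)"  # a non-negative integer or decimal


@dataclass
class Parsed:
    points: list[tuple[float, float]] = field(default_factory=list)
    segments: list[tuple[float, float]] = field(default_factory=list)
    reasoning: str = ""
    answer: str = ""


def parse_response(text: str) -> Parsed:
    """Extract 2D points, temporal segments, reasoning and answer (hypothetical tag format)."""
    points = re.findall(rf"<point>\s*{NUM}\s*,\s*{NUM}\s*</point>", text)
    segments = re.findall(rf"<segment>\s*{NUM}\s*-\s*{NUM}\s*</segment>", text)
    think = re.search(r"<think>(.*?)</think>", text, re.S)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.S)
    return Parsed(
        points=[(float(x), float(y)) for x, y in points],
        segments=[(float(s), float(e)) for s, e in segments],
        reasoning=think.group(1).strip() if think else "",
        answer=answer.group(1).strip() if answer else "",
    )


def point_score(points, box) -> float:
    """Fraction of predicted points that fall inside the reference box (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    return sum(x0 <= x <= x1 and y0 <= y <= y1 for x, y in points) / len(points)


def temporal_iou(pred, ref) -> float:
    """Intersection over union of two (start, end) segments."""
    inter = max(0.0, min(pred[1], ref[1]) - max(pred[0], ref[0]))
    # The hull equals the union whenever the segments overlap; otherwise inter is 0.
    hull = max(pred[1], ref[1]) - min(pred[0], ref[0])
    return inter / hull if hull > 0 else 0.0


def gated_aggregate(scores: dict, gate: float) -> float:
    """Assumed gate: process scores only contribute once the answer is accurate enough."""
    acc = scores["accuracy"]
    process = [v for k, v in scores.items() if k != "accuracy"]
    if acc < gate or not process:
        return acc
    return 0.5 * acc + 0.5 * sum(process) / len(process)


def verify(text: str, answer: str, box=None, segment=None,
           reasoning_scorer: Optional[Callable[[str], float]] = None,
           gate: float = 0.5):
    """Call only the scorers that match what the response actually contains."""
    p = parse_response(text)
    scores = {"accuracy": float(p.answer.lower() == answer.lower())}
    if p.points and box is not None:
        scores["spatial"] = point_score(p.points, box)
    if p.segments and segment is not None:
        scores["temporal"] = max(temporal_iou(s, segment) for s in p.segments)
    if p.reasoning and reasoning_scorer is not None:
        scores["reasoning"] = reasoning_scorer(p.reasoning)
    return scores, gated_aggregate(scores, gate)
```

For example, `verify("<think>The cup is at <point>12, 8</point>.</think><answer>cup</answer>", "cup", box=(10, 5, 20, 15))` calls the accuracy and spatial scorers only, since the response contains no temporal segment. With the gate assumed here, a correct answer paired with a point outside the reference box receives a lower reward than a correct and well-grounded one, which mirrors the filtered bear example in Figure 3.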

Theorems & Definitions (15)

  • Definition 1: $\delta$-Pareto Domination
  • Definition 2: $\delta$-Pareto Optimality
  • Theorem 1: Global Pareto Guarantee
  • Definition 1: $\delta$-Pareto Domination
  • Definition 2: $\delta$-Pareto Optimality
  • Lemma 1
  • proof
  • Lemma 2: Batch-Level Approximate Pareto Preservation
  • proof
  • Theorem 1: Global Pareto Guarantee
  • ...and 5 more