Large Reward Models: Generalizable Online Robot Reward Generation with Vision-Language Models

Yanru Wu, Weiduo Yuan, Ang Qi, Vitor Guizilini, Jiageng Mao, Yue Wang

Abstract

Reinforcement Learning (RL) has shown great potential in refining robotic manipulation policies, yet its efficacy remains bottlenecked by the difficulty of designing generalizable reward functions. In this paper, we propose a framework for online policy refinement that adapts foundation Vision-Language Models (VLMs) into online reward generators. We develop a robust, scalable reward model based on a state-of-the-art VLM, trained on a large-scale, multi-source dataset encompassing real-world robot trajectories, human-object interactions, and diverse simulated environments. Unlike prior approaches that evaluate entire trajectories post hoc, our method uses the VLM to formulate a multifaceted reward signal comprising process, completion, and temporal contrastive rewards based on current visual observations. Starting from a base policy trained via Imitation Learning (IL), we use these VLM rewards to guide the policy to correct sub-optimal behaviors in a closed-loop manner. We evaluate our framework on challenging long-horizon manipulation benchmarks requiring sequential execution and precise control. Crucially, our reward model operates in a purely zero-shot manner within these test environments. Experimental results show that our method significantly improves the success rate of the initial IL policy within just 30 RL iterations, indicating high sample efficiency. These results suggest that VLM-generated signals can provide reliable feedback to resolve execution errors, effectively eliminating the need for manual reward engineering and facilitating efficient online refinement for robot learning.

Paper Structure

This paper contains 12 sections, 10 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: The overview of our method. Our framework leverages specialized Large Reward Model (LRM) generation to facilitate online policy refinement for high-precision robotic control. Initially, diverse video trajectories from real-robot corpora, human-object interactions, and simulated benchmarks are processed to fine-tune a Qwen3-VL-8B-Instruct backbone via LoRA. This specialization yields three independent reward modalities: the Temporal Contrastive Reward ($r_{cont}$) for relative ranking, the Absolute Progress Reward ($r_{prog}$) for continuous estimation, and the Task Completion Reward ($r_{comp}$) for terminal state anchoring. During active interaction, the specialized LRM maps visual observations $I_t$ and task descriptions $d$ into a dense reward stream, which the policy $\pi_\phi$ utilizes to autonomously refine its control behaviors for high-precision manipulation.
  • Figure 2: Cumulative accuracy at varying tolerance thresholds ($\pm\Delta$). Our LRM consistently outperforms the baseline across all thresholds, with the largest gains in the high-precision regime ($\pm 0.2$).
  • Figure 3: Comparison of real-world robot rollouts between the SFT baseline and RL finetuning with LRM. The SFT baseline fails to complete the task, mistakenly placing the toy giraffe beside the target bowl. In contrast, RL finetuning with LRM successfully places the giraffe inside the bowl, demonstrating the effectiveness of LRM-driven policy improvement.
  • Figure 4: Robot setup for the pick and place task.
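
The cumulative-accuracy metric named in the Figure 2 caption can be illustrated with a minimal sketch. It assumes that "accuracy at tolerance $\pm\Delta$" means the fraction of predictions whose absolute error against the label is at most $\Delta$; the source does not spell out the definition. The function name `cumulative_accuracy` and all sample values are hypothetical, not taken from the paper.

```python
from typing import Dict, Sequence


def cumulative_accuracy(
    predicted: Sequence[float],
    target: Sequence[float],
    tolerances: Sequence[float],
) -> Dict[float, float]:
    """Return, for each tolerance delta, the share of predictions
    whose absolute error |predicted - target| is at most delta."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    if not predicted:
        raise ValueError("at least one sample is required")
    errors = [abs(p - t) for p, t in zip(predicted, target)]
    return {
        delta: sum(e <= delta for e in errors) / len(errors)
        for delta in tolerances
    }


# Hypothetical progress estimates and labels (illustrative values only).
predicted_progress = [0.0, 0.5, 1.0, 0.25]
true_progress = [0.0, 0.25, 0.5, 0.25]

curve = cumulative_accuracy(predicted_progress, true_progress, [0.125, 0.25, 0.5])
print(curve)  # {0.125: 0.5, 0.25: 0.75, 0.5: 1.0}
```

Under this definition the curve cannot decrease as $\Delta$ grows, so the left end of the curve (small $\Delta$) is where a more precise estimator separates from a coarser one.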