Generalizable Dense Reward for Long-Horizon Robotic Tasks

Silong Yong, Stephen Sheng, Carl Qi, Xiaojie Wang, Evan Sheehan, Anurag Shivaprasad, Yaqi Xie, Katia Sycara, Yesh Dattatreya

Abstract

Existing robotic foundation policies are trained primarily via large-scale imitation learning. While such models demonstrate strong capabilities, they often struggle with long-horizon tasks due to distribution shift and error accumulation. Reinforcement learning (RL) can finetune these models, but it does not transfer across diverse tasks without manual reward engineering. We propose VLLR, a dense reward framework combining (1) an extrinsic reward from Large Language Models (LLMs) and Vision-Language Models (VLMs) for task progress recognition, and (2) an intrinsic reward based on policy self-certainty. VLLR uses an LLM to decompose each task into verifiable subtasks and a VLM to estimate progress toward them; the resulting progress signal initializes the value function during a brief warm-up phase, avoiding prohibitive VLM inference cost during full training, while self-certainty provides per-step intrinsic guidance throughout PPO finetuning. Ablation studies reveal complementary benefits: VLM-based value initialization primarily improves task completion efficiency, while self-certainty primarily enhances success rates, particularly on out-of-distribution tasks. On the CHORES benchmark, which covers mobile manipulation and navigation, VLLR achieves up to 56% absolute success-rate gains over the pretrained policy, up to 5% gains over state-of-the-art RL finetuning methods on in-distribution tasks, and up to 10% gains on out-of-distribution tasks, all without manual reward engineering. Additional visualizations are available at https://silongyong.github.io/vllr_project_page/
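The abstract does not give the formula for the self-certainty intrinsic reward, only that it is computed from the policy's output action distribution (see Figure 1). Below is a minimal sketch assuming one common definition of self-certainty, the KL divergence from a uniform distribution to the policy's action distribution over a discrete action space; the function name and this exact formulation are assumptions, not necessarily the paper's.

```python
import math

import torch
import torch.nn.functional as F

def self_certainty(action_logits: torch.Tensor) -> torch.Tensor:
    """Per-step self-certainty of a discrete policy.

    Assumed formulation (not confirmed by the paper): KL(Uniform || pi),
    which is large when the policy is confident (peaked) and zero when it
    is maximally uncertain (uniform over actions).
    """
    log_probs = F.log_softmax(action_logits, dim=-1)  # (..., num_actions)
    num_actions = action_logits.shape[-1]
    # KL(U || pi) = -log|A| - (1/|A|) * sum_a log pi(a|s)
    return -math.log(num_actions) - log_probs.mean(dim=-1)
```

During PPO finetuning, such a quantity would typically be added to the environment reward at every step, scaled by a small coefficient.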

Figures (3)

  • Figure 1: Method overview. VLLR is a reward model with three components: a sparse task-success reward, an intrinsic reward, and an extrinsic reward (Sec. \ref{subsec:final}). The environment provides the task instruction and a scene-graph context to the LLM for task decomposition (Sec. \ref{subsec:subtask}). The decomposed subgoals are then fed into a VLM together with the current and previous observations (Sec. \ref{subsec:vlm_pe}). The VLM provides a noisy progress estimate, which is smoothed (Sec. \ref{subsec:smooth}) and used as a reward signal for initializing the value function (Sec. \ref{subsec:value}) used by PPO. The intrinsic reward is calculated from the policy's output action distribution (Sec. \ref{subsec:intrinsic}). Together with the task-success signal, it is fed into the PPO algorithm, which produces updates for finetuning the pretrained policy (Sec. \ref{subsec:final}).
  • Figure 2: An example comparing progress estimation from different VLMs: Qwen, Nova Pro, and CLIP. The VLMs are asked to estimate progress toward finding a laptop that is not in plain sight. Qwen tends to over-saturate its estimates, mostly because it falsely recognizes the laptop; CLIP is unstable, often wrong, and unable to identify overall task completion; Nova Pro estimates progress best, providing a correct signal as the robot sees the laptop and approaches it. The rollout is collected using A*, so the ground-truth progress should increase linearly.
  • Figure 3: An example where the raw progress estimate from Nova Pro is noisy and our method identifies the actual progress made by the agent. The task is to find a clock and grasp it. In the first column, the progress estimate is noisy throughout the rollout and cannot provide a meaningful reward signal. In the second column, our method produces three clear jumps in estimated progress: the first rewards the robot for seeing the clock, the second for positioning itself in front of the clock, and the third for actually picking up the clock. In the third column, we show the observations given to Nova Pro, with the clock highlighted by red boxes in frames 1 and 20. Frame 1 shows the robot seeing the clock and frame 20 shows the robot positioned in front of it; these two observations correspond to the first two jumps in our progress estimate. (A code sketch of one possible smoothing scheme follows this list.)
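This page does not reproduce the paper's smoothing procedure (Sec. \ref{subsec:smooth}), so the following is only a minimal sketch of one plausible scheme for turning a noisy per-step VLM progress estimate into the step-like signal shown in Figure 3. It assumes a persistence filter over raw estimates in [0, 1]; the function name `smooth_progress` and the `hold`/`min_jump` parameters are illustrative, not the paper's.

```python
import numpy as np

def smooth_progress(raw: np.ndarray, hold: int = 3, min_jump: float = 0.05) -> np.ndarray:
    """Turn noisy per-step progress estimates in [0, 1] into a monotone,
    step-like signal.

    Assumed stand-in for the paper's smoothing: progress only advances when
    the raw estimate stays at least `min_jump` above the current level for
    `hold` consecutive steps, so transient spikes are ignored.
    """
    smoothed = np.zeros_like(raw, dtype=float)
    level = 0.0   # current accepted progress level
    streak = 0    # consecutive steps above the acceptance threshold
    for t, p in enumerate(raw):
        if p >= level + min_jump:
            streak += 1
            if streak >= hold:   # sustained increase -> accept a jump
                level = float(p)
                streak = 0
        else:
            streak = 0           # transient spike, ignore
        smoothed[t] = level
    return smoothed
```

A per-step extrinsic reward for value-function warm-up could then be the difference smoothed[t] - smoothed[t-1], which is zero everywhere except at the accepted jumps, matching the three clear jumps in the second column of Figure 3.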