VARP: Reinforcement Learning from Vision-Language Model Feedback with Agent Regularized Preferences

Anukriti Singh, Amisha Bhaskar, Peihong Yu, Souradip Chakraborty, Ruthwik Dasyam, Amrit Bedi, Pratap Tokekar

TL;DR

The paper addresses reward misalignment and annotation scalability in robotic RL by introducing VARP, which adds trajectory sketches to final observations and couples VLM-based preferences with agent-aware reward regularization. This two-pronged approach improves the accuracy of preference signals and aligns the learned reward with the evolving policy, mitigating reward hacking. Empirical results on MetaWorld and DMControl show substantial gains: preference accuracy rises from ~68% to ~84%, episodic returns improve by ~20-30%, and task success rates approach 80% from below 50%. Overall, VARP demonstrates scalable, interpretable preference-based RL that combines richer visual feedback with policy-consistent reward shaping for robust robotic learning.

Abstract

Designing reward functions for continuous-control robotics often leads to subtle misalignments or reward hacking, especially in complex tasks. Preference-based RL mitigates some of these pitfalls by learning rewards from comparative feedback rather than hand-crafted signals, yet scaling human annotations remains challenging. Recent work uses Vision-Language Models (VLMs) to automate preference labeling, but a single final-state image generally fails to capture the agent's full motion. In this paper, we present a two-part solution that both improves feedback accuracy and better aligns reward learning with the agent's policy. First, we overlay trajectory sketches on final observations to reveal the path taken, allowing VLMs to provide more reliable preferences and improving preference accuracy by approximately 15-20% in MetaWorld tasks. Second, we regularize reward learning by incorporating the agent's performance, ensuring that the reward model is optimized based on data generated by the current policy; this addition boosts episode returns by 20-30% in locomotion tasks. Empirical studies on MetaWorld demonstrate that our method achieves, for instance, around 70-80% success rate in all tasks, compared to below 50% for standard approaches. These results underscore the efficacy of combining richer visual representations with agent-aware reward regularization.
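
To make "learning rewards from comparative feedback" concrete, the following is a minimal sketch of the Bradley-Terry cross-entropy objective commonly used in preference-based RL. It is an illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, the per-step rewards stand in for the outputs of a learned reward model, and the label convention (1.0, 0.0, or 0.5 for no preference) is assumed. The agent-aware regularization term is not specified in this summary and is therefore not shown.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def preference_prob(rewards_a, rewards_b):
    """Bradley-Terry probability that segment A is preferred over segment B.

    rewards_a / rewards_b are per-step rewards predicted by a reward model
    (hypothetical inputs; here they are plain lists of floats).
    """
    return sigmoid(sum(rewards_a) - sum(rewards_b))


def preference_loss(rewards_a, rewards_b, label):
    """Cross-entropy between the predicted preference and a label.

    Assumed label convention: 1.0 if A is preferred, 0.0 if B is preferred,
    0.5 if the labeler (here, a VLM) expresses no preference.
    """
    p = preference_prob(rewards_a, rewards_b)
    eps = 1e-12  # guards against log(0)
    return -(label * math.log(p + eps) + (1.0 - label) * math.log(1.0 - p + eps))
```

Minimizing this loss over many labeled pairs pushes the reward model to assign a higher summed reward to whichever segment the labeler preferred, which is why label accuracy directly affects the quality of the learned reward.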

Paper Structure

This paper contains 16 sections, 8 equations, 6 figures, 1 algorithm.

Figures (6)

  • Figure 1: Left: This diagram breaks down our method into four key stages: (1) sketch data generation from the full trajectory, (2) two-stage VLM preference querying using the sketched observations, (3) reward model training that balances VLM feedback with agent performance, and (4) policy optimization using the learned reward. This comprehensive approach enhances feedback accuracy and stabilizes policy learning. Right: Comparison of VLM preference outputs when using final-state images with (VARP) and without [wang2024rl] trajectory sketches. The added sketches provide crucial temporal context, resulting in more accurate preference judgments.
  • Figure 2: Illustration of 2D trajectory sketch generation. For each episode, the robot’s full trajectory (denoted by $\tau$) is projected onto the camera’s 2D plane using known parameters, and overlaid on the final state image $o$ to form an augmented observation $\hat{o}=(o,\mathrm{Sketch}(\tau))$. This enriched representation provides additional temporal context, enabling the VLM to more accurately compare and assess trajectory performance.
  • Figure 3: Impact of Trajectory Sketches on Preference Accuracy. Top row: VLM predictions using trajectory sketches. Bottom row: VLM predictions without sketches. The results show that incorporating sketches dramatically improves the accuracy of preference judgments, particularly when the difference in task progress between image pairs increases. This confirms that visualizing the entire trajectory—not just the final state—provides essential context for reliable feedback.
  • Figure 4: Performance Comparison on MetaWorld Tasks. We compare VARP against baseline approaches across three MetaWorld tasks (Drawer Open, Soccer, Sweep Into). Our method, which combines trajectory sketches with agent preference regularization, consistently achieves higher episode rewards. The plot highlights how the enhanced preference accuracy translates into faster training and better overall policy performance, closing the gap to oracle-level behavior.
  • Figure 5: Evaluating RLHF Accuracy with VARP in MetaWorld. This figure quantifies the accuracy of our reward learning by comparing RLHF predictions to ground-truth preferences derived from the environment’s reward function. The results demonstrate that the agent preference regularization substantially reduces reward misalignment (and hence reward hacking), ensuring more stable and effective policy improvements.
  • ...and 1 more figures
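
The sketch generation described in the Figure 2 caption (projecting the trajectory $\tau$ onto the camera's 2D plane and overlaying it on the final image $o$) can be illustrated with a standard pinhole camera model. This is a minimal sketch under assumptions, not the authors' code: the function names, the intrinsics `fx, fy, cx, cy`, the extrinsics `R, t`, and the nested-list image representation are all hypothetical, and only the projected points are marked rather than connected into a continuous line.

```python
def project_point(p_world, R, t, fx, fy, cx, cy):
    """Project one 3D world point to pixel coordinates (u, v).

    Assumes a pinhole camera with world-to-camera rotation R (3x3 nested
    list) and translation t (length 3). Returns None for points at or
    behind the camera plane.
    """
    xc, yc, zc = (
        sum(R[i][j] * p_world[j] for j in range(3)) + t[i] for i in range(3)
    )
    if zc <= 0:
        return None
    return (fx * xc / zc + cx, fy * yc / zc + cy)


def sketch_pixels(trajectory, R, t, fx, fy, cx, cy, width, height):
    """Project a trajectory and keep the integer pixels that fall inside the image."""
    pixels = []
    for p in trajectory:
        uv = project_point(p, R, t, fx, fy, cx, cy)
        if uv is None:
            continue
        u, v = int(round(uv[0])), int(round(uv[1]))
        if 0 <= u < width and 0 <= v < height:
            pixels.append((u, v))
    return pixels


def overlay_sketch(image, pixels, color):
    """Return a copy of the image (list of rows) with sketch pixels set to color."""
    out = [row[:] for row in image]
    for u, v in pixels:
        out[v][u] = color  # row index is v (vertical), column index is u
    return out


# Usage: a 64x48 blank image, identity camera pose, three trajectory points.
R = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
t = [0, 0, 0]
trajectory = [(0.0, 0.0, 1.0), (0.1, 0.0, 1.0), (0.0, 0.0, -1.0)]
final_image = [[0] * 64 for _ in range(48)]
px = sketch_pixels(trajectory, R, t, 100, 100, 32, 24, 64, 48)
augmented = overlay_sketch(final_image, px, 1)
```

The resulting `augmented` image plays the role of $\hat{o}=(o,\mathrm{Sketch}(\tau))$ in the caption: the final frame plus a visual trace of where the agent has been.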