Preference VLM: Leveraging VLMs for Scalable Preference-Based Reinforcement Learning

Udita Ghosh, Dripta S. Raychaudhuri, Jiachen Li, Konstantinos Karydis, Amit Roy-Chowdhury

TL;DR

PrefVLM is a hybrid framework that uses Vision-Language Models to provide coarse trajectory-level preferences for reward learning in preference-based RL, significantly reducing human annotation needs. It couples VLM-based feedback with a lightweight, dynamics-aware adaptation and a KL-divergence–driven noise-mitigation strategy to refine the reward model. The approach achieves comparable or superior success rates to state-of-the-art baselines on five Meta-World manipulation tasks while requiring up to 2× less human feedback, and it demonstrates knowledge transfer across related tasks. The results indicate that foundation-model–assisted human-in-the-loop reinforcement learning can scale in practice for robotic manipulation.

Abstract

Preference-based reinforcement learning (RL) offers a promising approach for aligning policies with human intent but is often constrained by the high cost of human feedback. In this work, we introduce PrefVLM, a framework that integrates Vision-Language Models (VLMs) with selective human feedback to significantly reduce annotation requirements while maintaining performance. Our method leverages VLMs to generate initial preference labels, which are then filtered to identify uncertain cases for targeted human annotation. Additionally, we adapt VLMs using a self-supervised inverse dynamics loss to improve alignment with evolving policies. Experiments on Meta-World manipulation tasks demonstrate that PrefVLM achieves comparable or superior success rates to state-of-the-art methods while using up to 2× fewer human annotations. Furthermore, we show that adapted VLMs enable efficient knowledge transfer across tasks, further minimizing feedback needs. Our results highlight the potential of combining VLMs with selective human supervision to make preference-based RL more scalable and practical.
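The filtering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the standard Bradley-Terry preference model common in preference-based RL, measures disagreement as the KL divergence between a hard VLM label and the reward model's predicted preference, and treats the two thresholds (`tau_lower`, `tau_upper`) as simple cut-offs. The function names, the threshold semantics, and the "largest KL first" query rule are all assumptions made for illustration.

```python
import numpy as np

def bradley_terry_prob(returns_a, returns_b):
    """P(segment A preferred over B) under the standard Bradley-Terry model.

    returns_a / returns_b are summed predicted rewards for each segment.
    """
    diff = np.asarray(returns_a, dtype=float) - np.asarray(returns_b, dtype=float)
    return 1.0 / (1.0 + np.exp(-diff))

def label_kl(vlm_label, pred_prob, eps=1e-8):
    """KL(label || prediction) for hard labels in {0, 1} (1 means A preferred).

    For a one-hot label this reduces to the cross-entropy of the prediction.
    """
    p = np.clip(np.asarray(pred_prob, dtype=float), eps, 1.0 - eps)
    y = np.asarray(vlm_label, dtype=float)
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def split_and_query(kl, tau_lower, tau_upper, budget):
    """Assumed rule: KL <= tau_lower is clean, KL >= tau_upper is noisy.

    Returns indices of clean samples, noisy samples, and the noisy samples
    chosen for human annotation (largest KL first, capped by the budget).
    """
    kl = np.asarray(kl, dtype=float)
    clean = np.flatnonzero(kl <= tau_lower)
    noisy = np.flatnonzero(kl >= tau_upper)
    ranked = noisy[np.argsort(-kl[noisy], kind="stable")]
    return clean, noisy, ranked[:budget]

# Hypothetical batch of four segment pairs with VLM-generated labels.
returns_a = [2.0, 0.0, 0.0, 1.0]
returns_b = [0.0, 2.0, 0.0, -3.0]
vlm_labels = [1, 1, 1, 0]

kl = label_kl(vlm_labels, bradley_terry_prob(returns_a, returns_b))
clean, noisy, to_human = split_and_query(kl, tau_lower=0.3, tau_upper=1.0, budget=1)
print("clean:", clean, "noisy:", noisy, "sent to human:", to_human)
```

In this sketch, pairs whose KL falls between the two thresholds are neither trusted nor queried. How PrefVLM handles that middle band is not stated in this summary.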

Paper Structure

This paper contains 26 sections, 9 equations, 9 figures, 1 algorithm.

Figures (9)

  • Figure 1: PrefVLM leverages a VLM to obtain preference labels over pairs of the agent's trajectory segments. These preference labels are then used to train a reward function. In scenarios where the VLM exhibits high uncertainty, PrefVLM can seamlessly incorporate human feedback to refine its understanding and adapt the VLM to the specific environment. By combining machine-generated and expert-guided feedback, PrefVLM learns a high-quality reward function while significantly reducing the amount of human supervision required compared to existing preference-based RL methods.
  • Figure 2: VLM reward (Eqn. \ref{eq:cosine_sim}) for an optimal trajectory given the task description "Open a door with a revolving joint." Although the reward reflects partial task progression, it is noisy and poorly aligned with the actual task progress, as evident from the image observations.
  • Figure 3: Overview of our approach. Given a task description, PrefVLM iteratively updates the policy $\pi_\phi$ via reinforcement learning using the reward model $r_\theta$. Trajectory segments from the replay buffer are sampled and labeled with VLM-generated preferences. These samples are then classified as clean or noisy using thresholds $\tau_{upper}$ and $\tau_{lower}$. A budgeted subset of noisy samples is sent for human annotation. The reward model is trained on both VLM and human-labeled preferences, while the VLM is fine-tuned using human annotations and replay buffer samples.
  • Figure 4: Learning curves for all methods on the 5 Meta-World tasks. PrefVLM consistently outperforms all baselines with minimal human feedback and matches or exceeds PEBBLE’s performance while using $2\times$ fewer annotations. Results are averaged over 5 seeds, with shaded regions indicating the standard error.
  • Figure 5: Success rate as a function of human feedback. PrefVLM leverages human feedback more efficiently by complementing it with VLM-based feedback, resulting in higher success rates with fewer human annotations. Results are averaged over 5 seeds, with shaded regions representing the standard error.
  • ...and 4 more figures
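
The Figure 2 caption refers to a cosine-similarity VLM reward, and Figure 1 describes preference labels over pairs of trajectory segments. The sketch below shows one way these could fit together, assuming the reward is the cosine similarity between a frame embedding and the task-description embedding, and assuming a segment is preferred when its summed reward is higher. The embeddings here are hand-written placeholders rather than outputs of a real VLM, and both function names are hypothetical.

```python
import numpy as np

def cosine_similarity_reward(image_emb, text_emb, eps=1e-8):
    """Cosine similarity between frame embedding(s) and a text embedding.

    image_emb: shape (D,) for one frame or (T, D) for a segment of T frames.
    text_emb:  shape (D,), the embedded task description.
    """
    image_emb = np.asarray(image_emb, dtype=float)
    text_emb = np.asarray(text_emb, dtype=float)
    dot = image_emb @ text_emb
    norms = np.linalg.norm(image_emb, axis=-1) * np.linalg.norm(text_emb)
    return dot / (norms + eps)

def vlm_preference(frames_a, frames_b, text_emb):
    """Assumed labeling rule: 1 if segment A has the higher summed reward, else 0."""
    score_a = cosine_similarity_reward(frames_a, text_emb).sum()
    score_b = cosine_similarity_reward(frames_b, text_emb).sum()
    return int(score_a > score_b)

# Placeholder 2-D embeddings; a real system would obtain these from a VLM encoder.
task = np.array([1.0, 0.0])
segment_a = np.array([[1.0, 0.0], [1.0, 0.0]])  # frames aligned with the task text
segment_b = np.array([[0.0, 1.0], [1.0, 1.0]])  # frames partly or not aligned

print("per-frame rewards (B):", cosine_similarity_reward(segment_b, task))
print("VLM label:", vlm_preference(segment_a, segment_b, task))
```

Because the per-frame reward is noisy, as the Figure 2 caption notes, comparing whole segments yields only a coarse label, which is consistent with the summary's description of VLM feedback as coarse and in need of filtering.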