Preference VLM: Leveraging VLMs for Scalable Preference-Based Reinforcement Learning
Udita Ghosh, Dripta S. Raychaudhuri, Jiachen Li, Konstantinos Karydis, Amit Roy-Chowdhury
TL;DR
PrefVLM proposes a hybrid framework that uses Vision-Language Models to provide coarse trajectory-level preferences for reward learning in preference-based RL, significantly reducing human annotation needs. It couples VLM-based feedback with a lightweight, dynamics-aware adaptation and a KL-divergence–driven noise-mitigation strategy to refine the reward model. The approach achieves comparable or superior success to state-of-the-art baselines on five Meta-World manipulation tasks while requiring up to 2x less human feedback, and it demonstrates knowledge transfer across related tasks. This work demonstrates the practical scalability of foundation-model–assisted human-in-the-loop reinforcement learning for robotic manipulation.
Abstract
Preference-based reinforcement learning (RL) offers a promising approach for aligning policies with human intent but is often constrained by the high cost of human feedback. In this work, we introduce PrefVLM, a framework that integrates Vision-Language Models (VLMs) with selective human feedback to significantly reduce annotation requirements while maintaining performance. Our method leverages VLMs to generate initial preference labels, which are then filtered to identify uncertain cases for targeted human annotation. Additionally, we adapt VLMs using a self-supervised inverse dynamics loss to improve alignment with evolving policies. Experiments on Meta-World manipulation tasks demonstrate that PrefVLM achieves comparable or superior success rates to state-of-the-art methods while using up to 2x fewer human annotations. Furthermore, we show that adapted VLMs enable efficient knowledge transfer across tasks, further minimizing feedback needs. Our results highlight the potential of combining VLMs with selective human supervision to make preference-based RL more scalable and practical.
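The filtering step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the filtering rule, so everything here is an assumption made for illustration: the reward model is taken to score segment pairs with a Bradley-Terry preference probability, a VLM label is flagged as uncertain when its KL divergence from that prediction exceeds a fixed threshold, and the names `bt_preference_prob`, `binary_kl`, `split_by_kl`, and the threshold value are all hypothetical rather than taken from the paper.

```python
import math


def bt_preference_prob(return_a, return_b):
    """Bradley-Terry probability that segment A is preferred over segment B,
    given the summed predicted rewards of each segment (assumed model)."""
    # Clamp the difference so math.exp cannot overflow on extreme inputs.
    diff = max(min(return_b - return_a, 50.0), -50.0)
    return 1.0 / (1.0 + math.exp(diff))


def binary_kl(p, q, eps=1e-8):
    """KL divergence KL(Bernoulli(p) || Bernoulli(q)), with clipping for stability."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def split_by_kl(pairs, threshold=1.0):
    """Split VLM-labelled pairs into trusted ones and ones routed to a human.

    pairs: list of (return_a, return_b, vlm_label), where vlm_label is 1 if the
           VLM prefers segment A and 0 if it prefers segment B.
    threshold: hypothetical fixed cut-off on the KL divergence.
    Returns two lists of indices: (trusted, to_human).
    """
    trusted, to_human = [], []
    for idx, (return_a, return_b, vlm_label) in enumerate(pairs):
        predicted = bt_preference_prob(return_a, return_b)
        kl = binary_kl(float(vlm_label), predicted)
        if kl > threshold:
            to_human.append(idx)   # VLM label disagrees strongly with the reward model
        else:
            trusted.append(idx)    # VLM label is kept as-is
    return trusted, to_human


# Usage: the second pair's VLM label contradicts a confident reward-model prediction.
example_pairs = [(5.0, 1.0, 1), (5.0, 1.0, 0), (2.0, 2.1, 1)]
trusted_idx, human_idx = split_by_kl(example_pairs)
```

In this toy setup only pairs whose VLM label strongly contradicts the current reward model are sent for human annotation, which is one plausible way to keep the human labelling budget small; the actual criterion used by PrefVLM may differ.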
