Reward Generation via Large Vision-Language Model in Offline Reinforcement Learning
Younghwan Lee, Tung M. Luu, Donghoon Lee, Chang D. Yoo
TL;DR
Offline reinforcement learning on fixed datasets often suffers from sparse rewards, and dense reward labels are costly to obtain from human annotators. RG-VLM uses Large Vision-Language Models (LVLMs) to generate per-transition rewards from offline data through a two-stage querying process, so that language-conditioned IQL can learn from denser feedback. The approach improves generalization on long-horizon tasks and can augment sparse rewards, as shown by higher returns on a robotics-like ALFRED dataset and favorable generalization under randomized initial states. This scalable reward-generation framework could broaden the applicability of offline RL without human-in-the-loop labeling.
Abstract
In offline reinforcement learning (RL), learning from fixed datasets presents a promising solution for domains where real-time interaction with the environment is expensive or risky. However, designing dense reward signals for offline datasets requires significant human effort and domain expertise. Reinforcement learning from human feedback (RLHF) has emerged as an alternative, but it remains costly due to the human-in-the-loop process, prompting interest in automated reward generation models. To address this, we propose Reward Generation via Large Vision-Language Models (RG-VLM), which leverages the reasoning capabilities of large vision-language models (LVLMs) to generate rewards from offline data without human involvement. RG-VLM improves generalization in long-horizon tasks and can be seamlessly integrated with sparse reward signals to enhance task performance, demonstrating its potential as an auxiliary reward signal.
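The idea of using generated rewards as an auxiliary signal alongside sparse rewards can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the paper: the `Transition` fields, the `relabel` helper, the additive combination with a `weight` coefficient, and `stub_reward_fn`, which merely stands in for an LVLM query and does not reproduce the two-stage querying process.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Transition:
    """One step of an offline dataset (hypothetical layout)."""
    obs: str            # placeholder for an image observation
    action: str
    next_obs: str
    instruction: str    # language description of the task
    sparse_reward: float


# Any callable that scores a transition; in practice this would wrap an LVLM.
RewardFn = Callable[[Transition], float]


def relabel(dataset: List[Transition], reward_fn: RewardFn,
            weight: float = 1.0) -> List[float]:
    """Return one reward per transition.

    Assumption: the generated reward is simply added to the sparse reward,
    scaled by `weight`. Other ways of combining the two are possible.
    """
    return [t.sparse_reward + weight * reward_fn(t) for t in dataset]


def stub_reward_fn(t: Transition) -> float:
    """Stand-in for an LVLM query: reward any visible change in observation."""
    return 1.0 if t.obs != t.next_obs else 0.0


dataset = [
    Transition("frame_a", "open drawer", "frame_b", "put the mug away", 0.0),
    Transition("frame_b", "wait", "frame_b", "put the mug away", 1.0),
]
rewards = relabel(dataset, stub_reward_fn, weight=0.5)
```

The relabeled rewards could then replace the reward column of the dataset before it is passed to an offline RL algorithm such as IQL; setting `weight` to zero recovers the original sparse-reward dataset.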
