Reward Generation via Large Vision-Language Model in Offline Reinforcement Learning

Younghwan Lee, Tung M. Luu, Donghoon Lee, Chang D. Yoo

TL;DR

Offline reinforcement learning with fixed datasets benefits from automated, dense reward signals but often suffers from sparse rewards and costly human labeling. RG-VLM leverages Large Vision-Language Models to generate per-transition rewards from offline data in a two-stage querying process, enabling language-conditioned IQL to learn from richer feedback. The approach improves long-horizon task generalization and can augment sparse rewards, as shown by superior returns on a robotics-like ALFRED dataset and favorable generalization under randomized initial states. This scalable reward-generation framework holds promise for broadening offline RL applicability without human-in-the-loop labeling.

Abstract

In offline reinforcement learning (RL), learning from fixed datasets presents a promising solution for domains where real-time interaction with the environment is expensive or risky. However, designing dense reward signals for offline datasets requires significant human effort and domain expertise. Reinforcement learning with human feedback (RLHF) has emerged as an alternative, but it remains costly due to the human-in-the-loop process, prompting interest in automated reward generation models. To address this, we propose Reward Generation via Large Vision-Language Models (RG-VLM), which leverages the reasoning capabilities of LVLMs to generate rewards from offline data without human involvement. RG-VLM improves generalization in long-horizon tasks and can be seamlessly integrated with sparse reward signals to enhance task performance, demonstrating its potential as an auxiliary reward signal.

Paper Structure

This paper contains 10 sections, 3 equations, 4 figures, 1 table, 1 algorithm.

Figures (4)

  • Figure 1: Unlike RLHF, we leverage the advanced reasoning capabilities of LVLM to generate rewards, eliminating the need for human involvement during the reward labeling process. This provides a more flexible and scalable approach that can be applied to a wide range of tasks.
  • Figure 2: Algorithm flow of RG-VLM and offline RL training. In the RG-VLM reward generation process, the LVLM analyzes sequences of visual observations, actions, and task goals to generate rewards for the offline dataset. The reward-labeled offline dataset is then used for offline RL training.
  • Figure 3: Example of the RG-VLM querying process. The LVLM is queried in two stages. In the first stage, the LVLM analyzes sequences of visual observations, actions, and the task goal to understand how actions affect task progression. In the second stage, the LVLM assigns reward scores (0 to 10) based on each action's contribution to the task.
  • Figure 4: Proportion of Task Completion for 5 different methods across tasks with 1 to 6 sub-tasks. Left: Comparison of IQL with Sparse and RG-VLM rewards against other VLM-based methods and Sparse rewards only. Right: Ablation study showing the impact of combining RG-VLM with sparse rewards across various task lengths.
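
The two-stage querying process in Figure 3, and the use of RG-VLM as an auxiliary signal alongside sparse rewards, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The `LVLM` callable interface, the prompt wording, the `score: <number>` reply format, the stub `fake_lvlm`, and the additive combination rule in `combine_with_sparse` are all assumptions made for illustration. Only the two-stage structure and the 0 to 10 score range come from the captions above.

```python
import re
from typing import Callable, List, Sequence

# Hypothetical interface: an LVLM is any callable that takes a text prompt and
# a sequence of image frames and returns a text reply. Not a real library API.
LVLM = Callable[[str, Sequence[object]], str]

# Assumed reply format: one "score: <number>" entry per action.
SCORE_PATTERN = re.compile(r"score\s*[:=]\s*(\d+(?:\.\d+)?)", re.IGNORECASE)


def parse_scores(reply: str, n_actions: int) -> List[float]:
    """Extract one score per action and clamp each to the range [0, 10]."""
    scores = [float(s) for s in SCORE_PATTERN.findall(reply)]
    if len(scores) != n_actions:
        raise ValueError(f"expected {n_actions} scores, got {len(scores)}")
    return [min(10.0, max(0.0, s)) for s in scores]


def query_rewards(lvlm: LVLM, frames: Sequence[object],
                  actions: Sequence[str], goal: str) -> List[float]:
    """Query the LVLM in two stages and return one 0-10 score per action."""
    steps = "\n".join(f"{i + 1}. {a}" for i, a in enumerate(actions))
    # Stage 1: ask how each action affects progress toward the goal.
    analysis = lvlm(
        f"Stage 1. Task goal: {goal}\nActions:\n{steps}\n"
        "Describe how each action affects task progression.",
        frames,
    )
    # Stage 2: ask for a score per action, conditioned on the stage-1 analysis.
    reply = lvlm(
        f"Stage 2. Analysis: {analysis}\n"
        "Rate each action's contribution to the task from 0 to 10, "
        "one line per action, formatted as 'score: <number>'.",
        frames,
    )
    return parse_scores(reply, len(actions))


def combine_with_sparse(vlm_scores: Sequence[float],
                        sparse_rewards: Sequence[float],
                        weight: float = 1.0) -> List[float]:
    """Assumed rule: sparse reward plus weight times the score scaled to [0, 1]."""
    if len(vlm_scores) != len(sparse_rewards):
        raise ValueError("score and sparse-reward sequences differ in length")
    return [r + weight * s / 10.0 for s, r in zip(vlm_scores, sparse_rewards)]


def fake_lvlm(prompt: str, frames: Sequence[object]) -> str:
    """Stand-in for a real LVLM so the sketch runs offline."""
    if prompt.startswith("Stage 2"):
        return "score: 2\nscore: 9"
    return "Moving to the counter enables the pickup; the pickup meets the goal."
```

For a two-action trajectory, `query_rewards(fake_lvlm, frames, ["move to counter", "pick up mug"], "pick up the mug")` returns one score per action. Those scores can then be passed to `combine_with_sparse` together with the dataset's sparse rewards before offline RL training. The `weight` parameter is a hypothetical knob for balancing the two signals.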