FuRL: Visual-Language Models as Fuzzy Rewards for Reinforcement Learning

Yuwei Fu, Haichao Zhang, Di Wu, Wei Xu, Benoit Boulet

TL;DR

FuRL tackles sparse-reward RL by using pre-trained Vision-Language Models (VLMs) as a reward source while accounting for the fuzziness of that reward. It introduces reward alignment, which fine-tunes VLM-based rewards through lightweight projection heads trained with a contrastive/ranking loss, and relay RL, which helps the agent escape local minima during exploration. The approach outperforms baselines on Meta-World MT10 and remains effective with pixel-based observations, suggesting practical viability for VLM-assisted online RL. By combining alignment and staged exploration, FuRL improves sample efficiency and the robustness of policy learning in visually grounded tasks.

Abstract

In this work, we investigate how to leverage pre-trained visual-language models (VLM) for online Reinforcement Learning (RL). In particular, we focus on sparse reward tasks with pre-defined textual task descriptions. We first identify the problem of reward misalignment when applying VLM as a reward in RL tasks. To address this issue, we introduce a lightweight fine-tuning method, named Fuzzy VLM reward-aided RL (FuRL), based on reward alignment and relay RL. Specifically, we enhance the performance of SAC/DrQ baseline agents on sparse reward tasks by fine-tuning VLM representations and using relay RL to avoid local minima. Extensive experiments on the Meta-world benchmark tasks demonstrate the efficacy of the proposed method. Code is available at: https://github.com/fuyw/FuRL.
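
To make the "VLM as a reward" idea concrete, the following is a minimal sketch rather than the authors' implementation. It assumes, following the Figure 1 caption, that the VLM reward is the cosine similarity between a language embedding and an image embedding, and, following the Figure 2 caption, that it is used together with the sparse task reward. The function names, the toy embedding vectors, and the weight `rho` are hypothetical and chosen only for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Small epsilon guards against division by zero for all-zero vectors.
    return dot / (norm_u * norm_v + 1e-8)


def vlm_reward(text_embedding, image_embedding):
    """Fuzzy VLM reward: similarity of the task text and the current frame.

    In a real system both embeddings would come from a pre-trained VLM;
    here they are plain Python lists supplied by the caller.
    """
    return cosine_similarity(text_embedding, image_embedding)


def shaped_reward(sparse_reward, text_embedding, image_embedding, rho=0.05):
    """Sparse task reward plus a weighted VLM reward.

    `rho` is a hypothetical weighting coefficient, not a value from the paper.
    """
    return sparse_reward + rho * vlm_reward(text_embedding, image_embedding)


if __name__ == "__main__":
    text = [1.0, 0.0, 1.0]   # toy stand-in for the language embedding
    frame = [0.5, 0.5, 1.0]  # toy stand-in for the image embedding
    print(shaped_reward(0.0, text, frame))
```

Because the similarity only loosely tracks task progress, a reward of this form can mislead the agent; that is the misalignment the abstract refers to.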

Paper Structure (31 sections, 5 equations, 17 figures, 6 tables, 1 algorithm)

Figures (17)

  • Figure 1: Raw VLM reward is sub-optimal to teach RL agents. In this example, the text instruction $l$ is "press a button from the top". We plot the cosine similarity-based VLM reward with language embedding $\Phi_L(l)$ and image embedding $\Phi_I(o_{t})$ and also show the distance between the end-effector and the goal. It can be observed that the cosine similarity between $\Phi_L(l)$ and $\Phi_I(o_{t})$ can reflect some aspects of the task but is not always well aligned with the task progress, reflecting the fuzzy aspects of the VLM reward.
  • Figure 2: Fuzzy VLM reward effect. Visualization of end-effector trajectory in terms of $(x, y)$ positions. Oracle denotes an expert policy. VLM denotes the policy trained using sparse task reward together with VLM reward.
  • Figure 3: Illustration of the proposed method: (left) the overall pipeline of FuRL. (right) FuRL freezes the pre-trained VLM and only fine-tunes two MLP-based projection heads $f_{W_L}$, $f_{W_I}$.
  • Figure 4: Contrastive learning loss: (left) without any successful trajectories, we can use the L2 distance w.r.t. a goal image to rank the goodness of two negative samples; (right) once we have collected some successful trajectories, the contrastive loss learns to distinguish samples from successful and unsuccessful trajectories.
  • Figure 5: Illustration of the relay RL based exploration: at the beginning of an episode $\tau_i$, we randomly select a relay step $T_i$. We iteratively unroll a VLM agent and a SAC agent for $T_i$ steps and save the collected samples in a shared buffer. We turn off the relay exploration once we have collected 2500 positive samples from the successful trajectories.
  • ...and 12 more figures
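
The relay exploration described in the Figure 5 caption can be sketched as a per-episode schedule of which agent acts at each step. This is an illustrative sketch under stated assumptions, not the authors' code: it reads "iteratively unroll" as strict alternation in blocks of $T_i$ steps, assumes the VLM agent acts first, and assumes the VLM agent continues alone once relay is switched off. The names `relay_schedule` and `plan_episode`, the `max_relay_step` bound, and the `solo_agent` parameter are hypothetical; only the 2500-sample threshold comes from the caption.

```python
import random

# Threshold stated in the Figure 5 caption.
POSITIVE_SAMPLE_THRESHOLD = 2500


def relay_schedule(episode_length, relay_step, first="vlm", second="sac"):
    """Label each step with the acting agent, switching every `relay_step` steps.

    Assumption: the two agents strictly alternate and `first` acts first.
    """
    schedule = []
    for t in range(episode_length):
        block = t // relay_step  # index of the current relay block
        schedule.append(first if block % 2 == 0 else second)
    return schedule


def plan_episode(episode_length, num_positive_samples, rng,
                 max_relay_step=50, solo_agent="vlm"):
    """Plan one episode: relay while positives are scarce, else act solo.

    `max_relay_step` and `solo_agent` are hypothetical choices; the caption
    does not state the sampling range for T_i or which agent acts afterwards.
    """
    if num_positive_samples >= POSITIVE_SAMPLE_THRESHOLD:
        return [solo_agent] * episode_length
    relay_step = rng.randint(1, max_relay_step)  # random relay step T_i
    return relay_schedule(episode_length, relay_step)


if __name__ == "__main__":
    rng = random.Random(0)
    print(plan_episode(episode_length=12, num_positive_samples=0, rng=rng))
```

In a full training loop, the transitions produced under either label would be written to the same shared buffer, so each agent can learn from states the other one reached.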

Theorems & Definitions (1)

  • Definition 3.1