Goal-Driven Reward by Video Diffusion Models for Reinforcement Learning

Qi Wang, Mian Wu, Yuyang Zhang, Mingqi Yuan, Wenyao Zhang, Haoxiang You, Yunbo Wang, Xin Jin, Xiaokang Yang, Wenjun Zeng

TL;DR

GenReward addresses reward design in reinforcement learning by using pretrained video diffusion models to provide goal-driven reward signals. It introduces a video-level reward based on latent similarity to generated goal videos and a frame-level reward from a forward-backward representation whose goal state is a CLIP-selected frame, and it combines both with the environment reward. The approach achieves competitive or superior performance on Meta-World manipulation tasks and is robust to domain shifts; its acknowledged limitation is extra computation. The work reduces reward-engineering effort and shows how generative priors can guide goal-directed behavior in complex tasks.

Abstract

Reinforcement Learning (RL) has achieved remarkable success in various domains, yet it often relies on carefully designed programmatic reward functions to guide agent behavior. Designing such reward functions can be challenging, and the resulting rewards may not generalize well across tasks. To address this limitation, we leverage the rich world knowledge contained in pretrained video diffusion models to provide goal-driven reward signals for RL agents without ad-hoc reward design. Our key idea is to exploit off-the-shelf video diffusion models pretrained on large-scale video datasets as informative reward functions in terms of video-level and frame-level goals. For video-level rewards, we first finetune a pretrained video diffusion model on domain-specific datasets and then employ its video encoder to evaluate the alignment between the latent representations of the agent's trajectories and the generated goal videos. To enable more fine-grained goal achievement, we derive a frame-level goal by using CLIP to identify the most relevant frame in the generated video, which serves as the goal state. We then employ a learned forward-backward representation, which represents the probability of visiting the goal state from a given state-action pair, as the frame-level reward, promoting more coherent and goal-driven trajectories. Experiments on various Meta-World tasks demonstrate the effectiveness of our approach.
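
As a rough illustration of the two goal signals described in the abstract, the sketch below scores an agent trajectory against a generated goal video and picks a goal frame from that video. It assumes all embeddings are already computed and passed in as plain lists of floats. The function names, the use of cosine similarity as the alignment measure, and the idea of matching frames against a task-description embedding are illustrative assumptions, not details confirmed by the paper.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def video_level_reward(trajectory_latent, goal_video_latent):
    """Assumed video-level reward: similarity between the latent of the agent's
    trajectory and the latent of the generated goal video."""
    return cosine_similarity(trajectory_latent, goal_video_latent)

def select_goal_frame(frame_embeddings, task_embedding):
    """Assumed frame selection: index of the generated frame whose embedding
    is most similar to the task embedding (stand-ins for CLIP features)."""
    scores = [cosine_similarity(f, task_embedding) for f in frame_embeddings]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy vectors standing in for encoder outputs (hypothetical values).
reward = video_level_reward([1.0, 0.0], [1.0, 0.0])
goal_index = select_goal_frame([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]], [0.0, 1.0])
```

In a real pipeline the latents would come from the finetuned video diffusion model's encoder and the frame features from CLIP; here they are toy inputs so the scoring logic can be read in isolation.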

Paper Structure

This paper contains 24 sections, 10 equations, 11 figures, 2 tables.

Figures (11)

  • Figure 1: Overview of our proposed framework. The key idea is to leverage generated goal-conditioned videos for world knowledge transfer, enabling the downstream agent to improve performance on unseen tasks.
  • Figure 2: Pipeline of GenReward, which computes goal-driven rewards for the agent's behavior learning using a generative prior. During online interaction with the environment, at regular intervals, we use the correlation between the latent representations of the agent's observations and the generated goal videos as video-level rewards. Meanwhile, we learn a forward-backward model that measures the probability of reaching the CLIP-selected goal state from a given state-action pair, providing a frame-level reward for fine-grained goal achievement.
  • Figure 3: Goal-driven action selection. The learned representation space enables goal-directed control by selecting the action whose forward representation of the current state-action pair most closely aligns with the backward representation of the goal state.
  • Figure 4: Illustration of our experimental setups, with generated videos and image observations from the environments.
  • Figure 5: Performance on Meta-World complex manipulation tasks in terms of episode return under the dense-reward setting.
  • ...and 6 more figures
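
The goal-driven action selection described in the Figure 3 caption can be sketched as follows. This is a minimal illustration that assumes the forward representation of each candidate state-action pair and the backward representation of the goal state are already available as plain vectors, and that alignment is measured by an inner product. The names `frame_level_reward` and `select_action`, the discrete action set, and the toy vectors are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def frame_level_reward(forward_repr, backward_goal):
    """Assumed frame-level score: alignment between the forward representation
    of a state-action pair and the backward representation of the goal state."""
    return dot(forward_repr, backward_goal)

def select_action(forward_by_action, backward_goal):
    """Pick the action whose forward representation aligns best with the goal.

    forward_by_action maps each candidate action to the forward representation
    of (current state, that action).
    """
    return max(
        forward_by_action,
        key=lambda action: frame_level_reward(forward_by_action[action], backward_goal),
    )

# Toy example with two candidate actions (hypothetical values).
candidates = {"left": [1.0, 0.0], "right": [0.2, 0.9]}
goal_repr = [0.0, 1.0]
best_action = select_action(candidates, goal_repr)
```

With continuous actions, as in robotic manipulation, the maximization over a fixed candidate set would be replaced by a learned policy or a sampling-based search; the discrete version is used here only to keep the alignment step visible.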