MVR: Multi-view Video Reward Shaping for Reinforcement Learning

Lirui Luo, Guoxi Zhang, Hongming Xu, Yaodong Yang, Cong Fang, Qing Li

TL;DR

Multi-View Video Reward Shaping (MVR) is a framework that models how relevant each state is to the target task using videos captured from multiple viewpoints. It also introduces a state-dependent reward shaping formulation that integrates task-specific rewards with VLM-based guidance.

Abstract

Reward design is of great importance for solving complex tasks with reinforcement learning. Recent studies have explored using image-text similarity produced by vision-language models (VLMs) to augment rewards of a task with visual feedback. A common practice linearly adds VLM scores to task or success rewards without explicit shaping, potentially altering the optimal policy. Moreover, such approaches, often relying on single static images, struggle with tasks whose desired behavior involves complex, dynamic motions spanning multiple visually different states. Furthermore, single viewpoints can occlude critical aspects of an agent's behavior. To address these issues, this paper presents Multi-View Video Reward Shaping (MVR), a framework that models the relevance of states regarding the target task using videos captured from multiple viewpoints. MVR leverages video-text similarity from a frozen pre-trained VLM to learn a state relevance function that mitigates the bias towards specific static poses inherent in image-based methods. Additionally, we introduce a state-dependent reward shaping formulation that integrates task-specific rewards and VLM-based guidance, automatically reducing the influence of VLM guidance once the desired motion pattern is achieved. We confirm the efficacy of the proposed framework with extensive experiments on challenging humanoid locomotion tasks from HumanoidBench and manipulation tasks from MetaWorld, verifying the design choices through ablation studies.
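
To make the idea of guidance that fades once the desired behavior is reached more concrete, here is a minimal sketch. It uses the combination $r^\text{task} + w\, r^\text{VLM}$ that appears in the Figure 4 caption; everything else is an assumption made for illustration and is not the paper's formulation. In particular, the gap-based `guidance_term`, the scalar `reference_level`, and the default weight `w=0.5` are hypothetical.

```python
def guidance_term(relevance: float, reference_level: float) -> float:
    """Toy stand-in for r_VLM (an assumption, not the paper's definition).

    Returns the negative gap between a state's relevance and a reference
    level, and zero once the state is at least as relevant as the reference.
    """
    return -max(0.0, reference_level - relevance)


def shaped_reward(r_task: float, relevance: float,
                  reference_level: float, w: float = 0.5) -> float:
    """Combine the task reward and the guidance as r_task + w * r_VLM."""
    return r_task + w * guidance_term(relevance, reference_level)


# A state far from the reference behavior is penalized by the guidance term.
far = shaped_reward(r_task=1.0, relevance=0.2, reference_level=0.8)

# A state that matches or exceeds the reference keeps its task reward as is.
aligned = shaped_reward(r_task=1.0, relevance=0.9, reference_level=0.8)
```

In this toy version the guidance vanishes exactly when the state's relevance reaches the reference level, so the task reward alone drives learning from that point on. The paper's actual state-dependent formulation may differ in form.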

Paper Structure (65 sections, 13 equations, 8 figures, 24 tables, 1 algorithm)

Figures (8)

  • Figure 1: The proposed MVR computes visual guidance using a VLM and videos collected from multiple viewpoints. In this example, the task requires a humanoid robot to run forward. Because they are captured from different viewpoints, the image sequences encode complementary information and enable a comprehensive evaluation of the agent's behavior. This example also illustrates the pitfall of using image-text similarity for dynamic tasks: running requires rhythmic alternation of the legs, but optimizing image-text similarity leads the agent to repeatedly reproduce the single pose that best matches "running". The shaping term prioritizes states that establish alternating leg cadence; once a stable gait and the target forward speed are achieved, its influence automatically decreases and the task reward takes precedence (see \ref{method}).
  • Figure 2: The entire framework of the proposed MVR. MVR periodically samples state sequences and renders them into videos from different viewpoints (step 2). It then queries a VLM for the similarity scores and video embeddings of the videos and augments its dataset $\mathcal{D}$ with state sequences, video embeddings, and similarity scores (step 3). Additionally, it keeps the state sequences with top-$k$ similarity scores in a reference set $\mathcal{D}^\text{ref}$. With $\mathcal{D}$, MVR updates the state relevance model (step 4). Lastly, using the latest state relevance model and the reference set, MVR computes visual feedback $r^\text{VLM}$ for the online RL agent (step 1), which is combined with task rewards $r^\text{task}$ through state-dependent reward shaping that automatically decays as the agent's behavior aligns with the reference set.
  • Figure 3: Method ablation and the influence of the number of views.
  • Figure 4: MVR identifies suboptimal states through reward shaping. We visualize states generated by a TQC agent for the Sit_Hard task and annotate them with $r^\text{task}$ (left) and the shaped reward $r^\text{MVR} = r^\text{task} + w r^\text{VLM}$ (right) computed by a trained MVR agent. While task rewards are high whenever the agent is close to the chair, MVR's visual guidance component assigns low values to improper sitting poses (sitting on the chair's leg, leaning, or sitting at the edge), effectively shaping the reward landscape to discourage these visually suboptimal but task-rewarding states.
  • Figure A1: Top-/bottom-ranked frames by $r^\text{task}$ and $r^\text{MVR}$ in the Sit_Hard task. The first two rows show the top and bottom frames under the task reward $r^\text{task}$, while the last two rows show the corresponding frames under the shaped reward $r^\text{MVR}$. For both rewards, the highest-ranked frames coincide with visually unambiguous successful sitting poses, confirming that MVR preserves the optimal behavior emphasized by the task reward. The bottom-ranked frames are dominated by clearly failed attempts under both signals, serving mainly as a sanity check that obviously bad states receive low values. Together with the qualitative analysis in \ref{fig:demo}, this suggests that the main effect of MVR is to reshuffle intermediate-reward states, down-weighting visually unstable but task-rewarding poses while leaving the optimal behavior unchanged.
  • ...and 3 more figures
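
The bookkeeping described in the Figure 2 caption, scoring videos from several viewpoints and keeping the state sequences with the top-$k$ similarity scores in $\mathcal{D}^\text{ref}$, can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: averaging the per-view scores in `aggregate_views` is an assumption (the captions do not say how views are combined), and the `ReferenceSet` class and its method names are hypothetical.

```python
import heapq
from statistics import mean
from typing import List, Sequence, Tuple


def aggregate_views(view_scores: Sequence[float]) -> float:
    """Assumed aggregation: average the per-view video-text similarities."""
    return mean(view_scores)


class ReferenceSet:
    """Keeps the k highest-scoring state sequences seen so far."""

    def __init__(self, k: int) -> None:
        self.k = k
        # Min-heap keyed on score, so the weakest kept entry sits at index 0.
        self._heap: List[Tuple[float, int, object]] = []
        # Insertion counter breaks score ties so payloads are never compared.
        self._counter = 0

    def add(self, score: float, state_sequence: object) -> None:
        entry = (score, self._counter, state_sequence)
        self._counter += 1
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, entry)
        elif score > self._heap[0][0]:
            # Evict the current weakest entry in favor of the new one.
            heapq.heapreplace(self._heap, entry)

    def scores(self) -> List[float]:
        """Scores of the kept sequences, highest first."""
        return sorted((s for s, _, _ in self._heap), reverse=True)


# Four rendered state sequences, each scored from two viewpoints.
ref = ReferenceSet(k=2)
per_view = [[0.1, 0.3], [0.6, 0.8], [0.4, 0.4], [0.9, 0.7]]
for idx, views in enumerate(per_view):
    ref.add(aggregate_views(views), state_sequence=f"seq-{idx}")
kept = ref.scores()
```

A min-heap keeps each insertion at $O(\log k)$, which matters when sequences are sampled periodically throughout training and the reference set is updated each time.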

Theorems & Definitions (1)

  • Definition 4.1