Dual-Granularity Contrastive Reward via Generated Episodic Guidance for Efficient Embodied RL

Xin Liu, Yixuan Li, Yuhui Chen, Yuxing Qin, Haoran Li, Dongbin Zhao

TL;DR

Extensive experiments show that DEG can not only serve as an efficient exploration stimulus to help the agent quickly discover sparse success rewards, but also guide effective RL and stable policy convergence independently.

Abstract

Designing suitable rewards poses a significant challenge in reinforcement learning (RL), especially for embodied manipulation. Trajectory success rewards are suitable for human judges or model fitting, but their sparsity severely limits RL sample efficiency. While recent methods have effectively improved RL via dense rewards, they rely heavily on high-quality human-annotated data or abundant expert supervision. To tackle these issues, this paper proposes Dual-granularity contrastive reward via generated Episodic Guidance (DEG), a novel framework that seeks sample-efficient dense rewards without requiring human annotations or extensive supervision. Leveraging the prior knowledge of large video generation models, DEG needs only a small number of expert videos for domain adaptation to generate dedicated task guidance for each RL episode. The proposed dual-granularity reward, which balances coarse-grained exploration and fine-grained matching, then guides the agent to efficiently approximate the generated guidance video sequentially in the contrastive self-supervised latent space and finally complete the target task. Extensive experiments on 18 diverse tasks across both simulation and real-world settings show that DEG can not only serve as an efficient exploration stimulus that helps the agent quickly discover sparse success rewards, but also guide effective RL and stable policy convergence independently.

Paper Structure

This paper contains 31 sections, 10 equations, 22 figures, 1 table.

Figures (22)

  • Figure 1: The pipeline of the proposed DEG. Without requiring human annotations or extensive supervision, DEG enables sample-efficient RL via dual-granularity contrastive dense reward based on the generated episodic video guidance.
  • Figure 2: Illustration of the effects of the coarse-grained exploration reward and the fine-grained matching reward. The top-left panel shows a schematic of the expert trajectory for the plate slide task. As an example, we take the 2D trajectory of the task's latter half, in which the arm pushes the plate to the target on the plane (bottom-left); the expert guidance, coarse-grained threshold, and fine-grained threshold are shown in different colors. This simple example is chosen to illustrate the core idea of our method clearly; in practical tasks, the expert trajectories are complex 3D curves. The right panel illustrates how different trajectories trigger each of the two rewards. Coarse-grained rewards use a larger threshold to guide sequential target imitation, which encourages the robotic arm to roughly mimic the movement intent in the guidance. However, (i) the larger threshold tolerates operations with deviations, making it hard to learn precise interactions ('reaching target' in this task); and (ii) because the imitation goals are sequential, trajectories that deviate at first but later correct themselves and even succeed no longer receive rewards. Fine-grained rewards directly tackle these two problems: they not only prioritize rewarding precise interactions to further refine the policy, but also ensure that trajectories achieving final success without sequential imitation receive positive feedback, thus reducing agent confusion.
  • Figure 3: Comparison with state-of-the-art reward engineering methods on 12 task-free tasks. DEG achieves both better final policy performance and better RL sample efficiency. Because some implementations are not open-source and computational resources are limited, we use the TeViR and RoboCLIP results reported in prior work (TeViR), which are available only for the first eight tasks.
  • Figure 4: When combined with the sparse success reward, DEG+ effectively improves RL efficiency and matches the dense reward annotated by human experts. It even outperforms the expert dense reward on several tasks, such as assembly, hammer, and drawer-open.
  • Figure 5: Smoothed training curves on real-world Franka manipulation tasks. DEG yields better RL efficiency and lower intervention rates than using the sparse success reward alone.
  • ...and 17 more figures
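
The interplay described in the Figure 2 caption can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the class name `DualGranularityReward`, the use of cosine distance, the threshold values, the bonus magnitudes, and the choice to match the fine-grained reward against any guidance frame are all hypothetical. It assumes the guidance video and the current observation have already been encoded into the same latent space.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity between two latent vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm > 0 else 1.0


@dataclass
class DualGranularityReward:
    """Hypothetical dual-granularity reward over guidance-frame latents."""

    guidance: List[Sequence[float]]  # latents of the guidance frames, in order
    coarse_threshold: float = 0.1    # loose threshold (assumed value)
    fine_threshold: float = 0.01     # tight threshold (assumed value)
    coarse_bonus: float = 1.0        # assumed reward magnitude
    fine_bonus: float = 2.0          # assumed reward magnitude
    next_index: int = 0              # next guidance frame to imitate

    def reset(self) -> None:
        """Restart sequential imitation at the beginning of an episode."""
        self.next_index = 0

    def step(self, obs_latent: Sequence[float]) -> float:
        reward = 0.0

        # Coarse-grained term: sequential imitation with a loose threshold.
        # Only the *next* guidance frame counts, so order matters.
        if self.next_index < len(self.guidance):
            target = self.guidance[self.next_index]
            if cosine_distance(obs_latent, target) <= self.coarse_threshold:
                reward += self.coarse_bonus
                self.next_index += 1

        # Fine-grained term: a tight match against any guidance frame,
        # independent of sequential progress, so a trajectory that deviated
        # earlier can still receive positive feedback.
        nearest = min(cosine_distance(obs_latent, g) for g in self.guidance)
        if nearest <= self.fine_threshold:
            reward += self.fine_bonus

        return reward
```

With three toy 2D guidance latents `[[1, 0], [1, 1], [0, 1]]`, an observation that is roughly aligned with the first frame earns only the coarse bonus, while an observation that exactly matches the last frame before the second frame has been reached earns only the fine bonus. This mirrors the two failure cases of coarse-only rewards listed in the caption.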