Enhancing Exploration with Diffusion Policies in Hybrid Off-Policy RL: Application to Non-Prehensile Manipulation

Huy Le, Tai Hoang, Miroslav Gabriel, Gerhard Neumann, Ngo Anh Vien

TL;DR

This work tackles exploration and generalization in hybrid off-policy reinforcement learning for non-prehensile manipulation by integrating a diffusion-based policy for continuous motion parameters with a discrete action (contact location) component optimized via Q-values within a Soft Actor-Critic framework. The authors derive a principled objective, showing the entropy term is bounded below through structured variational inference, and implement this within the Hybrid Diffusion Policy (HyDo) algorithm, including a variant using Consistency Models (HyDo+CM). Empirical evaluation across simulation and zero-shot sim2real tasks demonstrates that diffusion-based policies with entropy regularization yield more diverse and robust behaviors, significantly improving real-world success rates (e.g., from $53\%$ to $72\%$ on a 6D pose alignment task) and enhancing generalization to unseen objects. The results underscore the importance of multi-modal exploration in hybrid action spaces and suggest diffusion-based, entropy-regularized RL as a promising approach for complex manipulation tasks with real-world applicability. The findings have practical impact for robust autonomous manipulation, offering improved transfer from simulation to real hardware and greater adaptability in cluttered, varied environments.

Abstract

Learning diverse policies for non-prehensile manipulation is essential for improving skill transfer and generalization to out-of-distribution scenarios. In this work, we enhance exploration through a two-fold approach within a hybrid framework that tackles both discrete and continuous action spaces. First, we model the continuous motion parameter policy as a diffusion model, and second, we incorporate this into a maximum entropy reinforcement learning framework that unifies both the discrete and continuous components. The discrete action space, such as contact point selection, is optimized through Q-value function maximization, while the continuous part is guided by a diffusion-based policy. This hybrid approach leads to a principled objective, where the maximum entropy term is derived as a lower bound using structured variational inference. We propose the Hybrid Diffusion Policy algorithm (HyDo) and evaluate its performance on both simulation and zero-shot sim2real tasks. Our results show that HyDo encourages more diverse behavior policies, leading to significantly improved success rates across tasks; for example, the success rate increases from 53% to 72% on a real-world 6D pose alignment task. Project page: https://leh2rng.github.io/hydo
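
The hybrid selection described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy denoising loop stands in for a learned diffusion policy, `q_value` is a made-up critic, and all names and parameters (`sample_motion_params`, `select_hybrid_action`, `alpha`, `steps`, `dim`) are assumptions for illustration. It shows one plausible reading: a continuous motion parameter is sampled per candidate contact location, each pair is scored by a critic, and the discrete choice is drawn from a softmax over Q-values during exploration or taken greedily at execution.

```python
import math
import random


def sample_motion_params(state, contact_idx, rng, steps=5, dim=3):
    """Toy stand-in for a diffusion sampler (hypothetical, not a trained model).

    Starts from Gaussian noise and iteratively moves toward a made-up,
    contact-dependent target while injecting a shrinking amount of noise.
    """
    target = [math.sin(contact_idx + d) for d in range(dim)]
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for k in range(steps, 0, -1):
        noise_scale = 0.1 * (k - 1) / steps  # zero noise on the final step
        x = [
            xi + 0.5 * (ti - xi) + noise_scale * rng.gauss(0.0, 1.0)
            for xi, ti in zip(x, target)
        ]
    return x


def q_value(state, contact_idx, motion):
    """Hypothetical critic: an arbitrary score for a (contact, motion) pair."""
    return -sum((m - 0.5) ** 2 for m in motion) + 0.1 * contact_idx


def softmax(values, temperature):
    """Numerically stable softmax with a temperature (entropy weight)."""
    m = max(values)
    exps = [math.exp((v - m) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def select_hybrid_action(state, num_contacts, rng, alpha=0.1, explore=True):
    """Pick a discrete contact index and its continuous motion parameters."""
    # One continuous sample per candidate discrete action.
    motions = [sample_motion_params(state, i, rng) for i in range(num_contacts)]
    qs = [q_value(state, i, m) for i, m in enumerate(motions)]
    if explore:
        # Entropy-regularized choice: sample contact index from softmax(Q / alpha).
        probs = softmax(qs, alpha)
        idx = rng.choices(range(num_contacts), weights=probs, k=1)[0]
    else:
        # Greedy choice: contact location with the highest Q-value.
        idx = max(range(num_contacts), key=lambda i: qs[i])
    return idx, motions[idx], qs


if __name__ == "__main__":
    rng = random.Random(0)
    idx, motion, qs = select_hybrid_action(None, num_contacts=4, rng=rng)
    print("chosen contact:", idx, "motion:", motion)
```

In this sketch a larger `alpha` flattens the softmax and spreads exploration across contact locations, while `explore=False` reduces to picking the highest-Q pair.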

Paper Structure

This paper contains 21 sections, 16 equations, 6 figures, 4 tables, 1 algorithm.

Figures (6)

  • Figure 1: Overview of HyDo: The network takes point clouds, goal flow, and segmentation (indicating object and background points) as input. These are passed through the actor and critic networks. The actor outputs the continuous motion parameter, with exploration enhanced by an entropy regularizer applied during the diffusion process. The state-action pair is then evaluated by the critic, which also integrates entropy regularization for exploration over the discrete contact location. The motion parameter and contact location with the highest Q-value are selected and executed by the robot.
  • Figure 2: Set of five objects used for real robot evaluations.
  • Figure 3: A real-robot task showcasing the multi-modality of action sequences: (top) Push $\rightarrow$ Push $\rightarrow$ Push $\rightarrow$ Push $\rightarrow$ Flip; (bottom) Push $\rightarrow$ Push $\rightarrow$ Push $\rightarrow$ Flip $\rightarrow$ Push. In this task, we fixed the goal and initial pose and generated the action sequences with two different random seeds.
  • Figure 4: Left: A pushing task with $24$ behavioral modes. Right: Behavior entropy of different methodologies.
  • Figure 5: Left: A push and align task with $2$ behavioral modes. Right: Pareto plot between Entropy and Success Rates.
  • ...and 1 more figures
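
To give a feel for the behavior entropy plotted in Figure 4, here is a small sketch. It assumes the metric is the Shannon entropy of the empirical distribution over discrete behavioral modes; the exact definition used in the paper is not stated in this summary, and `behavior_entropy` is a hypothetical helper name. Under that assumption, a policy covering all $24$ modes uniformly reaches the maximum of $\ln 24 \approx 3.18$ nats, while a policy collapsed onto a single mode scores $0$.

```python
import math
from collections import Counter


def behavior_entropy(mode_labels):
    """Shannon entropy (in nats) of the empirical distribution over modes.

    `mode_labels` holds one discrete mode label per rollout (assumed format).
    """
    counts = Counter(mode_labels)
    n = len(mode_labels)
    # Adding 0.0 turns a possible -0.0 into 0.0 for the single-mode case.
    return -sum((c / n) * math.log(c / n) for c in counts.values()) + 0.0


if __name__ == "__main__":
    uniform_rollouts = list(range(24))  # every one of the 24 modes visited once
    collapsed_rollouts = [0] * 24       # the same mode in every rollout
    print("uniform:", behavior_entropy(uniform_rollouts))
    print("collapsed:", behavior_entropy(collapsed_rollouts))
```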