Enhancing Exploration with Diffusion Policies in Hybrid Off-Policy RL: Application to Non-Prehensile Manipulation
Huy Le, Tai Hoang, Miroslav Gabriel, Gerhard Neumann, Ngo Anh Vien
TL;DR
This work tackles exploration and generalization in hybrid off-policy reinforcement learning for non-prehensile manipulation. It combines a diffusion-based policy for the continuous motion parameters with a discrete action component (contact location) optimized via Q-values, all within a Soft Actor-Critic framework. The authors derive a principled objective in which the maximum entropy term appears as a lower bound obtained through structured variational inference, and implement it as the Hybrid Diffusion Policy (HyDo) algorithm, including a variant that uses Consistency Models (HyDo+CM). Experiments in simulation and on zero-shot sim2real tasks show that diffusion-based policies with entropy regularization yield more diverse and robust behaviors, raising real-world success rates (e.g., from 53% to 72% on a 6D pose alignment task) and improving generalization to unseen objects. The results underscore the importance of multi-modal exploration in hybrid action spaces and suggest that diffusion-based, entropy-regularized RL is a promising approach for robust autonomous manipulation, with better transfer from simulation to real hardware and greater adaptability in cluttered, varied environments.
Abstract
Learning diverse policies for non-prehensile manipulation is essential for improving skill transfer and generalization to out-of-distribution scenarios. In this work, we enhance exploration through a two-fold approach within a hybrid framework that handles both discrete and continuous action spaces. First, we model the continuous motion parameter policy as a diffusion model, and second, we incorporate this into a maximum entropy reinforcement learning framework that unifies both the discrete and continuous components. The discrete action space, such as contact point selection, is optimized through Q-value function maximization, while the continuous part is guided by a diffusion-based policy. This hybrid approach leads to a principled objective, where the maximum entropy term is derived as a lower bound using structured variational inference. We propose the Hybrid Diffusion Policy algorithm (HyDo) and evaluate its performance on both simulation and zero-shot sim2real tasks. Our results show that HyDo encourages more diverse behavior policies, leading to significantly improved success rates across tasks, for example increasing from 53% to 72% on a real-world 6D pose alignment task. Project page: https://leh2rng.github.io/hydo
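
To make the hybrid action structure concrete, the following is a minimal sketch, not the HyDo implementation. It assumes a small set of candidate contact locations, a toy iterative denoiser standing in for the diffusion policy, and a hand-written critic. All names (`denoise_sample`, `toy_q_value`, `select_hybrid_action`), dimensions, and constants are hypothetical. Sampling the contact index from a softmax over Q-values is likewise an assumption, chosen to illustrate entropy-regularized selection; lowering the temperature toward zero approaches plain Q-value maximization.

```python
import numpy as np


def denoise_sample(rng, state, contact_idx, n_steps=10, action_dim=3):
    """Toy stand-in for a diffusion policy's reverse process (hypothetical).

    Starts from Gaussian noise and repeatedly moves the sample toward a
    contact-dependent target while injecting noise that shrinks to zero.
    A real diffusion policy would use a learned denoising network instead.
    """
    target = np.tanh(state[:action_dim] + 0.1 * contact_idx)  # placeholder mode
    a = rng.standard_normal(action_dim)
    for k in range(n_steps, 0, -1):
        noise_scale = 0.1 * (k - 1) / n_steps  # zero on the final step
        a = a + 0.5 * (target - a) + noise_scale * rng.standard_normal(action_dim)
    return np.clip(a, -1.0, 1.0)


def toy_q_value(state, contact_idx, motion):
    """Hypothetical critic Q(s, a_d, a_c); a learned network in practice."""
    target = np.tanh(state[:motion.size] + 0.1 * contact_idx)
    return -np.sum((motion - target) ** 2) - 0.05 * contact_idx


def select_hybrid_action(rng, state, n_contacts=4, temperature=0.5):
    """Pick a (discrete contact, continuous motion) pair.

    1. Sample continuous motion parameters for every candidate contact.
    2. Score each (contact, motion) pair with the critic.
    3. Sample the contact index from a softmax over the Q-values
       (assumption: an entropy-regularized alternative to argmax).
    """
    motions = [denoise_sample(rng, state, i) for i in range(n_contacts)]
    q = np.array([toy_q_value(state, i, m) for i, m in enumerate(motions)])
    logits = q / temperature
    probs = np.exp(logits - logits.max())  # subtract max for numerical stability
    probs /= probs.sum()
    contact = int(rng.choice(n_contacts, p=probs))
    return contact, motions[contact], probs


rng = np.random.default_rng(0)
state = np.zeros(5)  # placeholder observation
contact, motion, probs = select_hybrid_action(rng, state)
print(contact, motion, probs)
```

Because the contact index is sampled rather than always taken as the argmax, lower-valued contacts are still tried occasionally, which is the kind of exploration a maximum entropy objective is meant to encourage.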
