DiffusionRL: Efficient Training of Diffusion Policies for Robotic Grasping Using RL-Adapted Large-Scale Datasets
Maria Makarova, Qian Liu, Dzmitry Tsetserukou
TL;DR
This work tackles the data and adaptation bottlenecks of diffusion policies in robotics. It uses reinforcement learning to enhance the large-scale DexGraspNet dataset, trains lightweight diffusion models for dexterous grasping with a ShadowHand, and validates them with a pose-sampling procedure. The approach combines an RL agent (TD3) that augments grasp demonstrations, a diffusion model (a conditional U-Net within a DDPM) that predicts action sequences, and a statistically driven pose sampler that tests performance on novel poses. Experiments show a success rate of approximately 80% across three objects, while reducing manual data collection and improving generalization to unseen configurations. The method offers a scalable path to deploying diffusion-based policies in real-world robotic manipulation and can be extended to VLA-enhanced setups and broader tasks.
Abstract
Diffusion models have been successfully applied in areas such as image, video, and audio generation. Recent works show their promise for sequential decision-making and dexterous manipulation, leveraging their ability to model complex action distributions. However, challenges persist due to data limitations and the need for scenario-specific adaptation. In this paper, we address these challenges by proposing an optimized approach to training diffusion policies on large, pre-built datasets that are enhanced with Reinforcement Learning (RL). Our end-to-end pipeline combines RL-based enhancement of the DexGraspNet dataset, lightweight diffusion policy training on a dexterous manipulation task for a five-fingered robotic hand, and a pose sampling algorithm for validation. The pipeline achieved a success rate of 80% on three DexGraspNet objects. By eliminating manual data collection, our approach lowers barriers to adopting diffusion models in robotics, enhancing generalization and robustness for real-world applications.
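As background on the diffusion component, the following is a minimal sketch of the standard DDPM forward (noising) step applied to a flattened action sequence. It is not the authors' implementation: the linear noise schedule, the number of diffusion steps, the action dimensions, and all function names are assumptions chosen for illustration. During training, a denoising network such as a conditional U-Net would be asked to predict the returned noise from the noised actions.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (an assumed schedule, common in DDPM)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def noise_actions(actions, t, alpha_bars, rng):
    """Sample x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps.

    Returns the noised actions and eps, the noise a denoiser would regress.
    """
    signal = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    eps = [rng.gauss(0.0, 1.0) for _ in actions]
    noised = [signal * x + sigma * e for x, e in zip(actions, eps)]
    return noised, eps


rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_beta_schedule(100))

# Hypothetical action chunk: 4 timesteps x 3 joint targets, flattened.
actions = [0.1 * i for i in range(12)]
noised, eps = noise_actions(actions, t=50, alpha_bars=alpha_bars, rng=rng)
```

Because the noise is returned alongside the noised sample, the clean actions can be recovered exactly from the pair, which is what makes noise prediction a well-posed regression target.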
