DiffusionRL: Efficient Training of Diffusion Policies for Robotic Grasping Using RL-Adapted Large-Scale Datasets

Maria Makarova, Qian Liu, Dzmitry Tsetserukou

TL;DR

This work addresses the data and adaptation bottlenecks of diffusion policies in robotics. It uses reinforcement learning to enhance the large-scale DexGraspNet dataset, trains a lightweight diffusion model for dexterous grasping with a ShadowHand, and validates the result with a pose-sampling algorithm. The pipeline combines three components: an RL agent (TD3) that augments the grasp demonstrations, a diffusion model (a Conditional U-net within a DDPM) that predicts action sequences, and a statistically driven pose sampler that tests performance on novel poses. Experiments report a success rate of approximately 80% across three objects, while reducing manual data collection and improving generalization to unseen configurations. The method offers a scalable path toward deploying diffusion-based policies in real-world robotic manipulation and could be extended to VLA-enhanced setups and broader tasks.

Abstract

Diffusion models have been successfully applied in areas such as image, video, and audio generation. Recent work shows their promise for sequential decision-making and dexterous manipulation, leveraging their ability to model complex action distributions. However, challenges persist due to data limitations and scenario-specific adaptation needs. In this paper, we address these challenges by proposing an optimized approach to training diffusion policies using large, pre-built datasets that are enhanced using Reinforcement Learning (RL). Our end-to-end pipeline combines RL-based enhancement of the DexGraspNet dataset, lightweight diffusion policy training on a dexterous manipulation task for a five-fingered robotic hand, and a pose sampling algorithm for validation. The pipeline achieved a success rate of 80% for three DexGraspNet objects. By eliminating manual data collection, our approach lowers barriers to adopting diffusion models in robotics, enhancing generalization and robustness for real-world applications.
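To make the diffusion-policy idea concrete, the following is a minimal sketch of the standard DDPM forward (noising) step applied to an action sequence. It is not taken from the paper: the linear beta schedule, the number of diffusion steps, the horizon, the action dimension, and all function names are illustrative assumptions. A denoiser such as the Conditional U-net would be trained to predict the returned noise from the noisy actions.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Evenly spaced noise variances beta_1..beta_T (illustrative values)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alpha_bar(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def q_sample(actions, t, alpha_bars, rng):
    """Noise a clean action sequence to diffusion step t (0-indexed).

    Returns (noisy_actions, noise). The noise is the regression target
    for an epsilon-prediction denoiser.
    """
    signal = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    noise = [[rng.gauss(0.0, 1.0) for _ in step] for step in actions]
    noisy = [
        [signal * a + sigma * e for a, e in zip(step, eps)]
        for step, eps in zip(actions, noise)
    ]
    return noisy, noise


rng = random.Random(0)
alpha_bars = cumulative_alpha_bar(linear_beta_schedule(100))

# Hypothetical clean demonstration: 4 time steps, 3 joint targets each.
clean_actions = [[0.1, -0.2, 0.3] for _ in range(4)]
noisy_actions, target_noise = q_sample(clean_actions, 50, alpha_bars, rng)
```

Because `alpha_bars` decreases monotonically, later diffusion steps retain less of the clean demonstration and more Gaussian noise, which is what lets the trained model generate action sequences by reversing the process from pure noise.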

Paper Structure

This paper contains 9 sections, 1 equation, 10 figures, 1 table.

Figures (10)

  • Figure 1: Inappropriate DexGraspNet samples.
  • Figure 2: (a) RL-based Dataset Enhancement Pipeline. The numbers indicate the data flow sequence during the training of the RL agent: from DexGraspNet, the data on the object pose and actions for the gripper are sent to the Environment (1), from where the Observations vector is received by the RL agent (2, 3), after which it predicts additional gripper actions for the Environment (4, 5), where the final Reward is calculated to validate the success of the object grasping process (6); (b) Environmental Timestamps Details during RL-agent Training and Dataset Recording.
  • Figure 3: Actor and Critics Neural Networks architectures.
  • Figure 4: Diffusion Model Conditional U-net Architecture. Conditional Residual Blocks are highlighted in yellow.
  • Figure 5: Recalculated Dataset Sample.
  • ...and 5 more figures