GraspLDP: Towards Generalizable Grasping Policy via Latent Diffusion

Enda Xiang, Haoxiang Ma, Xinzhu Ma, Zicheng Liu, Di Huang

TL;DR

This paper incorporates grasp prior knowledge into the diffusion policy framework and introduces a self-supervised reconstruction objective during diffusion to embed the graspness prior. Experiments show that this approach significantly outperforms baseline methods and exhibits strong dynamic grasping capabilities.

Abstract

This paper focuses on enhancing the grasping precision and generalization of manipulation policies learned via imitation learning. Diffusion-based policy learning methods have recently become the mainstream approach for robotic manipulation tasks. As grasping is a critical subtask in manipulation, the ability of imitation-learned policies to execute precise and generalizable grasps merits particular attention. Existing imitation learning techniques for grasping often suffer from imprecise grasp executions, limited spatial generalization, and poor object generalization. To address these challenges, we incorporate grasp prior knowledge into the diffusion policy framework. In particular, we employ a latent diffusion policy to guide action chunk decoding with a grasp pose prior, ensuring that generated motion trajectories adhere closely to feasible grasp configurations. Furthermore, we introduce a self-supervised reconstruction objective during diffusion to embed the graspness prior: at each reverse diffusion step, we reconstruct wrist-camera images, with the graspness back-projected onto them, from the intermediate representations. Both simulation and real-robot experiments demonstrate that our approach significantly outperforms baseline methods and exhibits strong dynamic grasping capabilities.

Paper Structure (20 sections, 14 equations, 10 figures, 10 tables)

Figures (10)

  • Figure 1: We introduce GraspLDP, a generalizable grasping policy that integrates the prior from a grasp detector via latent diffusion. Prior works generally (a) predict the grasp pose (e.g., AnyGrasp [fang23anygrasp]) or (b) generate an action sequence (e.g., Diffusion Policy [DBLP:conf/rss/ChiFDXCBS23]) for grasping. In contrast, (c) our method extracts grasp priors from a pre-trained grasp detector for action refinement in latent space, and (d) achieves substantial advantages over previous works in diverse grasping tasks.
  • Figure 2: Framework of the proposed GraspLDP. In the Action Latent Learning stage, action chunks are refined under the guidance of a grasp pose in a latent space encoded by a VAE. In the Diffusion on Latent Action Space stage, the graspness cue is used to condition the diffusion model's denoising process and is reconstructed for enhancement.
  • Figure 3: Inference pre-processing. Our inference pipeline with the Heuristic Pose Selector.
  • Figure 4: Inference latency of three methods on an RTX 4090 GPU, with the policy action horizon aligned to 8 for each inference. Results for GraspVLA are measured after acceleration with torch.compile().
  • Figure 5: Qualitative experimental analysis. (a) Grasping trials using the objects "mug", "mustard bottle", and "thera med" in the simulator. (b) Real-world grasping trials corresponding to in-domain, object generalization, and visual generalization performance. In particular, we use colored LED strips in low-light conditions to simulate visual interference.
  • ...and 5 more figures
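
The conditioned reverse diffusion over a latent action space described in the abstract and in Figure 2 can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (make_schedule, toy_eps_model, sample_latent, decode_action_chunk), the linear noise schedule, and the step count are all assumptions made for illustration. The learned denoiser is replaced by an analytic stand-in that assumes the clean latent equals the conditioning vector, and the VAE decoder is replaced by a simple interpolation. The self-supervised reconstruction objective is not covered here.

```python
import math
import random


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear DDPM noise schedule (an assumption; the paper's schedule is not given here)."""
    betas = [beta_start + (beta_end - beta_start) * t / (num_steps - 1)
             for t in range(num_steps)]
    alphas = [1.0 - b for b in betas]
    alpha_bars, prod = [], 1.0
    for a in alphas:
        prod *= a
        alpha_bars.append(prod)
    return betas, alphas, alpha_bars


def toy_eps_model(z_t, t, cond, alpha_bars):
    """Hypothetical stand-in for the learned noise predictor.

    It assumes the clean latent is exactly `cond` (e.g. an embedding of the
    grasp prior), so the noise can be solved for in closed form.
    """
    ab = alpha_bars[t]
    return [(z - math.sqrt(ab) * c) / math.sqrt(1.0 - ab)
            for z, c in zip(z_t, cond)]


def sample_latent(cond, num_steps=50, seed=0):
    """Standard DDPM ancestral sampling in the latent space, conditioned on `cond`."""
    rng = random.Random(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    z = [rng.gauss(0.0, 1.0) for _ in cond]  # start from pure Gaussian noise
    for t in reversed(range(num_steps)):
        eps = toy_eps_model(z, t, cond, alpha_bars)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        mean = [(zi - coef * ei) / math.sqrt(alphas[t])
                for zi, ei in zip(z, eps)]
        if t > 0:
            sigma = math.sqrt(betas[t])  # one common choice of reverse variance
            z = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
        else:
            z = mean  # no noise is added at the final step
    return z


def decode_action_chunk(latent, horizon=8):
    """Hypothetical decoder: interpolates from zero to the latent over `horizon` steps.

    A real system would use the trained VAE decoder instead.
    """
    return [[(k + 1) / horizon * zi for zi in latent] for k in range(horizon)]


if __name__ == "__main__":
    cond = [0.3, -0.1, 0.5]  # made-up conditioning vector
    latent = sample_latent(cond)
    chunk = decode_action_chunk(latent, horizon=8)
    print(len(chunk), [round(v, 4) for v in chunk[-1]])
```

Because the toy denoiser knows the clean latent, the sampler recovers the conditioning vector at the last step. A trained network would only approximate this, and the action horizon of 8 is borrowed from the Figure 4 caption rather than from any stated training configuration.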