GenDP: 3D Semantic Fields for Category-Level Generalizable Diffusion Policy

Yixuan Wang, Guang Yin, Binghao Huang, Tarik Kelestemur, Jiuguang Wang, Yunzhu Li

TL;DR

A novel framework is introduced that incorporates explicit spatial and semantic information into Diffusion Policy via 3D semantic fields, enabling strong generalization in tasks that require category-level generalization, resolution of geometric ambiguities, and attention to subtle geometric details.

Abstract

Diffusion-based policies have shown remarkable capability in executing complex robotic manipulation tasks but lack explicit characterization of geometry and semantics, which often limits their ability to generalize to unseen objects and layouts. To enhance the generalization capabilities of Diffusion Policy, we introduce a novel framework that incorporates explicit spatial and semantic information via 3D semantic fields. We generate 3D descriptor fields from multi-view RGBD observations with large vision foundation models, then compare these descriptor fields against reference descriptors to obtain semantic fields. The proposed method explicitly considers geometry and semantics, enabling strong generalization in tasks that require category-level generalization, resolution of geometric ambiguities, and attention to subtle geometric details. We evaluate our method across eight tasks involving articulated objects and instances with varying shapes and textures from multiple object categories. Our method demonstrates its effectiveness by increasing Diffusion Policy's average success rate on unseen instances from 20% to 93%. Additionally, we provide a detailed analysis and visualization to interpret the sources of performance gain and explain how our method can generalize to novel instances.

Paper Structure

This paper contains 15 sections, 4 equations, 8 figures, 1 algorithm.

Figures (8)

  • Figure 1: Generalizable Diffusion Policy using 3D Semantic Fields. Our approach introduces a diffusion policy capable of generalizing to new instances within a category by utilizing 3D semantic fields. These fields distinguish semantically meaningful parts of objects in 3D space, as illustrated in the heatmap example. Panel (a) on the left showcases daily task examples where semantic understanding is crucial, while panel (b) on the right demonstrates our method's ability to highlight semantically meaningful parts, such as a knife handle, and how our predicted policy accomplishes these tasks using the 3D semantic fields.
  • Figure 2: Method Overview. The top row (a) shows a sequence of real policy rollouts in the shoe-aligning task. We first take in multi-view RGBD observations (i), then extract the 3D descriptor field, in which each point carries a high-dimensional descriptor (ii) (D3Fields; Wang et al., 2023). We then select reference features from 2D reference images. By computing the cosine similarity between the descriptor field and the 2D reference semantic features, we obtain several semantic fields (iii). These semantic fields, concatenated with the point cloud, are then input into PointNet++ and the diffusion policy to output predicted actions (iv).
  • Figure 3: Real Experiment Setup. (a) We use four RealSense cameras to capture RGBD observations and ALOHA robots to execute the policy. (b) We test on a varied set of objects, including shoes, soda cans, marker pens, knives, spoons, toothbrushes, and toothpaste, with diverse geometry and appearance.
  • Figure 4: Success Rate. Our method was evaluated across eight tasks. (a) The aggregated quantitative results show that our method performs similarly to diffusion policy on seen instances but outperforms all baselines on unseen instances. (b) On seen instances, diffusion policy and our method perform similarly on two simulation tasks. (c) Diffusion policy's performance degrades significantly on unseen instances. In addition, our method outperforms all other baselines, which underscores our policy's capability to attend to geometric details, resolve geometric ambiguities, and generalize to novel instances.
  • Figure 5: Policy Rollout in Real World. The figure illustrates policy rollout results in the real world. On the left, the blue block displays the demonstration examples and corresponding training instances. On the right, the orange block presents policy rollout results; from left to right, these are the initial configurations, diffusion policy, diffusion policy with RGBD, ours without semantics, and our method. We summarize four common failure modes. Early failure and grasping failure can occur when a novel instance is presented. Diffusion policy may also exhibit unsafe behavior when encountering novel instances. Ours without semantics might identify the wrong direction because of geometric ambiguity or subtle geometric details.
  • ...and 3 more figures
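
The semantic-field step described in the Figure 2 caption (cosine similarity between per-point descriptors and reference features, then concatenation with the point cloud) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `semantic_fields` and `build_policy_input`, the `eps` stabilizer, the array shapes, and the random toy data are all assumptions made for illustration, and the sketch omits descriptor extraction, PointNet++, and the diffusion policy itself.

```python
import numpy as np


def semantic_fields(descriptors, references, eps=1e-8):
    """Cosine similarity between point descriptors and reference descriptors.

    descriptors: (N, D) array, one descriptor per 3D point (assumed layout).
    references:  (K, D) array, one descriptor per reference feature.
    Returns an (N, K) array: one semantic-field channel per reference.
    """
    # Normalize rows to unit length; eps (an assumption) guards against zero vectors.
    d = descriptors / (np.linalg.norm(descriptors, axis=1, keepdims=True) + eps)
    r = references / (np.linalg.norm(references, axis=1, keepdims=True) + eps)
    # Dot products of unit vectors are cosine similarities.
    return d @ r.T


def build_policy_input(points, descriptors, references):
    """Concatenate xyz coordinates (N, 3) with K semantic channels -> (N, 3 + K)."""
    fields = semantic_fields(descriptors, references)
    return np.concatenate([points, fields], axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    points = rng.normal(size=(5, 3))         # toy point cloud
    descriptors = rng.normal(size=(5, 16))   # toy per-point descriptors
    references = descriptors[[0, 3]]         # pretend two points serve as references
    features = build_policy_input(points, descriptors, references)
    print(features.shape)  # (5, 5): 3 coordinates + 2 semantic channels per point
```

Each semantic channel lies in [-1, 1] and peaks at points whose descriptors align with the corresponding reference, which is the kind of per-part heatmap described in the Figure 1 caption.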