SPLIT: SE(3)-diffusion via Local Geometry-based Score Prediction for 3D Scene-to-Pose-Set Matching Problems

Kanghyun Kim, Min Jun Kim

Abstract

To enable versatile robot manipulation, robots must detect task-relevant poses for different purposes from raw scenes. Currently, many perception algorithms are designed for specific purposes, which limits the flexibility of the perception module. We present a general problem formulation called 3D scene-to-pose-set matching, which directly matches the corresponding poses from the scene without relying on task-specific heuristics. To address this problem, we introduce SPLIT, an SE(3)-diffusion model for generating pose samples from a scene. The model's efficiency comes from predicting scores based on the local geometry around the sample pose. Moreover, leveraging the conditional generation capability of diffusion models, we demonstrate that a single SPLIT model can generate the multi-purpose poses required to complete both mug reorientation and hanging manipulation.

Paper Structure

This paper contains 18 sections, 13 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Multiple pose descriptors and grasp detections are required to execute the mug reorientation and hanging task. The robot must determine a stable pose for upright placement, the handle direction for hanging, and grasp candidates for picking. The figure illustrates the subgoal configurations and corresponding poses needed to solve the task: (a) pick-reorient, (b) pick-hang.
  • Figure 2: A visual explanation of the spatial locality assumption of our problem. For example, in a grasp detection task, the success probability at a given pose is induced by the local geometric context surrounding it.
  • Figure 3: The network architecture of SPLIT. To extract the local geometric context ${}^{\textit{H}}\bm{z}$ from the multi-scale feature grids $\textit{Z}$, a point kernel is transformed by the sample pose to interpolate the local features at the points.
  • Figure 4: Packed and pile scenes from Breyer et al. (2021), along with grasp generation results. A point cloud is obtained from a single-view depth image and converted into an occupancy grid for input.
  • Figure 5: Two real-world examples of multi-purpose pose detection, showing the poses with the highest probability. (a) The mug is positioned in the canonical pose. (b) Even though the mug’s handle is not visible in the point cloud, SPLIT infers implicit information (i.e., the handle is located on the hidden side in the stable pose) from the scene-pose dataset.
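
The local-feature extraction described in the Figure 3 caption can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`transform_kernel`, `trilinear_sample`, `local_context`), the unit-cube scene coordinates, and the use of plain trilinear interpolation are all assumptions made for illustration. It only shows the geometric idea of moving a fixed point kernel by a sample pose $\textit{H}$ and reading features from each grid in $\textit{Z}$ at the transformed points.

```python
import numpy as np

def transform_kernel(H, kernel):
    """Apply a 4x4 homogeneous pose H to an (N, 3) array of kernel points."""
    R, t = H[:3, :3], H[:3, 3]
    return kernel @ R.T + t

def trilinear_sample(grid, pts):
    """Trilinearly interpolate a (C, Dx, Dy, Dz) grid at (N, 3) points.

    Points are given in voxel-index coordinates and clamped to the grid.
    Each spatial dimension must have at least 2 voxels.
    """
    res = np.array(grid.shape[1:])
    pts = np.clip(pts, 0.0, res - 1.0)
    i0 = np.minimum(np.floor(pts).astype(int), res - 2)  # lower corner index
    f = pts - i0                                         # offsets in [0, 1]
    out = np.zeros((pts.shape[0], grid.shape[0]))
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                wx = f[:, 0] if dx else 1.0 - f[:, 0]
                wy = f[:, 1] if dy else 1.0 - f[:, 1]
                wz = f[:, 2] if dz else 1.0 - f[:, 2]
                corner = grid[:, i0[:, 0] + dx, i0[:, 1] + dy, i0[:, 2] + dz]
                out += (wx * wy * wz)[:, None] * corner.T  # corner is (C, N)
    return out

def local_context(grids, H, kernel):
    """Concatenate features sampled from every grid at the posed kernel points.

    Assumption: every grid spans the same unit cube [0, 1]^3 of the scene.
    """
    pts = transform_kernel(H, kernel)
    feats = []
    for g in grids:
        res = np.array(g.shape[1:])
        feats.append(trilinear_sample(g, pts * (res - 1)))
    return np.concatenate(feats, axis=1)  # shape (N, total channels)

# Hypothetical usage: two feature grids at different resolutions.
rng = np.random.default_rng(0)
grids = [rng.normal(size=(4, 8, 8, 8)), rng.normal(size=(2, 4, 4, 4))]
H = np.eye(4)
H[:3, 3] = [0.5, 0.5, 0.5]  # sample pose at the scene centre
kernel = np.array([[0.0, 0.0, 0.0], [0.05, 0.0, 0.0], [0.0, 0.05, 0.0]])
z = local_context(grids, H, kernel)  # shape (3, 6)
```

Because the sampled features depend only on the grid values near the posed kernel points, the cost per pose sample is independent of the scene size, which is consistent with the efficiency argument in the abstract. How the resulting features are turned into a score is left out of this sketch.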