Diffusion-Driven Self-Supervised Learning for Shape Reconstruction and Pose Estimation

Jingtao Sun, Yaonan Wang, Mingtao Feng, Chao Ding, Mike Zheng Shou, Ajmal Saeed Mian

TL;DR

This work tackles category-level 6-DoF pose estimation and multi-object 3D shape reconstruction without ground-truth poses or CAD models by leveraging only shape priors. It introduces a diffusion-driven self-supervised framework built around an SE(3)-equivariant Prior-Aware Pyramid 3D Point Transformer that learns pose-aware features and 3D scale-invariant shapes. A two-phase Pretrain-to-Refine training paradigm steers learning from priors to observations, addressing intra-class variation through diffusion guidance. Extensive experiments on REAL275, CAMERA25, Wild6D, YCB-Video and a dynamic dataset show state-of-the-art performance in self-supervised category-level pose estimation and competitive results against fully-supervised baselines, with improved 3D shape reconstruction. The approach reduces labeling costs and demonstrates strong cross-dataset generalization, suggesting meaningful impact for robotics and AR applications.

Abstract

Fully-supervised category-level pose estimation aims to determine the 6-DoF poses of unseen instances from known categories, but it requires expensive manual labeling. Recently, various self-supervised category-level pose estimation methods have been proposed to reduce the reliance on annotated datasets. However, most methods rely on synthetic data or 3D CAD models for self-supervised training, and they are typically limited to single-object pose problems without considering multi-object tasks or shape reconstruction. To overcome these challenges and limitations, we introduce a diffusion-driven self-supervised network for multi-object shape reconstruction and categorical pose estimation that leverages only shape priors. Specifically, to capture SE(3)-equivariant pose features and 3D scale-invariant shape information, we present a Prior-Aware Pyramid 3D Point Transformer in our network. This module adopts a point convolutional layer with radial kernels for pose-aware learning and a 3D scale-invariant graph convolution layer for object-level shape representation. Furthermore, we introduce a pretrain-to-refine self-supervised training paradigm to train our network. It enables the proposed network to capture the associations between shape priors and observations, addressing the challenge of intra-class shape variations by utilising the diffusion mechanism. Extensive experiments conducted on four public datasets and a self-built dataset demonstrate that our method significantly outperforms state-of-the-art self-supervised category-level baselines and even surpasses some fully-supervised instance-level and category-level methods.
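
The two properties named in the abstract, SE(3)-equivariance and 3D scale-invariance, can be made concrete with a small numeric check. The sketch below is only an illustration of what the properties mean and is not the paper's radial-kernel convolution or graph convolution: `centroid` and `normalized_radii` are hypothetical toy features chosen because their behaviour is easy to verify, and the rotation is restricted to the z-axis for brevity.

```python
import math
import random

def rotate_z(p, theta):
    """Rotate a 3D point about the z-axis by theta radians."""
    x, y, z = p
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y, s * x + c * y, z)

def transform(points, theta, t, scale=1.0):
    """Scale each point, rotate it about z, then translate it by t."""
    out = []
    for p in points:
        x, y, z = rotate_z((scale * p[0], scale * p[1], scale * p[2]), theta)
        out.append((x + t[0], y + t[1], z + t[2]))
    return out

def centroid(points):
    """Toy SE(3)-equivariant feature (assumption): the mean point."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))

def normalized_radii(points):
    """Toy scale-invariant descriptor (assumption): distances to the
    centroid divided by their mean, so a global scale cancels out."""
    c = centroid(points)
    radii = [math.dist(p, c) for p in points]
    mean_r = sum(radii) / len(radii)
    return [r / mean_r for r in radii]

random.seed(0)
cloud = [tuple(random.uniform(-1.0, 1.0) for _ in range(3)) for _ in range(64)]
theta, t = 0.7, (0.3, -1.2, 2.0)

# Equivariance: transforming the input transforms the feature the same way.
lhs = centroid(transform(cloud, theta, t))
rhs = transform([centroid(cloud)], theta, t)[0]
equivariant = all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(lhs, rhs))

# Scale invariance: the descriptor is unchanged when the cloud is rescaled.
d0 = normalized_radii(cloud)
d1 = normalized_radii(transform(cloud, theta, t, scale=3.5))
scale_invariant = all(math.isclose(a, b, rel_tol=1e-7) for a, b in zip(d0, d1))

print(equivariant, scale_invariant)
```

An equivariant feature moves with the object, which is what makes it useful for pose, while an invariant descriptor ignores the object's size, which is what makes it useful for comparing an observation against a canonical shape prior.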

Paper Structure (26 sections, 30 equations, 9 figures, 11 tables, 2 algorithms)

Figures (9)

  • Figure 1: Overview. Unlike most existing NOCS-map paradigms, which employ a one-shot pipeline to normalize objects into a 3D NOCS space contained within a unit cube and align their centers or orientations within the same category, our proposed pretrain-to-refine paradigm adopts a two-phase strategy. We first train a basic network model using the prior shapes, and then fine-tune this pre-trained base model under the guidance of the shape/observation latent representations. The entire process is implemented in a self-supervised manner driven by the diffusion mechanism.
  • Figure 2: Illustration of our network and the pretrain-to-refine self-supervised training paradigm. Our network consists of three components and takes only the observable point cloud ${P_0}$ in the current frame and the corresponding shape priors ${P_r}$ as input. The Prior-Aware Pyramid 3D Point Transformer serves as the core network framework for self-supervised learning. During the pre-training phase, we first use a simplified version of our Prior-Aware Pyramid 3D Point Transformer with our ORT Deconv to establish a rough pre-trained base model, guided by the prior shapes. After that, using the shape/observation latent embedding $f$, we fine-tune this pre-trained model into a comprehensive reinforced model in the subsequent refinement phase. Ultimately, the reinforced model is employed to determine the 6-DoF poses, 3D scales and finer canonical shapes.
  • Figure 3: Illustration of the ORT in our proposed Prior-Aware Pyramid 3D Point Transformer. (a) The proposed SE(3) block can encode shape-observation similarity based on the proposed SE(3)-equivariant and 3D scale-invariant learning. (b) The details of our proposed 3D Transformer with multi-head attention, where $h$ is the number of heads. (c) The process of the point convolution layer with radial kernels, in which the kernel points are generated by a radial mapping on the surface of the sphere domain. (d) The 3D scale-invariant graph convolution layer, which differs from other graph convolutions by achieving scale invariance through a graph formed from the central point and all radial kernel points.
  • Figure 4: The directed graphical model of the diffusion process for canonical shape reconstruction. $x_i^{(T)}$ and ${P_r}$ are the initial noise points and the shape prior points, respectively. The noise points $x_i^{(t)}$ sampled at each timestep contain $N$ points. A related video can be found on the project page.
  • Figure 5: Comparison of mAP on the CAMERA25 and REAL275 datasets. Mean average precision (mAP) of our method and the baseline for various 3D IoU, rotation and translation thresholds on the CAMERA25 and REAL275 datasets. The upper row, (a) and (b), shows the results of NOCS, all taken from the original paper wang2019normalized; the bottom row, (c) and (d), shows ours.
  • ...and 4 more figures
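
The Figure 4 caption refers to noise points $x_i^{(T)}$ and per-timestep samples $x_i^{(t)}$ of $N$ points each. As a rough aid to reading that notation, the sketch below shows the standard DDPM forward (noising) step applied to a point cloud, $x^{(t)} = \sqrt{\bar\alpha_t}\,x^{(0)} + \sqrt{1-\bar\alpha_t}\,\epsilon$. It is an assumption that the paper follows this standard formulation; the linear schedule, its endpoints, and all function names here are illustrative choices, and the prior-conditioned reverse process of the paper is not modelled.

```python
import math
import random

def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Linear variance schedule (a common DDPM default, assumed here)."""
    if T < 2:
        raise ValueError("T must be at least 2")
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]

def alpha_bars(betas):
    """Cumulative products of (1 - beta_t), one value per timestep."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def q_sample(x0, t, abar, rng):
    """Draw x^(t) given a clean cloud x0 (list of xyz tuples); t is 1-indexed."""
    a = abar[t - 1]
    signal, noise = math.sqrt(a), math.sqrt(1.0 - a)
    return [tuple(signal * c + noise * rng.gauss(0.0, 1.0) for c in p) for p in x0]

def unit_sphere_point(rng):
    """Random point on the unit sphere, used as a stand-in canonical shape."""
    v = [rng.gauss(0.0, 1.0) for _ in range(3)]
    norm = math.sqrt(sum(c * c for c in v))
    return tuple(c / norm for c in v)

rng = random.Random(0)
N, T = 128, 1000
x0 = [unit_sphere_point(rng) for _ in range(N)]
abar = alpha_bars(linear_betas(T))

x_mid = q_sample(x0, T // 2, abar, rng)  # partially noised cloud
x_T = q_sample(x0, T, abar, rng)         # close to pure Gaussian noise
print(len(x_T), abar[T // 2 - 1], abar[-1])
```

Every noised cloud keeps the same $N$ points as the input, and the remaining signal weight $\bar\alpha_t$ shrinks toward zero as $t$ approaches $T$, which is why sampling can start from pure noise.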