Self-Supervised Category-Level Articulated Object Pose Estimation with Part-Level SE(3) Equivariance

Xueyi Liu, Ji Zhang, Ruizhen Hu, Haibin Huang, He Wang, Li Yi

TL;DR

This work tackles self-supervised category-level articulated object pose estimation by introducing part-level $SE(3)$-equivariant features via a pose-aware convolution, enabling fine-grained disentanglement of canonical part shapes, object structure, and articulated pose. A shape reconstruction-based self-supervised loss ties the disentangled factors to the observed data, allowing automatic induction of category-aligned canonical spaces and per-part poses. Across complete and partial point clouds from synthetic and real datasets, the method achieves competitive or superior performance compared to supervised baselines and robust segmentation and reconstruction results, demonstrating the feasibility of annotation-free articulated pose understanding. The approach holds promise for reducing annotation burden and improving generalization in robotics and augmented reality tasks requiring articulated object manipulation and interaction.

Abstract

Category-level articulated object pose estimation aims to estimate a hierarchy of articulation-aware object poses of an unseen articulated object from a known category. To reduce the heavy annotations needed for supervised learning methods, we present a novel self-supervised strategy that solves this problem without any human labels. Our key idea is to factorize canonical shapes and articulated object poses from input articulated shapes through part-level equivariant shape analysis. Specifically, we first introduce the concept of part-level SE(3) equivariance and devise a network to learn features with this property. Then, through a carefully designed fine-grained pose-shape disentanglement strategy, we expect canonical spaces that support pose estimation to be induced automatically. We can then predict articulated object poses as per-part rigid transformations describing how parts transform from their canonical part spaces to the camera space. Extensive experiments demonstrate the effectiveness of our method on both complete and partial point clouds from synthetic and real articulated object datasets.
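
To make the notion of "per-part rigid transformations" concrete, here is a minimal sketch of how canonical part shapes could be mapped into the camera space, each by its own rotation and translation. It is an illustration under stated assumptions, not the authors' implementation: the helper names (`rot_z`, `apply_pose`, `assemble`), the two-part toy object, and the chosen poses are all hypothetical, and the network that would actually predict the poses is omitted.

```python
import math

def rot_z(theta):
    """Rotation matrix for an angle theta (radians) about the z-axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def apply_pose(R, t, points):
    """Apply the rigid transformation x -> R x + t to every point."""
    return [
        tuple(sum(R[i][k] * p[k] for k in range(3)) + t[i] for i in range(3))
        for p in points
    ]

def assemble(canonical_parts, part_poses):
    """Map each canonical part into camera space with its own pose, then concatenate."""
    camera_points = []
    for points, (R, t) in zip(canonical_parts, part_poses):
        camera_points.extend(apply_pose(R, t, points))
    return camera_points

# Hypothetical toy object: two unit-length segments, each in its own canonical part space.
canonical_parts = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],  # part 0
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],  # part 1
]

# Hypothetical per-part poses (R, t): part 0 stays in place, part 1 is rotated
# by 90 degrees about z and attached to the end of part 0.
part_poses = [
    (rot_z(0.0), (0.0, 0.0, 0.0)),
    (rot_z(math.pi / 2), (1.0, 0.0, 0.0)),
]

camera_shape = assemble(canonical_parts, part_poses)
```

In a self-supervised setting of the kind described above, a shape assembled this way would be compared against the observed point cloud by a reconstruction loss, so no pose labels are needed; the specific loss is not stated in this section.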

Paper Structure (27 sections, 2 theorems, 8 equations, 11 figures, 11 tables)

This paper contains 27 sections, 2 theorems, 8 equations, 11 figures, 11 tables.

Key Result

Theorem 1

The continuous operation $(\mathcal{F} * h_1)(x_i, g) = \int_{x_j\in \mathbb{R}^{3}} \mathcal{F}(x_j, g\mathbf{R}_i\mathbf{R}_j^{-1})h_1(g(x_i - {P}_i{P}_j^{-1}x_j))$ is invariant to each arbitrary rigid transformation $\Delta {P}_j = (\Delta \mathbf{R}_j \in \text{SO(3)}, \Delta \mathbf{t}_j \in \m

Figures (11)

  • Figure 1: Overview of the proposed self-supervised articulated object pose estimation strategy. The method takes a complete or partial point cloud of an articulated object as input and factorizes it into canonical shapes, object structure, and the articulated object pose. The network is trained by a shape reconstruction task. Left: A high-level abstraction of our pipeline. Right: An illustration of the decomposed information used for shape reconstruction. Green lines ($\leftarrow$) denote the iterative pose estimation process.
  • Figure 2: Visualization for qualitative evaluation. In each pair of rows, the first row shows the results of our method and the second row shows those of NPCS. From left to right, each group of three shapes shows the input point cloud (Input), the reconstruction (Recon.), and the reconstructed canonical object shape (Canon.). We do not assume that input shapes are aligned; they are aligned here only for easier viewing. Please zoom in for details.
  • Figure 3: Overview of the proposed self-supervised articulated object pose estimation strategy. The method takes a complete or partial point cloud of an articulated object as input and factorizes it into canonical shapes, object structure, and the articulated object pose. The network is trained by a shape reconstruction task. Part-level SE(3) equivariant features are learned by iterating between part pose estimation and pose-aware equivariant point convolution. Green lines ($\leftarrow$) denote procedures for feeding the estimated part poses back to the pose-aware point convolution module.
  • Figure 4: Kinematic chain prediction procedure (an example for an object with three parts).
  • Figure 5: The relationship between our three crucial spaces: the canonical part spaces, the canonical object space, and the camera space.
  • ...and 6 more figures

Theorems & Definitions (4)

  • Theorem 1
  • proof
  • Theorem 2
  • proof