Self-Supervised Category-Level Articulated Object Pose Estimation with Part-Level SE(3) Equivariance
Xueyi Liu, Ji Zhang, Ruizhen Hu, Haibin Huang, He Wang, Li Yi
TL;DR
This work tackles self-supervised category-level articulated object pose estimation by introducing part-level SE(3)-equivariant features, learned via a pose-aware convolution, that enable fine-grained disentanglement of canonical part shapes, object structure, and articulated pose. A shape-reconstruction-based self-supervised loss ties the disentangled factors to the observed data, so that category-aligned canonical spaces and per-part poses are induced automatically. On complete and partial point clouds from synthetic and real datasets, the method performs competitively with, or better than, supervised baselines and yields robust segmentation and reconstruction results, demonstrating the feasibility of annotation-free articulated pose understanding. The approach could reduce annotation burden and improve generalization in robotics and augmented reality tasks that require manipulating and interacting with articulated objects.
Abstract
Category-level articulated object pose estimation aims to estimate a hierarchy of articulation-aware object poses for an unseen articulated object from a known category. To reduce the heavy annotation burden of supervised learning methods, we present a novel self-supervised strategy that solves this problem without any human labels. Our key idea is to factorize canonical shapes and articulated object poses from input articulated shapes through part-level equivariant shape analysis. Specifically, we first introduce the concept of part-level SE(3) equivariance and devise a network to learn features with this property. Then, through a carefully designed fine-grained pose-shape disentanglement strategy, we expect canonical spaces that support pose estimation to be induced automatically. We can then predict articulated object poses as per-part rigid transformations describing how parts transform from their canonical part spaces to the camera space. Extensive experiments demonstrate the effectiveness of our method on both complete and partial point clouds from synthetic and real articulated object datasets.
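
To make the last step of the abstract concrete, the following is a minimal sketch of what "per-part rigid transformations from canonical part spaces to the camera space" could look like in code. It is an illustration only, not the authors' implementation: the helper names `rotation_z` and `apply_part_poses`, the two-part toy object, and the specific rotations and translations are all assumptions introduced here.

```python
import numpy as np

def rotation_z(theta):
    """Rotation matrix for an angle theta (radians) about the z-axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])

def apply_part_poses(canonical_parts, part_poses):
    """Map each canonical part (N_i x 3 array) to camera space.

    Every part has its own rigid transform (R, t); a point x in the
    canonical part space is sent to R @ x + t in camera space.
    """
    return [pts @ R.T + t for pts, (R, t) in zip(canonical_parts, part_poses)]

# Hypothetical two-part object with random canonical part points.
rng = np.random.default_rng(0)
canonical_parts = [rng.normal(size=(5, 3)), rng.normal(size=(4, 3))]

# Hypothetical per-part poses: each part moves independently of the other.
part_poses = [
    (rotation_z(0.3), np.array([0.1, 0.0, 0.5])),
    (rotation_z(1.2), np.array([-0.2, 0.4, 0.5])),
]

camera_parts = apply_part_poses(canonical_parts, part_poses)

# Inverting a part's pose recovers its canonical points.
recovered = [(cam - t) @ R for cam, (R, t) in zip(camera_parts, part_poses)]
```

In this sketch the poses are given; in the self-supervised setting described above they would be predicted by the network, with the canonical part shapes recovered jointly through the reconstruction objective.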
