OP-Align: Object-level and Part-level Alignment for Self-supervised Category-level Articulated Object Pose Estimation

Yuchen Che, Ryo Furukawa, Asako Kanezaki

TL;DR

This work tackles category-level articulated pose estimation from a single-frame point cloud without pose or shape annotations. It introduces OP-Align, a self-supervised framework that first aligns the whole object to a canonical reconstruction at the object level and then aligns each part through joint-aware transformations, enabling real-time inference. A new real-world RGB-D dataset supports evaluation in practical settings, and experiments show state-of-the-art performance among self-supervised methods with competitive results versus supervised baselines on both synthetic and real data. The approach reduces pose variance via object-level alignment and leverages part-level alignment to capture joint movement, offering robust articulated pose estimation without costly annotations.

Abstract

Category-level articulated object pose estimation focuses on the pose estimation of unknown articulated objects within known categories. Despite its significance, this task remains challenging due to the varying shapes and poses of objects, expensive dataset annotation costs, and complex real-world environments. In this paper, we propose a novel self-supervised approach that leverages a single-frame point cloud to solve this task. Our model consistently generates a reconstruction with a canonical pose and joint state for the entire input object, and it estimates object-level poses that reduce overall pose variance and part-level poses that align each part of the input with its corresponding part of the reconstruction. Experimental results demonstrate that our approach significantly outperforms previous self-supervised methods and is comparable to state-of-the-art supervised methods. To assess the performance of our model in real-world scenarios, we also introduce a new real-world articulated object benchmark dataset.
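
The two alignment stages named in the abstract can be illustrated geometrically with a small sketch. This is not the authors' implementation: the function names, the hard part labels (the model predicts soft segmentation probabilities), and the hand-supplied poses and joint parameters are all assumptions made for illustration. In the actual method these quantities are predicted by networks.

```python
import math

def rodrigues(axis, angle):
    """Rotation matrix for `angle` radians about `axis` (Rodrigues' formula)."""
    n = math.sqrt(sum(v * v for v in axis))
    x, y, z = (v / n for v in axis)
    c, s = math.cos(angle), math.sin(angle)
    k = 1.0 - c
    return [
        [c + x * x * k, x * y * k - z * s, x * z * k + y * s],
        [y * x * k + z * s, c + y * y * k, y * z * k - x * s],
        [z * x * k - y * s, z * y * k + x * s, c + z * z * k],
    ]

def matvec(R, p):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(R[i][j] * p[j] for j in range(3)) for i in range(3)]

def object_level_align(points, R_o, t_o):
    """Object-level step: one rigid transform R_o p + t_o for all points."""
    return [[a + b for a, b in zip(matvec(R_o, p), t_o)] for p in points]

def revolute_move(p, pivot, direction, angle):
    """Rotate p by `angle` about the joint axis passing through `pivot`."""
    rel = [a - b for a, b in zip(p, pivot)]
    rotated = matvec(rodrigues(direction, angle), rel)
    return [a + b for a, b in zip(rotated, pivot)]

def prismatic_move(p, direction, displacement):
    """Translate p by `displacement` along the unit joint direction."""
    n = math.sqrt(sum(v * v for v in direction))
    return [a + displacement * d / n for a, d in zip(p, direction)]

def part_level_align(points, labels, joints):
    """Part-level step (hypothetical interface).

    `joints` maps a part label to (kind, pivot, direction, state).
    Points whose label has no joint entry are treated as the base part
    and left unchanged.
    """
    aligned = []
    for p, label in zip(points, labels):
        joint = joints.get(label)
        if joint is None:
            aligned.append(list(p))
            continue
        kind, pivot, direction, state = joint
        if kind == "revolute":
            aligned.append(revolute_move(p, pivot, direction, state))
        else:
            aligned.append(prismatic_move(p, direction, state))
    return aligned
```

For example, with an identity object-level pose and a revolute joint about the z-axis through the origin, a point of the moving part at (1, 0, 0) rotated by a quarter turn lands at approximately (0, 1, 0), while base-part points stay where the object-level step placed them.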

Paper Structure

This paper contains 18 sections, 11 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Overview of works on self-supervised category-level articulated object pose estimation.
  • Figure 2: Pipeline of OP-Align. At the object-level phase, for the input point cloud $\mathbf{X}$, we use the E2PN backbone to predict and select the object-level pose $\mathbf{R}_{\mathrm{o}}, \mathbf{t}_{\mathrm{o}}$ from pose candidates, and generate the canonical reconstruction $\mathbf{Y}$ by adding a learnable parameter called the category-common base shape $\mathbf{Y}_{\mathrm{base}}$. At the part-level phase, two PointNets with shared weights predict the part segmentation probabilities $\mathbf{W}_{\mathrm{x}}, \mathbf{W}_{\mathrm{y}}$, joint states $\mathbf{a}_{\mathrm{x}}, \mathbf{a}_{\mathrm{y}}$, joint pivots $\mathbf{c}_{\mathrm{x}}, \mathbf{c}_{\mathrm{y}}$, and joint directions $\mathbf{d}_{\mathrm{x}}, \mathbf{d}_{\mathrm{y}}$ for the object-level aligned input $\mathbf{R}_{\mathrm{o}}\mathbf{X} + \mathbf{t}_{\mathrm{o}}$ and the reconstruction $\mathbf{Y}$, to generate the part-level alignment $\mathbf{R}_{\mathrm{d}}, \mathbf{R}_{\mathrm{a}}, \mathbf{T}_{\mathrm{a}}$ that aligns each part of $\mathbf{X}$ to the corresponding part of $\mathbf{Y}$ as part-level aligned inputs $\mathbf{Z}$.
  • Figure 3: Illustration of the object-level alignment, part-level alignment, and the reconstruction of two inputs (a) and (b). Object-level alignment aligns the inputs with the canonical reconstructions holistically. Part-level alignment simulates joint movement to align each part. The category-common base shape remains consistent for all inputs, and the canonical reconstruction further fits the shape details of each input.
  • Figure 4: Illustration of joint direction alignment $\mathbf{R}_{\mathrm{d}}$, joint state alignment $\mathbf{R}_{\mathrm{a}}$ that simulates revolute joint movement, and $\mathbf{t}_{\mathrm{a}}$ that simulates prismatic joint movement.
  • Figure 5: Example of an object point cloud in the real-world dataset. We use RGB-D images and object segmentation masks to back-project the object point cloud.
  • ...and 4 more figures
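
The back-projection mentioned in the Figure 5 caption can be sketched with the standard pinhole camera model. This is a minimal illustration, not the dataset's actual preprocessing code: the function name `back_project`, the intrinsics `fx`, `fy`, `cx`, `cy`, the nested-list image layout, and the convention that a depth of zero marks a missing measurement are all assumptions.

```python
def back_project(depth, mask, fx, fy, cx, cy):
    """Back-project masked depth pixels to 3D camera-frame points.

    depth: rows of depth values (assumed metric; 0 means no measurement).
    mask:  rows of booleans, True where the pixel belongs to the object.
    fx, fy, cx, cy: pinhole intrinsics (focal lengths and principal point).
    """
    points = []
    for v, (depth_row, mask_row) in enumerate(zip(depth, mask)):
        for u, (z, keep) in enumerate(zip(depth_row, mask_row)):
            if not keep or z <= 0:
                continue  # skip background and invalid depth
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points
```

Only pixels inside the segmentation mask contribute points, so the result is a point cloud of the object alone, which matches the single-frame input the method expects.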