CAPT: Category-level Articulation Estimation from a Single Point Cloud Using Transformer

Lian Fu; Ryoichi Ishikawa; Yoshihiro Sato; Takeshi Oishi

TL;DR

CAPT addresses category-level articulation estimation from a single point cloud by employing an end-to-end Transformer-based framework with a four-stage encoder and multi-branch decoders. It introduces a motion loss to recover dynamic features and a coarse-to-fine double voting scheme to robustly estimate joint parameters, achieving state-of-the-art performance on synthetic category datasets. The method optimizes a composite loss that combines segmentation, direction, pivot, distance, state, and motion terms, and demonstrates strong generalization to real-world scenes in sim-to-real experiments. This work showcases the viability of Transformer architectures for articulated object analysis and control, enabling accurate, end-to-end estimation without post-optimization or multi-stage pipelines.
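
The composite loss named above can be written schematically as a weighted sum of its six terms. The weights $\lambda$ are an assumption made for illustration; this summary does not state whether or how the individual terms are weighted.

$$
\mathcal{L}_{\mathrm{total}} = \lambda_{\mathrm{seg}}\mathcal{L}_{\mathrm{seg}} + \lambda_{\mathrm{dir}}\mathcal{L}_{\mathrm{dir}} + \lambda_{\mathrm{piv}}\mathcal{L}_{\mathrm{piv}} + \lambda_{\mathrm{dist}}\mathcal{L}_{\mathrm{dist}} + \lambda_{\mathrm{state}}\mathcal{L}_{\mathrm{state}} + \lambda_{\mathrm{mot}}\mathcal{L}_{\mathrm{mot}}
$$

Setting every $\lambda$ to 1 recovers a plain unweighted sum.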

Abstract

The ability to estimate joint parameters is essential for various applications in robotics and computer vision. In this paper, we propose CAPT: category-level articulation estimation from a point cloud using Transformer. CAPT uses an end-to-end Transformer-based architecture to estimate the joint parameters and states of articulated objects from a single point cloud, and it does so accurately and robustly across various articulated objects. We also introduce a motion loss, which improves articulation estimation by emphasizing the dynamic features of articulated objects, and a double voting strategy, which gives the framework coarse-to-fine parameter estimation. Experimental results on several category datasets demonstrate that our method outperforms existing alternatives for articulation estimation. Our research provides a promising solution for applying Transformer-based architectures to articulated object analysis.
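
To give a feel for what coarse-to-fine voting can look like, here is a minimal sketch of a generic two-pass vote over per-point joint-direction predictions. It is not the authors' double voting scheme, whose details are not given in this summary: the inlier test, the 20-degree threshold, and all function names are assumptions made for illustration.

```python
import math


def normalize(v):
    """Scale a 3D vector to unit length (assumes a non-zero vector)."""
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]


def mean_direction(votes):
    """Average a list of direction votes and renormalize the result."""
    summed = [sum(v[i] for v in votes) for i in range(3)]
    return normalize(summed)


def two_pass_vote(votes, max_angle_deg=20.0):
    """Hypothetical coarse-to-fine vote.

    Pass 1 (coarse): average every per-point vote.
    Pass 2 (fine): re-average only the votes that lie within
    `max_angle_deg` of the coarse estimate.
    """
    unit_votes = [normalize(v) for v in votes]
    coarse = mean_direction(unit_votes)
    cos_limit = math.cos(math.radians(max_angle_deg))
    inliers = [
        v for v in unit_votes
        if sum(a * b for a, b in zip(v, coarse)) >= cos_limit
    ]
    fine = mean_direction(inliers) if inliers else coarse
    return coarse, fine
```

With eight votes along the z-axis and two stray votes along the x-axis, the coarse estimate is tilted toward x, while the fine estimate discards the stray votes and returns the z-axis.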

Paper Structure (20 sections, 17 equations, 6 figures, 2 tables)

Introduction
Related Work
Articulation Estimation
Transformer for Point Clouds
Category-level Articulation Estimation Framework
Problem Formulation
Input embedding and Encoder
Multi-branch Decoders
Loss Design and Optimization
Motion Loss
Total Loss
Double Voting
Experiments
Datasets
Baselines
...and 5 more sections

Figures (6)

Figure 1: Articulation estimation aims to estimate joint parameters and states from visual information. In our case, we propose to infer from only a single static point cloud. This task could be applied in virtual/augmented reality, robot interaction, etc.
Figure 2: Our CAPT (category-level articulation estimation from a point cloud using Transformer) architecture. $n$ is the number of points, $d$ is the original feature dimension of each point, and $d_\mathrm{e}$ is the embedded feature dimension. In the output, as in all figures in this paper, the red arrow represents the predicted joint, while the green arrow represents the ground truth joint.
Figure 3: Diagram of the motion loss calculation. The motion loss of the $k$-th joint is calculated in two steps. (1) Move: move the part point cloud $P_k$ along the predicted joint $\hat{J}_k$ and the ground truth joint $J_k$ to obtain rotated point clouds $\hat{P}'_k$ and $P'_k$, respectively. (2) Compare: obtain the motion loss by comparing $\hat{P}'_k$ and $P'_k$. The total motion loss is the sum of the motion losses of all joints.
Figure 4: Comparisons between (a) ANCSH, (b) PCT, (c) CAPT-plain (CAPT without double voting and motion loss), (d) CAPT without double voting, and (e) CAPT with double voting. Here the object has thin arms, which naive PCT struggles to handle. In contrast, our method predicted the joint parameters with relatively high accuracy whether or not double voting was used, and double voting improved the result further.
Figure 5: Direct sim-to-real result. (a) Real scene, (b) extracted point cloud, (c) naive PCT, (d) without motion loss, (e) without double voting, and (f) our method. The results indicate that CAPT successfully captured the category features of noisy real-world articulated objects despite being trained only on a synthetic dataset.
...and 1 more figures
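
The two steps in the Figure 3 caption (move, then compare) can be sketched as follows. This is a minimal, dependency-free illustration, not the authors' implementation. It assumes a revolute joint described by a pivot and a direction, a fixed test angle, and mean Euclidean distance as the comparison; none of these specifics, nor the function names, come from the paper.

```python
import math


def rotate_about_axis(point, pivot, direction, angle):
    """Rotate a 3D point by `angle` radians about the line that passes
    through `pivot` along `direction` (Rodrigues' rotation formula)."""
    norm = math.sqrt(sum(c * c for c in direction))
    u = [c / norm for c in direction]
    v = [p - q for p, q in zip(point, pivot)]
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    dot = sum(a * b for a, b in zip(u, v))
    cross = [
        u[1] * v[2] - u[2] * v[1],
        u[2] * v[0] - u[0] * v[2],
        u[0] * v[1] - u[1] * v[0],
    ]
    rotated = [
        v[i] * cos_a + cross[i] * sin_a + u[i] * dot * (1.0 - cos_a)
        for i in range(3)
    ]
    return [r + q for r, q in zip(rotated, pivot)]


def motion_loss(part_points, pred_joint, gt_joint, angle=math.pi / 6):
    """Step 1 (move): rotate the part about the predicted and the
    ground-truth joint. Step 2 (compare): mean point-to-point distance.
    The angle and the distance metric are illustrative assumptions."""
    total = 0.0
    for p in part_points:
        moved_pred = rotate_about_axis(
            p, pred_joint["pivot"], pred_joint["direction"], angle)
        moved_gt = rotate_about_axis(
            p, gt_joint["pivot"], gt_joint["direction"], angle)
        total += math.dist(moved_pred, moved_gt)
    return total / len(part_points)


def total_motion_loss(parts, pred_joints, gt_joints, angle=math.pi / 6):
    """Sum the per-joint motion losses, as the caption describes."""
    return sum(
        motion_loss(part, pred, gt, angle)
        for part, pred, gt in zip(parts, pred_joints, gt_joints)
    )
```

When the predicted joint equals the ground truth, both moved point clouds coincide and the loss is zero. An error in either the pivot or the direction displaces the moved points and raises the loss, so a single term penalizes both kinds of error.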
