CAPT: Category-level Articulation Estimation from a Single Point Cloud Using Transformer
Lian Fu, Ryoichi Ishikawa, Yoshihiro Sato, Takeshi Oishi
TL;DR
CAPT addresses category-level articulation estimation from a single point cloud by employing an end-to-end Transformer-based framework with a four-stage encoder and multi-branch decoders. It introduces a motion loss to recover dynamic features and a coarse-to-fine double voting scheme to robustly estimate joint parameters, achieving state-of-the-art performance on synthetic category datasets. The method optimizes a composite loss that combines segmentation, direction, pivot, distance, state, and motion terms, and demonstrates strong generalization to real-world scenes in sim-to-real experiments. This work showcases the viability of Transformer architectures for articulated object analysis and control, enabling accurate, end-to-end estimation without post-optimization or multi-stage pipelines.
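As a rough illustration of the composite objective named above, the six terms can be combined as a weighted sum. This is a minimal sketch and not the authors' implementation: the function name `composite_loss`, the dictionary interface, and the uniform default weights are assumptions, since the summary does not state how the terms are balanced.

```python
# The six term names come from the summary above; everything else is hypothetical.
LOSS_TERMS = ("segmentation", "direction", "pivot", "distance", "state", "motion")


def composite_loss(terms, weights=None):
    """Return the weighted sum of the six named loss terms.

    terms   -- dict mapping each name in LOSS_TERMS to a scalar loss value
    weights -- optional dict of per-term weights (assumed uniform if omitted)
    """
    if weights is None:
        # Assumption: equal weighting; the actual balance is not given in the text.
        weights = {name: 1.0 for name in LOSS_TERMS}
    missing = [name for name in LOSS_TERMS if name not in terms]
    if missing:
        raise KeyError(f"missing loss terms: {missing}")
    return sum(weights[name] * terms[name] for name in LOSS_TERMS)
```

In practice each entry of `terms` would be a differentiable scalar produced by one decoder branch, and the weights would be tuned per category.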
Abstract
The ability to estimate joint parameters is essential for various applications in robotics and computer vision. In this paper, we propose CAPT: category-level articulation estimation from a point cloud using Transformer. CAPT uses an end-to-end Transformer-based architecture to estimate the joint parameters and states of articulated objects from a single point cloud. The proposed method estimates joint parameters and states for various articulated objects with high precision and robustness. We also introduce a motion loss, which improves articulation estimation performance by emphasizing the dynamic features of articulated objects. Additionally, we present a double voting strategy that gives the framework coarse-to-fine parameter estimation. Experimental results on several category datasets demonstrate that our method outperforms existing alternatives for articulation estimation. Our research provides a promising solution for applying Transformer-based architectures in articulated object analysis.
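The coarse-to-fine idea behind double voting can be sketched generically as follows. This is only one plausible reading, not the scheme used in CAPT: it assumes each point casts a vote for a joint pivot by adding a predicted offset to its position, takes the mean of all votes as the coarse estimate, and then averages only the votes within a radius of that estimate to obtain the fine one. The function name `double_vote` and the `radius` parameter are hypothetical.

```python
import numpy as np


def double_vote(points, offsets, radius=0.1):
    """Two-pass voting for a 3D pivot location (illustrative assumption only).

    points  -- (N, 3) array of point positions
    offsets -- (N, 3) array of per-point predicted offsets to the pivot
    radius  -- hypothetical inlier radius used in the second pass
    """
    votes = np.asarray(points, dtype=float) + np.asarray(offsets, dtype=float)
    # Pass 1 (coarse): average every vote.
    coarse = votes.mean(axis=0)
    # Pass 2 (fine): keep only votes close to the coarse estimate, then re-average.
    dist = np.linalg.norm(votes - coarse, axis=1)
    inliers = dist <= radius
    if not inliers.any():
        # No vote falls inside the radius; fall back to the coarse estimate.
        return coarse, coarse
    fine = votes[inliers].mean(axis=0)
    return coarse, fine
```

The second pass discards votes far from the consensus, so a few badly predicted points have less influence on the final estimate than they would under a single average.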
