DICArt: Advancing Category-level Articulated Object Pose Estimation in Discrete State-Spaces
Li Zhang, Mingyu Mei, Ailing Wang, Xianhui Meng, Yan Zhong, Xinyuan Song, Liu Liu, Rujing Wang, Zaixing He, Cewu Lu
TL;DR
DICArt addresses category-level articulated pose estimation by casting pose inference as a conditional discrete diffusion process over per-part pose tokens, incorporating a Flowing Mechanism and a Flexible Flow Decider to enable gentle, adaptive denoising. It further enforces physical plausibility through hierarchical kinematic coupling, distinguishing Parent and Child parts and predicting joint axes with orthogonality constraints. The method discretizes pose into token bins, uses a block-diagonal transition matrix to maintain token-type consistency, and augments the state space with a [MASK] token to improve stability. Experiments on synthetic, semi-synthetic, and real-world datasets demonstrate state-of-the-art or near state-of-the-art performance, with strong generalization across domains and robustness to self-occlusion. Overall, DICArt introduces a discrete-diffusion framework that integrates structural priors to advance reliable category-level 6D pose estimation for articulated objects.
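To make the token-bin and block-diagonal ideas concrete, the following is a minimal sketch and not the authors' implementation. It assumes uniform binning over a fixed range and a VQ-Diffusion-style transition (keep, resample within the same token type, or jump to [MASK]). The function names `quantize`, `dequantize`, and `block_transition`, and the parameters `keep` and `mask`, are hypothetical and chosen only for illustration.

```python
import numpy as np


def quantize(values, low, high, num_bins):
    """Map continuous values in [low, high] to integer bin indices (assumed uniform bins)."""
    values = np.clip(values, low, high)
    scaled = (values - low) / (high - low) * num_bins
    # The upper edge `high` would land in bin `num_bins`, so clamp it to the last bin.
    return np.minimum(scaled.astype(np.int64), num_bins - 1)


def dequantize(tokens, low, high, num_bins):
    """Map bin indices back to the centre of each bin."""
    width = (high - low) / num_bins
    return low + (np.asarray(tokens) + 0.5) * width


def block_transition(block_sizes, keep, mask):
    """Row-stochastic one-step transition with one block per token type.

    Q[i, j] = P(next state = j | current state = i). A token keeps its value
    with extra probability `keep`, is resampled uniformly inside its own block
    with total probability 1 - keep - mask, and jumps to the final [MASK]
    state with probability `mask`. Cross-block entries stay zero, so a token
    never changes type. [MASK] is absorbing. (Illustrative assumption only.)
    """
    total = sum(block_sizes)
    Q = np.zeros((total + 1, total + 1))
    start = 0
    for size in block_sizes:
        resample = (1.0 - keep - mask) / size
        block = np.full((size, size), resample)
        block[np.diag_indices(size)] += keep
        Q[start:start + size, start:start + size] = block
        Q[start:start + size, total] = mask  # last column is [MASK]
        start += size
    Q[total, total] = 1.0
    return Q


# Example: 3 hypothetical rotation bins and 2 hypothetical translation bins.
Q = block_transition([3, 2], keep=0.8, mask=0.1)
tokens = quantize(np.array([-1.0, 0.0, 0.999]), -1.0, 1.0, 4)
```

In this sketch every row of `Q` sums to one, and the zero off-diagonal blocks are what keep, for instance, a rotation token from ever being corrupted into a translation token during the forward process.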
Abstract
Articulated object pose estimation is a core task in embodied AI. Existing methods typically regress poses in a continuous space, but they often 1) struggle to navigate a large, complex search space and 2) fail to incorporate intrinsic kinematic constraints. In this work, we introduce DICArt (DIsCrete Diffusion for Articulation Pose Estimation), a novel framework that formulates pose estimation as a conditional discrete diffusion process. Instead of operating in a continuous domain, DICArt progressively denoises a noisy pose representation through a learned reverse diffusion procedure to recover the ground-truth pose. To improve modeling fidelity, we propose a flexible flow decider that dynamically determines whether each token should be denoised or reset, effectively balancing the real and noise distributions during diffusion. Additionally, we incorporate a hierarchical kinematic coupling strategy that estimates the pose of each rigid part hierarchically to respect the object's kinematic structure. We validate DICArt on both synthetic and real-world datasets. Experimental results demonstrate its superior performance and robustness. By integrating discrete generative modeling with structural priors, DICArt offers a new paradigm for reliable category-level 6D pose estimation in complex environments.
