DICArt: Advancing Category-level Articulated Object Pose Estimation in Discrete State-Spaces

Li Zhang, Mingyu Mei, Ailing Wang, Xianhui Meng, Yan Zhong, Xinyuan Song, Liu Liu, Rujing Wang, Zaixing He, Cewu Lu

TL;DR

DICArt addresses category-level articulated pose estimation by casting pose inference as a conditional discrete diffusion process over per-part pose tokens, incorporating a Flowing Mechanism and Flexible Flow Decider to enable gentle, adaptive denoising. It further enforces physical plausibility through hierarchical kinematic coupling, distinguishing Parent and Child parts and predicting joint axes with orthogonality constraints. The method discretizes pose into token bins, uses a block-diagonal transition to maintain token-type consistency, and augments the state with a [MASK] token to improve stability. Experiments on synthetic, semi-synthetic, and real-world datasets demonstrate SOTA or near-SOTA performance, with strong generalization across domains and robustness to self-occlusion. Overall, DICArt introduces a discrete-diffusion framework that integrates structural priors to advance reliable category-level 6D pose estimation for articulated objects.
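The block-diagonal transition with an extra [MASK] state can be illustrated with a small sketch. It is not taken from the paper: the function name `block_transition`, the two token types (4 rotation bins, 3 translation bins), and the probabilities `stay` and `to_mask` are assumptions chosen only to show the structure. A token may resample within its own block or move to [MASK], but never jumps to a bin of another token type.

```python
import numpy as np

def block_transition(block_sizes, stay, to_mask):
    """Build a row-stochastic matrix Q with Q[i, j] = P(next = j | current = i).

    Hypothetical layout: one block of bins per token type, plus a final
    absorbing [MASK] state shared by all types.
    """
    n = sum(block_sizes)
    mask = n  # index of the extra [MASK] state
    q = np.zeros((n + 1, n + 1))
    start = 0
    for size in block_sizes:
        end = start + size
        idx = np.arange(start, end)
        # Leftover probability is spread uniformly inside the same block only.
        q[start:end, start:end] = (1.0 - stay - to_mask) / size
        q[idx, idx] += stay          # keep the current bin
        q[start:end, mask] = to_mask  # move to [MASK]
        start = end
    q[mask, mask] = 1.0  # [MASK] is absorbing in this sketch
    return q

# Assumed sizes: 4 rotation bins and 3 translation bins.
Q = block_transition(block_sizes=[4, 3], stay=0.8, to_mask=0.1)
print(Q.shape)
print(Q.sum(axis=1))
```

Every row sums to one, and the off-diagonal blocks are zero, which is what keeps a rotation token from turning into a translation token during noising.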

Abstract

Articulated object pose estimation is a core task in embodied AI. Existing methods typically regress poses in a continuous space, but they often 1) struggle to navigate a large, complex search space and 2) fail to incorporate intrinsic kinematic constraints. In this work, we introduce DICArt (DIsCrete Diffusion for Articulation Pose Estimation), a novel framework that formulates pose estimation as a conditional discrete diffusion process. Instead of operating in a continuous domain, DICArt progressively denoises a noisy discrete pose representation through a learned reverse diffusion procedure to recover the ground-truth (GT) pose. To improve modeling fidelity, we propose a flexible flow decider that dynamically determines whether each token should be denoised or reset, balancing the real and noise distributions during diffusion. Additionally, we incorporate a hierarchical kinematic coupling strategy that estimates the pose of each rigid part hierarchically to respect the object's kinematic structure. We validate DICArt on both synthetic and real-world datasets, where experimental results demonstrate its superior performance and robustness. By integrating discrete generative modeling with structural priors, DICArt offers a new paradigm for reliable category-level 6D pose estimation in complex environments.
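The idea of deciding, per token, between denoising and resetting can be sketched as follows. The abstract does not specify how the flexible flow decider makes this choice, so the confidence-threshold rule below is purely an assumption for illustration, as are the names `decide_flow`, `MASK`, and `threshold`.

```python
import numpy as np

MASK = -1  # hypothetical sentinel standing in for the [MASK] token

def decide_flow(probs, threshold=0.5):
    """Toy per-token decision: commit confident tokens, reset the rest.

    probs has shape (num_tokens, num_bins); each row is a predicted
    distribution over bins. The threshold rule is an assumed stand-in
    for the paper's decider.
    """
    confidence = probs.max(axis=1)
    proposal = probs.argmax(axis=1)
    keep = confidence >= threshold
    # Confident tokens take their predicted bin; others return to [MASK].
    return np.where(keep, proposal, MASK), keep

probs = np.array([[0.7, 0.2, 0.1],
                  [0.4, 0.3, 0.3],
                  [0.1, 0.1, 0.8]])
tokens, keep = decide_flow(probs)
print(tokens)  # [ 0 -1  2]
```

In this toy run the second token is reset because no bin reaches the assumed threshold, while the first and third are denoised to their most likely bins.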

Paper Structure

This paper contains 13 sections, 6 equations, 4 figures, and 4 tables.

Figures (4)

  • Figure 1: Comparison of Different Denoising Processes. We denote the rotation-related Euler angles as $l, m, n$ and predict them as discretized bin indices. (a) illustrates the vanilla denoising process of conventional discrete diffusion models, where inconsistent convergence rates across tokens often introduce uncertainty and ambiguity in pose prediction; this can be viewed as an aggressive denoising strategy. (b) presents the reformulated denoising process proposed in this work, which is centered on a customized Flowing Mechanism. This mechanism introduces adaptive directional guidance that determines an appropriate update path for each token. It is designed to enforce consistent convergence trajectories among semantically correlated token groups, thereby enabling a more stable and gentle denoising process. Note that a rigid part (bottle) is chosen to illustrate the process for simplicity.
  • Figure 2: The Pipeline of Our DICArt. Please note that images of the object with varying saturation levels represent different degrees of noise in the pose annotations.
  • Figure 3: Qualitative Results on the Synthetic Dataset (left) and RGB-D Images Dataset (right).
  • Figure 4: Qualitative Results on 7-part RobotArm Dataset.
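The discretized bin indices mentioned in the Figure 1 caption can be illustrated with a minimal quantization sketch. The angle range of [-180, 180) degrees, the bin count of 72, and the helper names `angle_to_bin` and `bin_to_angle` are assumptions made for this example; the paper's actual bin resolution is not stated in this section.

```python
import numpy as np

def angle_to_bin(angle_deg, num_bins=72):
    """Map angles in degrees to bin indices over an assumed [-180, 180) range."""
    width = 360.0 / num_bins
    wrapped = (np.asarray(angle_deg, dtype=float) + 180.0) % 360.0
    idx = np.floor(wrapped / width).astype(int)
    return np.minimum(idx, num_bins - 1)  # guard against float edge cases

def bin_to_angle(index, num_bins=72):
    """Recover the bin-center angle in degrees for each index."""
    width = 360.0 / num_bins
    return (np.asarray(index) + 0.5) * width - 180.0

# Three Euler angles (l, m, n) for one rigid part, in degrees.
euler = np.array([-180.0, 0.0, 90.5])
bins = angle_to_bin(euler)
recovered = bin_to_angle(bins)
print(bins)       # [ 0 36 54]
print(recovered)  # [-177.5    2.5   92.5]
```

With 72 bins each bin spans 5 degrees, so the round-trip error is at most 2.5 degrees; a finer grid reduces this error at the cost of a larger token vocabulary.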