MonoSE(3)-Diffusion: A Monocular SE(3) Diffusion Framework for Robust Camera-to-Robot Pose Estimation

Kangjian Zhu, Haobo Jiang, Yigong Zhang, Jianjun Qian, Jian Yang, Jin Xie

TL;DR

This work addresses robust monocular camera-to-robot pose estimation by formulating it as a conditional SE(3) denoising diffusion problem conditioned on the robot model and input image. It introduces two key components: a visibility-constrained diffusion (VisDiff) that guarantees in-view, diverse training poses within the camera frustum, and a timestep-aware reverse diffusion (RevDiff) that progressively refines poses via DDIM sampling and a rendering-based denoiser. The method uses a monocular-normalized SE(3) formulation to decouple rotation and translation and to maintain invariance to camera intrinsics, enabling stable diffusion in $SO(3)$ and frustum-bounded translations. Empirically, MonoSE(3)-Diffusion achieves state-of-the-art results on DREAM and RoboKeyGen benchmarks, notably reaching an AUC of $66.75$ on AzureKinect-Franka and delivering a $32.3\%$ improvement over the best baselines, demonstrating robustness under challenging, low-visibility scenarios and suggesting strong practical impact for real-world robotic calibration and manipulation tasks.

Abstract

We propose MonoSE(3)-Diffusion, a monocular SE(3) diffusion framework that formulates markerless, image-based robot pose estimation as a conditional denoising diffusion process. The framework consists of two processes: a visibility-constrained diffusion process for diverse pose augmentation and a timestep-aware reverse process for progressive pose refinement. The diffusion process progressively perturbs ground-truth poses into noisy transformations for training a pose denoising network. Importantly, we integrate visibility constraints into the process, ensuring that the transformations remain within the camera field of view. Compared to the fixed-scale perturbations used in current methods, the diffusion process generates in-view and diverse training poses, thereby improving the network's generalization capability. Furthermore, the reverse process iteratively predicts poses with the denoising network and refines the pose estimates by sampling from the diffusion posterior of the current timestep, following a scheduled coarse-to-fine procedure. Moreover, the timestep indicates the transformation scale, which guides the denoising network toward more accurate pose predictions. The reverse process demonstrates higher robustness than direct prediction, benefiting from its timestep-aware refinement scheme. Our approach demonstrates improvements across two benchmarks (DREAM and RoboKeyGen), achieving a notable AUC of 66.75 on the most challenging dataset, representing a 32.3% gain over the state-of-the-art.
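The coarse-to-fine reverse process described above can be illustrated with a toy sketch. It is not the paper's implementation: it works on a plain Euclidean vector rather than an SE(3) pose, it uses the standard deterministic DDIM update (the TL;DR names DDIM sampling, but the exact variant is an assumption), and the names `ddim_step`, `reverse_process`, and `abar` (the cumulative noise schedule) are hypothetical. The denoiser is passed in as a callable, standing in for the rendering-based network.

```python
import math


def ddim_step(x_t, x0_hat, abar_t, abar_prev):
    """Deterministic DDIM update (eta = 0) for a Euclidean vector.

    x_t       : current noisy sample at timestep t
    x0_hat    : the denoiser's prediction of the clean sample
    abar_t    : cumulative schedule value at t (must be < 1)
    abar_prev : cumulative schedule value at t - 1
    """
    # Noise implied by the current sample and the predicted clean value.
    eps_hat = [
        (xt - math.sqrt(abar_t) * x0) / math.sqrt(1.0 - abar_t)
        for xt, x0 in zip(x_t, x0_hat)
    ]
    # Re-noise the prediction to the (smaller) noise level of step t - 1.
    return [
        math.sqrt(abar_prev) * x0 + math.sqrt(1.0 - abar_prev) * e
        for x0, e in zip(x0_hat, eps_hat)
    ]


def reverse_process(x_T, denoiser, abar):
    """Run the reverse loop from the noisiest step down to step 0.

    abar[0] is 1.0 (no noise); abar[-1] is small (mostly noise).
    denoiser(x, t) returns a prediction of the clean sample; it is a
    stand-in for the timestep-conditioned network in the text.
    """
    x = list(x_T)
    for t in range(len(abar) - 1, 0, -1):
        x0_hat = denoiser(x, t)
        x = ddim_step(x, x0_hat, abar[t], abar[t - 1])
    return x


# Usage with an idealized (oracle) denoiser, purely for illustration.
target = [0.2, -0.1, 1.5]
abar = [1.0, 0.9, 0.5, 0.1]
estimate = reverse_process([1.0, 1.0, 1.0], lambda x, t: target, abar)
```

Because `abar[0]` is 1.0, the last update returns the denoiser's final prediction unchanged, so with the oracle denoiser `estimate` matches `target`. The earlier steps keep part of the implied noise, which is what makes the refinement gradual rather than a single direct prediction.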

Paper Structure

This paper contains 16 sections, 10 equations, 6 figures, 3 tables, 2 algorithms.

Figures (6)

  • Figure 1: Comparison of the camera-to-robot pose estimation process between our proposed diffusion-based method and a regression-based method. (a) MonoSE(3)-Diffusion leverages a diffusion process to generate diverse training poses, and a scheduled reverse diffusion process to achieve coarse-to-fine pose estimation, conditioned on the robot model and the input image. (b) Conventional iterative regression suffers from inaccurate prediction and premature convergence. Note: The poses are visualized using edges, representing the mask boundaries of the rendered view under the poses.
  • Figure 2: Diffusion process. The visibility-constrained diffusion process is specifically designed to satisfy the viewing frustum constraints, and is decoupled into centroid-based rotation, translation in the image plane, and translation along the optical axis. It takes the ground-truth pose $\mathbf{H}_0$ and the image as inputs, generating a noisy pose $\mathbf{H}_t$ along with a sampled timestep $t$. The projections of noisy poses are bounded by the camera viewing frustum. Reverse process. The timestep-aware reverse process is driven by DDIM sampling and a pose denoising network. The network represents the noisy pose as a rendered view, which is concatenated with the cropped image and the embedded timestep before being processed by a ResNet. Subsequently, a pose update strategy yields the pose prediction $\hat{\mathbf{H}}_0$, which DDIM sampling then uses to generate a progressively denoised pose $\mathbf{H}_{t-1}$ for the next iteration.
  • Figure 3: Visibility constraints in monocular images. (a) Rotation around the camera's optical center (defined in the camera coordinate frame) can easily transform the original pose out of the camera's field of view. (b) The same translation can lead to different shifts in the image plane, depending on focal length (left) and image size (right), making it difficult to define consistent diffusion parameters for 3D translation.
  • Figure 4: Visualization of pose estimation results and processes.
  • Figure 5: Distribution of ADD scores on the RoboKeyGen benchmark.
  • ...and 1 more figure
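
The focal-length effect noted in the Figure 3(b) caption can be checked with a small pinhole-camera example. This is a minimal sketch for illustration only: the function names `project` and `pixel_shift`, the intrinsics, and the sample point are all assumptions, and dividing the pixel shift by the focal length is just one simple way to show an intrinsics-independent quantity, not necessarily the normalization the paper uses.

```python
def project(point, focal, cx, cy):
    """Pinhole projection of a camera-frame 3D point (x, y, z) to pixels."""
    x, y, z = point
    return (focal * x / z + cx, focal * y / z + cy)


def pixel_shift(point, translation, focal, cx, cy):
    """Image-plane shift (in pixels) caused by translating a 3D point."""
    u0, v0 = project(point, focal, cx, cy)
    moved = tuple(p + t for p, t in zip(point, translation))
    u1, v1 = project(moved, focal, cx, cy)
    return (u1 - u0, v1 - v0)


# Hypothetical setup: a point 1 m in front of the camera, moved 10 cm along x.
point = (0.0, 0.0, 1.0)
translation = (0.1, 0.0, 0.0)

# The same 3D translation seen through two different focal lengths.
shift_short = pixel_shift(point, translation, focal=500.0, cx=320.0, cy=240.0)
shift_long = pixel_shift(point, translation, focal=1000.0, cx=320.0, cy=240.0)

# Dividing by the focal length removes the dependence on the intrinsics.
norm_short = shift_short[0] / 500.0
norm_long = shift_long[0] / 1000.0
```

Here the horizontal shift is about 50 pixels at a focal length of 500 and about 100 pixels at 1000, while the focal-normalized shift is about 0.1 in both cases. A noise scale chosen directly in metric 3D space therefore produces camera-dependent image motion, which is why a perturbation defined in pixels or metres alone is hard to keep consistent across cameras.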