Joint Learning of Pose Regression and Denoising Diffusion with Score Scaling Sampling for Category-level 6D Pose Estimation

Seunghyun Lee, Tae-Kyun Kim

TL;DR

This work tackles category-level 6D pose estimation by addressing two weaknesses of diffusion-based methods: inefficient training and the need for post-hoc pose filtering. It introduces a joint learning framework that pre-trains an encoder with direct pose regression and then jointly trains the regression head with a diffusion denoising head, significantly accelerating convergence. A time-dependent score-scaling guidance steers diffusion samples toward high-density regions, preserving the multi-modal pose distribution of symmetric objects while enabling high-quality single-sample inference without an extra evaluation network. On the REAL275, HouseCat6D, and ROPE benchmarks, the method achieves state-of-the-art results while being more efficient in both training and inference.

Abstract

Recent diffusion models have shown promising results in category-level 6D object pose estimation by modeling the conditional pose distribution given a depth image. Existing methods, however, converge slowly during training because they learn the encoder end-to-end with the diffusion denoising network, and they require an additional network that evaluates sampled pose hypotheses to filter out low-quality candidates. In this paper, we propose a novel pipeline that addresses these limitations with two key components. First, the proposed method pre-trains the encoder with a direct pose regression head, and then jointly trains the networks via the regression head and the denoising diffusion head, significantly accelerating training convergence while achieving higher accuracy. Second, we propose sampling guidance via time-dependent score scaling that balances the exploration-exploitation trade-off, eliminating the need for the additional evaluation network. The sampling guidance maintains the multi-modal characteristics of symmetric objects at early denoising steps while ensuring high-quality pose generation at the final steps. Extensive experiments on multiple benchmarks, including REAL275, HouseCat6D, and ROPE, demonstrate that the proposed method, simple yet effective, achieves state-of-the-art accuracy even with single-pose inference, while being more efficient in both training and inference.
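
To make the second component more concrete, the following is a minimal toy sketch of scaling the score by a time-dependent weight during reverse diffusion sampling. It is not the authors' implementation. Everything in it is an assumption chosen for illustration: the 1D two-mode Gaussian mixture (standing in for the equivalent poses of a symmetric object), the analytic score, the variance-exploding noise schedule, the linear ramp in `score_scale`, and the value `w_max=3.0`. With a weight of 1 at high noise, samples can still reach either mode; with a larger weight at low noise, they are pulled closer to the mode centers.

```python
import math
import random

# Toy setup (all values are illustrative assumptions, not from the paper).
MODE = 2.0                        # two symmetric modes at +MODE and -MODE
DATA_STD = 0.3                    # spread of the clean data around each mode
SIGMA_MAX, SIGMA_MIN = 5.0, 0.01  # noise levels at the start and end of sampling
NUM_STEPS = 200


def sigma_at(i):
    """Geometric noise schedule: SIGMA_MAX at i=NUM_STEPS, SIGMA_MIN at i=0."""
    return SIGMA_MIN * (SIGMA_MAX / SIGMA_MIN) ** (i / NUM_STEPS)


def score(x, sigma):
    """Analytic score of the noised two-mode mixture (stands in for a learned network)."""
    v = DATA_STD ** 2 + sigma ** 2
    return (MODE * math.tanh(MODE * x / v) - x) / v


def score_scale(i, w_max):
    """Hypothetical schedule: weight 1 at the first step, rising linearly toward w_max."""
    progress = 1.0 - i / NUM_STEPS
    return 1.0 + (w_max - 1.0) * progress


def sample(rng, w_max):
    """Euler-Maruyama reverse sampling with the score multiplied by score_scale."""
    x = rng.gauss(0.0, SIGMA_MAX)  # approximate prior at the highest noise level
    for i in range(NUM_STEPS, 0, -1):
        delta = sigma_at(i) ** 2 - sigma_at(i - 1) ** 2
        drift = delta * score_scale(i, w_max) * score(x, sigma_at(i))
        x += drift + math.sqrt(delta) * rng.gauss(0.0, 1.0)
    return x


def summarize(samples):
    """Mean distance to the nearest mode, and fraction of samples in the positive mode."""
    spread = sum(abs(abs(x) - MODE) for x in samples) / len(samples)
    frac_pos = sum(x > 0 for x in samples) / len(samples)
    return spread, frac_pos


rng = random.Random(0)
plain = [sample(rng, w_max=1.0) for _ in range(500)]   # no guidance
guided = [sample(rng, w_max=3.0) for _ in range(500)]  # time-dependent scaling
plain_spread, plain_frac = summarize(plain)
guided_spread, guided_frac = summarize(guided)
print(f"plain : spread={plain_spread:.3f}, positive-mode fraction={plain_frac:.2f}")
print(f"guided: spread={guided_spread:.3f}, positive-mode fraction={guided_frac:.2f}")
```

In this toy setting, a smaller spread for the guided run together with a positive-mode fraction that stays near one half would correspond to the behavior the abstract describes: fewer low-quality samples without collapsing the symmetric distribution onto a single mode.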

Paper Structure

This paper contains 35 sections, 14 equations, 15 figures, 7 tables.

Figures (15)

  • Figure 1: (a) Joint learning phase: The encoder $\mathcal{E}$ and regression head $\mathcal{R}_{\phi}$ are first pre-trained on target data, then jointly trained with the diffusion head $\mathcal{D}_{\theta}$. (b) Inference phase: Single pose sampling using score scaling guidance $w_t$ to update scores at each timestep. (c) Sampling results on symmetric objects, showing that our score scaling guidance prevents outlier poses while preserving the symmetric distribution, in contrast to sampling without guidance.
  • Figure 2: Visualization of sampled poses for a symmetric object under different scaling strategies. The rotation distribution is visualized using a Mollweide projection inspired by Murphy et al. (2021), where yaw and pitch rotations are mapped to longitude-latitude coordinates, with roll shown as color. The center of the circle represents the ground truth pose.
  • Figure 3: Rotation (left) and translation (right) errors along sampling trajectories from $T$ to $0$, comparing our score scaling guidance (green) with the baseline (blue). Trajectories start from 50 sampled initial noises. Solid lines and shaded regions indicate the mean and standard deviation, respectively.
  • Figure 4: (a) compares various pre-training strategies alongside joint learning, and (b) shows from-scratch vs. joint learning performance with or without 6D pose regression pre-training. (c) illustrates the regression head performance during the pre-training and joint learning phases, and (d) compares performance under different sampling steps. All results are on REAL275, where shaded regions in (a), (b) and (d) indicate min/max over 3 evaluations.
  • Figure 5: Comparison between training from scratch and joint learning on the HouseCat6D and ROPE datasets, evaluated on the $10^\circ 5\text{cm}$ metric.
  • ...and 10 more figures