Improving 2D-3D Dense Correspondences with Diffusion Models for 6D Object Pose Estimation

Peter Hönig, Stefan Thalhammer, Markus Vincze

TL;DR

This work tackles RGB-only 6D object pose estimation by learning dense 2D-3D correspondences via image-to-image translation. It compares GAN-based Pix2Pix and diffusion-based Brownian-Bridge Diffusion Model (BBDM) under identical training conditions and data augmentations, demonstrating that diffusion models yield more accurate dense maps and superior pose estimates, as well as better segmentation. The study uses Linemod-Occluded with synthetic training and real testing, showing diffusion-based methods achieve higher $ADD(-S)$ and lower reconstruction error ($MSE$), albeit at higher runtime, and outperform contemporary methods like Pix2Pose and DPOD in AR benchmarks. Overall, diffusion-based image-to-image translation is shown to enhance geometry-aware perception pipelines, suggesting its broad applicability to robust 6D pose estimation in clutter and occlusion.

Abstract

Estimating 2D-3D correspondences between RGB images and 3D space is a fundamental problem in 6D object pose estimation. Recent pose estimators use dense correspondence maps and Point-to-Point algorithms to estimate object poses. The accuracy of pose estimation depends heavily on the quality of the dense correspondence maps and their ability to withstand occlusion, clutter, and challenging material properties. Currently, dense correspondence maps are estimated using image-to-image translation models based on GANs, Autoencoders, or direct regression models. However, recent advancements in image-to-image translation have led to diffusion models being the superior choice when evaluated on benchmarking datasets. In this study, we compare image-to-image translation networks based on GANs and diffusion models for the downstream task of 6D object pose estimation. Our results demonstrate that the diffusion-based image-to-image translation model outperforms the GAN, revealing potential for further improvements in 6D object pose estimation models.

Paper Structure

This paper contains 10 sections, 3 equations, 3 figures, and 5 tables.

Figures (3)

  • Figure 1: Illustration of 2D-3D Dense Correspondences Estimation using Diffusion Models. An arbitrary object detector is used to estimate a region of interest of an object. The diffusion model estimates the normalized object coordinates map from the RGB input crop. The RANSAC + PnP step solves the downstream task of 6D object pose estimation.
  • Figure 2: Illustration of location priors (top row) and rendered dense correspondence maps (bottom row). Ground-truth translation, rotation, and camera intrinsics are used to render each dense correspondence map so that it forms an aligned image pair with its RGB location prior.
  • Figure 3: Influence of reconstruction quality on the estimated 6D pose. Selected images from the LMO test data: pixel-wise error visualized with the jet colormap (top two rows) and its influence on the estimated 6D poses (bottom two rows). Estimated bounding boxes are shown in green, ground-truth bounding boxes in blue.
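
The step between the predicted coordinate map and the RANSAC + PnP stage in Figure 1 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `correspondences_from_map`, its parameters, and the per-axis min-max normalization over the model's bounding box are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[float, float]
Point3D = Tuple[float, float, float]


def correspondences_from_map(
    coord_map: Sequence[Sequence[Point3D]],  # H x W map, each channel in [0, 1]
    mask: Sequence[Sequence[bool]],          # H x W object mask for the crop
    crop_origin: Pixel,                      # top-left corner of the crop in the full image
    model_min: Point3D,                      # assumed: min corner of the model's bounding box
    model_max: Point3D,                      # assumed: max corner of the model's bounding box
) -> Tuple[List[Pixel], List[Point3D]]:
    """Collect 2D-3D correspondences from a normalized object coordinate map.

    Hypothetical helper: assumes each channel was normalized per axis
    with min-max scaling over the object model's bounding box.
    """
    points_2d: List[Pixel] = []
    points_3d: List[Point3D] = []
    for v, row in enumerate(coord_map):
        for u, channels in enumerate(row):
            if not mask[v][u]:
                continue  # skip background pixels
            # Pixel location in full-image coordinates.
            points_2d.append((crop_origin[0] + u, crop_origin[1] + v))
            # Undo the normalization to recover object-space coordinates.
            points_3d.append(
                tuple(
                    lo + c * (hi - lo)
                    for c, lo, hi in zip(channels, model_min, model_max)
                )
            )
    return points_2d, points_3d
```

The two returned lists, together with the camera intrinsics, are the inputs a RANSAC + PnP solver needs; OpenCV's `cv2.solvePnPRansac` is one commonly used option.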