BoxDreamer: Dreaming Box Corners for Generalizable Object Pose Estimation

Yuanhong Yu, Xingyi He, Chen Zhao, Junhao Yu, Jiaqi Yang, Ruizhen Hu, Yujun Shen, Xing Zhu, Xiaowei Zhou, Sida Peng

TL;DR

BoxDreamer introduces a generalizable RGB-based approach for 6DoF object pose estimation from sparse views by using object bounding box corners as an intermediate representation. It recovers a 3D bounding box from sparse reference images and employs a transformer decoder to predict 2D projections of the box corners in the query image, establishing 2D-3D correspondences for PnP. The method achieves state-of-the-art performance on Occluded-LINEMOD and YCB-Video under sparse-view conditions, while delivering real-time inference (~17 ms per image) after offline bounding-box reconstruction. Key contributions include (1) bounding box corner representation, (2) a reference-based corner synthesizer producing 2D corner heatmaps, and (3) robust generalization across occlusion and texture variations with minimal reference views.

Abstract

This paper presents a generalizable RGB-based approach for object pose estimation, specifically designed to address challenges in sparse-view settings. While existing methods can estimate the poses of unseen objects, their generalization ability remains limited in scenarios involving occlusions and sparse reference views, restricting their real-world applicability. To overcome these limitations, we introduce corner points of the object bounding box as an intermediate representation of the object pose. The 3D object corners can be reliably recovered from sparse input views, while the 2D corner points in the target view are estimated through a novel reference-based point synthesizer, which works well even in scenarios involving occlusions. As object semantic points, object corners naturally establish 2D-3D correspondences for object pose estimation with a PnP algorithm. Extensive experiments on the YCB-Video and Occluded-LINEMOD datasets show that our approach outperforms state-of-the-art methods, highlighting the effectiveness of the proposed representation and significantly enhancing the generalization capabilities of object pose estimation, which is crucial for real-world applications.
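
The correspondence idea in the abstract can be illustrated with a small sketch. It builds the eight corners of a 3D box and pairs each one with a 2D image point, which is the input format a PnP solver consumes. Everything here is an illustrative assumption rather than the authors' code: the box is axis-aligned, the function names `box_corners`, `project`, and `correspondences` are hypothetical, and the 2D points are produced by projecting with a known pose through a pinhole camera, whereas BoxDreamer predicts them with its reference-based synthesizer.

```python
from itertools import product


def box_corners(min_xyz, max_xyz):
    """Return the 8 corners of an axis-aligned 3D box (assumed box shape)."""
    # zip pairs (min, max) per axis; product enumerates all 2*2*2 choices.
    return [tuple(c) for c in product(*zip(min_xyz, max_xyz))]


def project(point, R, t, K):
    """Project a 3D object-frame point to pixels with a pinhole camera.

    R is a 3x3 rotation (nested lists), t a 3-vector, and
    K = (fx, fy, cx, cy) holds the intrinsics. No lens distortion is modeled.
    """
    # Object frame -> camera frame: X_c = R X + t
    xc = [sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3)]
    if xc[2] <= 0:
        raise ValueError("point is behind the camera")
    fx, fy, cx, cy = K
    return (fx * xc[0] / xc[2] + cx, fy * xc[1] / xc[2] + cy)


def correspondences(corners, R, t, K):
    """Pair each 3D corner with its 2D projection: [((u, v), (X, Y, Z)), ...]."""
    return [(project(X, R, t, K), X) for X in corners]


# Hypothetical example values: a 20 cm cube placed 2 m in front of the camera.
R_example = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t_example = (0.0, 0.0, 2.0)
K_example = (500.0, 500.0, 320.0, 240.0)
pairs = correspondences(
    box_corners((-0.1, -0.1, -0.1), (0.1, 0.1, 0.1)),
    R_example, t_example, K_example,
)
```

Given eight such pairs with predicted rather than projected 2D points, a standard PnP solver would recover `R` and `t`; that solver is not implemented in this sketch.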

Paper Structure

This paper contains 18 sections, 9 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Comparison of Different Generalizable Paradigms. Unlike existing generalizable object pose estimation methods, BoxDreamer leverages a bounding box representation to handle incomplete object observations and achieves robust pose estimation even under severe occlusions.
  • Figure 2: Overview. For each object, BoxDreamer first recovers its rough structure from a set of reference images using a sparse-view reconstruction method. During object pose inference, BoxDreamer predicts 2D bounding box heatmaps for the query image guided by reference box corners, establishing 2D-3D correspondences and recovering the object pose through the PnP algorithm.
  • Figure 3: Qualitative Comparison on Occluded-LINEMOD and YCB-Video. Green boxes indicate the ground truth, while blue boxes represent the predicted results. Both quantitative and qualitative results demonstrate the method's effectiveness under occlusion.
  • Figure 4: The trend of performance on the LINEMOD dataset as the number of reference views changes.
  • Figure 5: Qualitative results on different bounding boxes. Top left: ground-truth object bounding box; Top right and others: bounding boxes recovered from DUSt3R using five reference images from three different reference databases introduced in Sec. \ref{sec:setup}.
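
The Figure 2 caption mentions per-corner 2D heatmaps in the query image. A minimal sketch of how such heatmaps could be turned into pixel coordinates follows. The summary does not state which decoding rule BoxDreamer uses, so both the peak (argmax) rule and the intensity-weighted centroid shown here are assumptions, and the function names are hypothetical. Heatmaps are plain nested lists indexed as `heatmap[y][x]`.

```python
def argmax_2d(heatmap):
    """Return (x, y, score) of the highest-valued cell in a 2D heatmap."""
    score, x, y = max(
        ((v, x, y) for y, row in enumerate(heatmap) for x, v in enumerate(row)),
        key=lambda item: item[0],
    )
    return x, y, score


def centroid_2d(heatmap):
    """Return the intensity-weighted (x, y) centroid, a sub-pixel alternative."""
    total = sum(sum(row) for row in heatmap)
    if total <= 0:
        raise ValueError("heatmap has no positive mass")
    x = sum(x * v for row in heatmap for x, v in enumerate(row)) / total
    y = sum(y * v for y, row in enumerate(heatmap) for v in row) / total
    return x, y


def decode_corners(heatmaps):
    """Decode one (x, y) peak location per corner heatmap."""
    return [argmax_2d(h)[:2] for h in heatmaps]
```

The peak score returned by `argmax_2d` could serve as a rough confidence for each corner, for example to down-weight occluded corners before pose estimation; whether the method does this is not stated in the summary.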