
DeMo-Pose: Depth-Monocular Modality Fusion for Object Pose Estimation

Rachit Agarwal, Abhishek Joshi, Sathish Chalasani, Woo Jin Kim

Abstract

Object pose estimation is a fundamental task in 3D vision with applications in robotics, AR/VR, and scene understanding. We address the challenge of category-level 9-DoF pose estimation (6D pose + 3D size) from RGB-D input, without relying on CAD models during inference. Existing depth-only methods achieve strong results but ignore semantic cues from RGB, while many RGB-D fusion models underperform due to suboptimal cross-modal fusion that fails to align semantic RGB cues with 3D geometric representations. We propose DeMo-Pose, a hybrid architecture that fuses monocular semantic features with depth-based graph convolutional representations via a novel multimodal fusion strategy. To further improve geometric reasoning, we introduce a Mesh-Point Loss (MPL) that leverages mesh structure during training without adding inference overhead. Our approach achieves real-time inference and significantly improves over state-of-the-art methods across object categories, outperforming the strong GPV-Pose baseline by 3.2% on 3D IoU and 11.1% on pose accuracy on the REAL275 benchmark. These results highlight the effectiveness of depth-RGB fusion and geometry-aware learning, enabling robust category-level 3D pose estimation for real-world applications.
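The abstract names the Mesh-Point Loss but does not define it here. As a rough illustration only, the sketch below (PyTorch; the exact formulation, function names, and pose parameterization are our assumptions, not the paper's definition) penalizes the distance from model points, transformed by the predicted 9-DoF pose, to points sampled on the ground-truth mesh surface. Since the mesh enters only through the supervision target, such a term adds no inference-time cost, consistent with the property claimed above.

import torch

def sample_mesh_surface(vertices, faces, n_samples=1024):
    """Uniformly sample points on mesh triangles via barycentric coordinates.

    vertices: (V, 3) float tensor; faces: (F, 3) long tensor.
    """
    tri = vertices[faces]                                   # (F, 3, 3) corners
    # Face areas determine how many samples each triangle receives.
    e1, e2 = tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0]
    areas = 0.5 * torch.linalg.cross(e1, e2).norm(dim=-1)
    idx = torch.multinomial(areas, n_samples, replacement=True)
    # Random barycentric weights (the square-root trick keeps them uniform).
    u = torch.sqrt(torch.rand(n_samples, 1))
    v = torch.rand(n_samples, 1)
    w0, w1, w2 = 1.0 - u, u * (1.0 - v), u * v
    t = tri[idx]
    return w0 * t[:, 0] + w1 * t[:, 1] + w2 * t[:, 2]       # (n_samples, 3)

def mesh_point_loss(pred_R, pred_t, pred_s, model_points, surface_points):
    """One-sided chamfer distance from posed model points to the GT surface.

    pred_R: (3, 3) rotation; pred_t: (3,) translation; pred_s: scalar scale.
    model_points: (N, 3) canonical points; surface_points: (M, 3) GT samples.
    """
    posed = pred_s * model_points @ pred_R.T + pred_t       # (N, 3)
    d = torch.cdist(posed, surface_points)                  # (N, M) pairwise
    return d.min(dim=1).values.mean()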



Figures (3)

  • Figure 1: Impact of inaccurate rotation, scale, and translation when overlaying a virtual keyboard template on a physical keyboard in MR. The bottom image shows correct alignment using pose estimation for improved user experience, as seen through the Meta Quest HMD.
  • Figure 2: Fusion module architecture that leverages RGB features obtained from the monocular detection model and fuses them with depth-based GCN features. At inference time we achieve real-time performance, suitable for on-device deployment. More details are provided in the proposed-method section; a schematic code sketch follows this list.
  • Figure 3: (a) Comparison of predictions across video frames: GPV-Pose exhibits temporal instability for the laptop category (top row), while our DeMo-Pose fusion yields stable predictions (bottom row). (b) Predictions are shown as green boxes and ground truth as black boxes; our method (bottom row) produces tighter, more accurate boxes for the laptop and mug than GPV-Pose (top row).
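Figure 2 describes the fusion module only at a high level. The minimal sketch below (PyTorch; module and parameter names are hypothetical, not the paper's implementation) shows one common way such a fusion can be realized: per-point RGB features are bilinearly sampled from the 2D feature map at the pixels each depth point projects to, then concatenated with the GCN point features and mixed by a shared MLP.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PointRGBFusion(nn.Module):
    def __init__(self, rgb_dim=64, geo_dim=128, out_dim=128):
        super().__init__()
        # Shared per-point MLP mixing the two modalities.
        self.fuse = nn.Sequential(
            nn.Linear(rgb_dim + geo_dim, out_dim), nn.ReLU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, rgb_feat, geo_feat, uv):
        """rgb_feat: (B, C_rgb, H, W) feature map from the RGB branch.
        geo_feat: (B, N, C_geo) per-point features from the depth GCN.
        uv: (B, N, 2) projected pixel coords, normalized to [-1, 1].
        """
        # Bilinearly sample the RGB map at each point's projection.
        sampled = F.grid_sample(rgb_feat, uv.unsqueeze(2),
                                align_corners=False)        # (B, C_rgb, N, 1)
        sampled = sampled.squeeze(-1).transpose(1, 2)       # (B, N, C_rgb)
        # Concatenate modalities per point and fuse.
        return self.fuse(torch.cat([sampled, geo_feat], dim=-1))

Sampling at projected pixel locations is what aligns the semantic RGB cues with the 3D geometric representation; the misalignment the abstract attributes to weaker fusion schemes typically stems from pooling a single global RGB feature instead.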