You Only Pose Once: A Minimalist's Detection Transformer for Monocular RGB Category-level 9D Multi-Object Pose Estimation

Hakjin Lee, Junghoon Seo, Jaehoon Sim

TL;DR

This work tackles monocular RGB, category-level $9$-DoF pose estimation by proposing YOPO, a single-stage, end-to-end RGB-only transformer detector that unifies object detection with 3D pose reasoning. YOPO directly predicts $(c, \boldsymbol{R}, \boldsymbol{t}, \boldsymbol{s})$ in one forward pass, using a bounding-box–conditioned 3D module and a 6D-aware bipartite matching objective, trained solely from RGB images with pose labels. Across CAMERA25, REAL275, and HouseCat6D, YOPO achieves state-of-the-art performance among RGB-only methods and narrows the gap to RGB-D systems, with ablations highlighting the importance of 3D-aware matching and bounding-box conditioning. The approach offers a simple, scalable baseline for RGB-only 9D perception and opens avenues for robustness to occlusion, domain shift, and temporal cues in future work.

Abstract

Accurately recovering the full 9-DoF pose of unseen instances within specific categories from a single RGB image remains a core challenge for robotics and automation. Most existing solutions still rely on pseudo-depth, CAD models, or multi-stage cascades that separate 2D detection from pose estimation. Motivated by the need for a simpler, RGB-only alternative that learns directly at the category level, we revisit a longstanding question: Can object detection and 9-DoF pose estimation be unified with high performance, without any additional data? We show that they can with our method, YOPO, a single-stage, query-based framework that treats category-level 9-DoF estimation as a natural extension of 2D detection. YOPO augments a transformer detector with a lightweight pose head, a bounding-box-conditioned translation module, and a 6D-aware Hungarian matching cost. The model is trained end-to-end only with RGB images and category-level pose labels. Despite its minimalist design, YOPO sets a new state of the art on three benchmarks. On the REAL275 dataset, it achieves 79.6% $\mathrm{IoU}_{50}$ and 54.1% under the $10^\circ\,10\,\mathrm{cm}$ metric, surpassing prior RGB-only methods and closing much of the gap to RGB-D systems. The code, models, and additional qualitative results can be found on https://mikigom.github.io/YOPO-project-page.
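
To make the $10^\circ\,10\,\mathrm{cm}$ criterion quoted above concrete, the following is a minimal sketch of how such a threshold check is commonly computed: a prediction counts as correct when its rotation error is at most 10 degrees and its translation error is at most 10 cm. The function names, the nested-list matrix representation, and the assumption that translations are given in metres are illustrative choices, not taken from the YOPO code. The sketch also ignores the special handling of symmetric object categories that benchmark evaluation scripts typically apply.

```python
import math


def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle in degrees between two 3x3 rotation matrices (nested lists)."""
    # trace(R_pred^T @ R_gt) equals the sum of element-wise products.
    trace = sum(R_pred[i][j] * R_gt[i][j] for i in range(3) for j in range(3))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


def translation_error_cm(t_pred, t_gt):
    """Euclidean distance in centimetres, assuming both inputs are in metres."""
    return 100.0 * math.dist(t_pred, t_gt)


def within_threshold(R_pred, t_pred, R_gt, t_gt, max_deg=10.0, max_cm=10.0):
    """True if both the rotation and the translation error are under their thresholds."""
    return (rotation_error_deg(R_pred, R_gt) <= max_deg
            and translation_error_cm(t_pred, t_gt) <= max_cm)


def rot_z(deg):
    """Helper for the example: rotation about the z-axis by `deg` degrees."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


if __name__ == "__main__":
    R_gt, t_gt = rot_z(0.0), (0.0, 0.0, 1.0)
    # 5 degrees and 5 cm off: counted as correct under the 10 degree / 10 cm criterion.
    print(within_threshold(rot_z(5.0), (0.05, 0.0, 1.0), R_gt, t_gt))
    # 20 degrees off: rejected regardless of translation accuracy.
    print(within_threshold(rot_z(20.0), (0.0, 0.0, 1.0), R_gt, t_gt))
```

The reported percentage is then the fraction of evaluated objects for which such a check passes, under whatever matching and symmetry conventions the benchmark defines.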

Paper Structure

This paper contains 21 sections, 7 equations, 3 figures, 7 tables.

Figures (3)

  • Figure 1: Main contribution of this paper. Unlike prevailing category-level pose estimation methods that rely on external geometric priors such as 3D CAD models, instance segmentation masks, or pseudo-depth maps (top), our framework is end-to-end and requires none of these (bottom). Using only a raw RGB image as input, YOPO delivers state-of-the-art joint detection and 9D pose estimation for all objects in a single forward pass, with no intermediate steps or post-processing.
  • Figure 2: Overview of our method. (a) The model predicts object properties from transformer-decoder outputs using task-specific heads. (b) The translation and depth head estimates 2D center locations as offsets from bounding-box centers, enabling 3D translation and depth recovery via back-projection. Predicted bounding boxes are concatenated with the input query to provide spatial information for accurate 2D center and depth estimation.
  • Figure 3: Qualitative comparison of pose estimation results on the REAL275 dataset. We compare our model with MonoDiff9D [jian2025monodiff9d]. Predicted poses are shown in red, while ground-truth annotations are shown in green.
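
The back-projection step mentioned in the Figure 2 caption can be sketched under a standard pinhole camera model. This is an illustration only: the function name, the corner-format bounding box, the offset expressed in pixels, and the intrinsics `fx`, `fy`, `cx`, `cy` are assumptions made for the example, and the actual parameterization used by YOPO may differ (for instance, offsets could be normalized by box size).

```python
def backproject_center(bbox_xyxy, offset_px, depth_m, fx, fy, cx, cy):
    """Recover a 3D translation from a 2D box, a center offset, and a depth.

    bbox_xyxy : (x1, y1, x2, y2) box corners in pixels (assumed format)
    offset_px : (du, dv) predicted offset of the projected object center
                from the box center, in pixels (assumed unit)
    depth_m   : predicted depth of the object center along the optical axis
    fx, fy, cx, cy : pinhole intrinsics (focal lengths and principal point)
    """
    x1, y1, x2, y2 = bbox_xyxy
    # Projected 2D object center = box center + predicted offset.
    u = 0.5 * (x1 + x2) + offset_px[0]
    v = 0.5 * (y1 + y2) + offset_px[1]
    # Invert the pinhole projection u = fx * X / Z + cx, v = fy * Y / Z + cy.
    X = (u - cx) * depth_m / fx
    Y = (v - cy) * depth_m / fy
    return (X, Y, depth_m)


if __name__ == "__main__":
    # Box centered on the principal point, with a small predicted offset.
    t = backproject_center(
        bbox_xyxy=(300.0, 200.0, 340.0, 280.0),
        offset_px=(10.0, -5.0),
        depth_m=2.0,
        fx=500.0, fy=500.0, cx=320.0, cy=240.0,
    )
    print(t)
```

Predicting the center as an offset from the box, rather than as an absolute image coordinate, ties the translation estimate to a detection the model has already localized, which is consistent with the bounding-box conditioning the caption describes.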