Co-op: Correspondence-based Novel Object Pose Estimation

Sungphill Moon, Hyeontae Son, Dongcheol Hur, Sangwook Kim

TL;DR

Co-op estimates the 6DoF pose of unseen objects from a single RGB image by learning correspondences between the input and a small set of pre-rendered templates, which lets it generalize without per-object training. The method pairs a coarse stage, which finds semi-dense correspondences through patch-level classification and offset regression, with a dense refinement stage that uses probabilistic flow and a differentiable PnP layer to update the pose in a render-and-compare loop. An optional pose-selection module supports a multiple-hypothesis strategy to further boost accuracy, and CroCo-pretrained transformers underpin the entire architecture. On the seven core BOP datasets, Co-op delivers state-of-the-art performance in RGB-only settings and maintains strong results with RGB-D inputs, giving fast, accurate, and robust pose estimates for unseen objects in cluttered scenes.

Abstract

We propose Co-op, a novel method for accurately and robustly estimating the 6DoF pose of objects unseen during training from a single RGB image. Our method requires only the CAD model of the target object and can precisely estimate its pose without any additional fine-tuning. While existing model-based methods suffer from inefficiency due to using a large number of templates, our method enables fast and accurate estimation with a small number of templates. This improvement is achieved by finding semi-dense correspondences between the input image and the pre-rendered templates. Our method achieves strong generalization performance by leveraging a hybrid representation that combines patch-level classification and offset regression. Additionally, our pose refinement model estimates probabilistic flow between the input image and the rendered image, refining the initial estimate to an accurate pose using a differentiable PnP layer. We demonstrate that our method not only estimates object poses rapidly but also outperforms existing methods by a large margin on the seven core datasets of the BOP Challenge, achieving state-of-the-art accuracy.
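The hybrid representation mentioned in the abstract (patch-level classification plus offset regression) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `decode_correspondences`, the 16-pixel patch size, the extra "no match" class in the last column, the row-major patch ordering, and the offset range of [-0.5, 0.5] patch widths are all assumptions made for illustration.

```python
import numpy as np

def decode_correspondences(class_logits, offsets, grid_w, patch=16, min_conf=0.5):
    """Turn per-query-patch predictions into template pixel coordinates.

    class_logits: (Nq, Nt + 1) scores over template patches; the last
                  column is an assumed "no match" class.
    offsets:      (Nq, 2) assumed (x, y) offsets in [-0.5, 0.5] patch
                  widths, relative to the matched template patch center.
    grid_w:       number of patches per row in the template (row-major).
    """
    # Softmax over template patches (numerically stabilized).
    z = class_logits - class_logits.max(axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)

    best = probs.argmax(axis=1)   # classification: which template patch
    conf = probs.max(axis=1)      # confidence of that choice
    no_match = class_logits.shape[1] - 1
    keep = (best != no_match) & (conf >= min_conf)

    # Patch index -> patch center in pixels.
    rows, cols = np.divmod(best[keep], grid_w)
    centers = np.stack([cols, rows], axis=1) * patch + patch / 2.0

    # Regression: sub-patch offset refines the coarse patch center.
    template_xy = centers + offsets[keep] * patch
    return np.flatnonzero(keep), template_xy, conf[keep]
```

The returned query-patch indices and template pixel locations form 2D-2D matches. Because the templates are rendered from the CAD model, each template pixel can be associated with a 3D point on the model, which is what a PnP solver needs to recover the coarse pose.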

Paper Structure

This paper contains 26 sections, 11 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Examples of 6D pose estimation of novel objects. Our method estimates semi-dense or dense correspondences between the input image and rendered images and uses them to estimate the pose.
  • Figure 2: Overview. We estimate object pose through two main stages. In the Coarse Pose Estimation stage, we estimate semi-dense correspondences between the query image and templates and compute the initial pose using PnP. In the Pose Refinement stage, we refine the initial pose by estimating dense flow between the query and rendered images. Both stages utilize transformer encoders and decoders with identical structures, with the Pose Refinement stage additionally incorporating a DPT module after the decoder for dense prediction.
  • Figure 3: Visualization of Our Hybrid Representation. Left: Patch-level classification results; matching patches are highlighted with the same color. Right: Offset regression within template patches to refine correspondences; red arrows represent the estimated offsets.
  • Figure 4: Pose Selection. To achieve more precise pose estimation using a multiple hypothesis strategy, we introduce a pose selection stage.
  • Figure 5: In-plane Rotation Invariant Matching Example. From left to right: Query image, semi-dense correspondences between the query image and the best scoring template, and the coarse pose recovered using the PnP algorithm.
  • ...and 2 more figures
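
The refinement stage described in the Figure 2 caption estimates dense flow between the query image and an image rendered at the current pose estimate. A rough sketch of how such a flow field could be turned into confidence-weighted 2D-3D correspondences is given below. It is only an illustration under stated assumptions: the function name `flow_to_2d3d`, the use of a rendered depth map to obtain 3D points, the flow direction (rendered pixel to query pixel), and the confidence threshold are hypothetical and not taken from the paper.

```python
import numpy as np

def flow_to_2d3d(render_depth, K, flow, confidence, min_conf=0.5):
    """Build weighted 2D-3D correspondences from a dense flow field.

    render_depth: (H, W) depth of the render at the current pose; 0 = background.
    K:            (3, 3) pinhole camera intrinsics.
    flow:         (H, W, 2) assumed (dx, dy) displacement from each rendered
                  pixel to its matching query pixel.
    confidence:   (H, W) per-pixel flow confidence in [0, 1].
    """
    H, W = render_depth.shape
    v, u = np.mgrid[0:H, 0:W]

    # Keep object pixels whose flow is trusted.
    mask = (render_depth > 0) & (confidence >= min_conf)
    u_m = u[mask].astype(float)
    v_m = v[mask].astype(float)
    z = render_depth[mask]

    # Back-project rendered pixels into the camera frame of the current pose.
    x = (u_m - K[0, 2]) * z / K[0, 0]
    y = (v_m - K[1, 2]) * z / K[1, 1]
    pts3d = np.stack([x, y, z], axis=1)

    # Follow the flow to find where each 3D point appears in the query image.
    pts2d = np.stack([u_m, v_m], axis=1) + flow[mask]
    return pts3d, pts2d, confidence[mask]
```

In this sketch the 3D points live in the camera frame of the current estimate, so solving a weighted PnP problem on the output would yield a relative pose update. Repeating render, flow estimation, and PnP gives a render-and-compare loop; the paper uses a differentiable PnP layer for this step, which the sketch does not implement.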