HiPose: Hierarchical Binary Surface Encoding and Correspondence Pruning for RGB-D 6DoF Object Pose Estimation

Yongliang Lin, Yongzhi Su, Praveen Nathan, Sandeep Inuganti, Yan Di, Martin Sundermeyer, Fabian Manhardt, Didier Stricker, Jason Rambach, Yu Zhang

TL;DR

HiPose presents a real-time RGB-D 6DoF object pose estimator that removes the need for refinement by learning dense 3D-3D correspondences through a hierarchical binary surface encoding. The method uses a coarse-to-fine, RANSAC-free pipeline with hierarchical correspondence pruning, leveraging a bidirectional CNN-RandLANet fusion backbone and a Kabsch solver to progressively refine pose and exclude outliers. Across LM-O, YCB-V, and T-LESS, HiPose achieves state-of-the-art or competitive accuracy without rendering-based refinement, while being approximately 40x faster than refinement-based counterparts. Trained primarily on synthetic data, the approach demonstrates robustness to depth noise and occlusion, making it suitable for real-time, depth-enabled robotics and AR applications.

Abstract

In this work, we present a novel dense-correspondence method for 6DoF object pose estimation from a single RGB-D image. While many existing data-driven methods achieve impressive performance, they tend to be time-consuming due to their reliance on rendering-based refinement. To circumvent this limitation, we present HiPose, which establishes 3D-3D correspondences in a coarse-to-fine manner with a hierarchical binary surface encoding. Unlike previous dense-correspondence methods, we estimate the correspondence surface by employing point-to-surface matching and iteratively constricting the surface until it becomes a correspondence point, while gradually removing outliers. Extensive experiments on the public benchmarks LM-O, YCB-V, and T-LESS demonstrate that our method surpasses all refinement-free methods and is even on par with expensive refinement-based approaches. Crucially, our approach is computationally efficient and enables real-time-critical applications with high accuracy requirements.
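The solve-then-prune loop described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: `kabsch` is the standard SVD-based rigid alignment, while `prune_and_solve`, its median-based threshold, and the `keep_factor` parameter are hypothetical. The residual to the matched model point stands in for the point-to-surface distance used in the paper.

```python
import numpy as np

def kabsch(src, dst):
    """Least-squares rigid transform (R, t) such that dst ≈ src @ R.T + t."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)  # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = -1.0 if np.linalg.det(Vt.T @ U.T) < 0 else 1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t

def prune_and_solve(model_pts, scene_pts, n_iters=3, keep_factor=2.0):
    """Alternate pose solving and outlier pruning (hypothetical threshold rule)."""
    keep = np.ones(len(scene_pts), dtype=bool)
    R, t = kabsch(model_pts, scene_pts)
    for _ in range(n_iters):
        # Residual of every correspondence under the current pose estimate.
        resid = np.linalg.norm(model_pts @ R.T + t - scene_pts, axis=1)
        # Assumption: drop correspondences far above the median kept residual.
        thresh = keep_factor * np.median(resid[keep]) + 1e-9
        new_keep = keep & (resid <= thresh)
        if new_keep.sum() < 3:  # too few points left for a stable solve
            break
        keep = new_keep
        R, t = kabsch(model_pts[keep], scene_pts[keep])
    return R, t, keep

# Synthetic check: a known pose plus ten gross outliers.
rng = np.random.default_rng(0)
model = rng.uniform(-1.0, 1.0, size=(200, 3))
angle = np.deg2rad(30.0)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0,            0.0,           1.0]])
t_true = np.array([0.5, -0.2, 1.0])
scene = model @ R_true.T + t_true
scene[:10] += np.array([3.0, 0.0, 0.0])  # corrupt the first ten correspondences
R_est, t_est, keep = prune_and_solve(model, scene)
```

Because pruned correspondences never re-enter the kept set, the set of inliers shrinks monotonically, which mirrors the "gradually removing outliers" behaviour without any RANSAC-style sampling.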

Paper Structure (25 sections, 4 equations, 10 figures, 9 tables)

Figures (10)

  • Figure 1: Illustration of HiPose: (a) For every point in the input point cloud (with color and normals), our network outputs a binary code that establishes a correspondence to a sub-surface on the object. (b) With the coarse-level matching, we estimate an initial pose $pose_m$. The additional $n$ bits are used for iterative fine-grained matching, pose estimation, and gradual outlier rejection. Note that this process is render-free and RANSAC-free, which keeps the algorithm fast.
  • Figure 2: Overview: Our framework takes an RGB-D image crop as input and predicts an $m+n$-bit binary code for every point cloud patch on the target object, using a full-flow bidirectional fusion network. The first $m$ bits point to a relatively coarse surface (blue line), while the final $n$ bits are used $n$ times as indicators to perform hierarchical surface partitioning (red lines). By iteratively identifying fine-grained point-to-surface correspondences, the algorithm finally yields an accurately estimated pose. The colored patches on the model represent different surface partitions.
  • Figure 3: Correspondence Pruning. The green line shows an example where a point of the point cloud lies on the transformed surface under the estimated pose. The distance between a point in the point cloud and the transformed surface is represented by the blue dashed line. The red line shows another case, where the transformed surface is far away from the point in the point cloud (yellow point). Consequently, the correspondence depicted by the red line will be removed in the next iteration.
  • Figure 4: We conduct an ablation study on selecting the default initial bit $m_{default}$ using $8$ objects from the LM-O dataset. The flat curve illustrates that our proposed design is robust and has a clear advantage over the non-hierarchical variant ($16$ as initial bit).
  • Figure 5: Network Architecture: The network comprises four encoder blocks and four decoder blocks. Each block upsamples or downsamples its input, processes the RGB and point features, and subsequently merges them, except in the last decoder block. In the RGB image branch, we employ ConvNeXt blocks [Liu2022ACF] as the encoders and PSPNet blocks [zhao2017pyramid] as the decoders. For the point cloud branch, we use modules derived from RandLA-Net [hu2020randla]. Here, 'bsz' refers to the batch size, 'npts' denotes the number of points, and 'H/W' represents the height and width of the image.
  • ...and 5 more figures
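
The coarse-to-fine reading of the $m+n$-bit code in Figures 1 and 2 can be illustrated with a small sketch. It assumes, purely for illustration, that the finest surface partitions are numbered so that every code prefix covers a contiguous range of leaf indices; how the paper actually partitions the mesh is not specified in this section, and the function name `surface_ranges` is hypothetical.

```python
def surface_ranges(bits, m):
    """Return the leaf-index range [lo, hi) covered by each code prefix.

    bits: predicted binary code as a list of 0/1 values (length m + n).
    m:    number of leading bits that select the initial coarse surface.
    Assumption: leaves are ordered so that a prefix maps to a contiguous range.
    """
    d = len(bits)
    if not 1 <= m <= d:
        raise ValueError("m must satisfy 1 <= m <= len(bits)")
    ranges = []
    for k in range(m, d + 1):
        prefix = int("".join(str(b) for b in bits[:k]), 2)
        width = 1 << (d - k)  # leaves under a prefix of length k
        ranges.append((prefix * width, (prefix + 1) * width))
    return ranges

# A 5-bit code with m = 2: the surface halves with every additional bit.
levels = surface_ranges([1, 0, 1, 1, 0], m=2)
# levels == [(16, 24), (20, 24), (22, 24), (22, 23)]
```

Each extra bit halves the candidate surface, so after all $n$ refinement bits the range has shrunk to a single leaf, matching the idea of constricting the surface until it becomes a correspondence point.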