WALDO: Where Unseen Model-based 6D Pose Estimation Meets Occlusion

Sajjad Pakdamansavoji, Yintao Ma, Amir Rasouli, Tongtong Cao

TL;DR

WALDO tackles 6D pose estimation under occlusion for unseen objects by integrating occlusion-aware inference with a robust training regime. It combines 3D-aware detection, ViT-based feature extraction, and a multi-stage pose estimation pipeline that uses dynamic, probability-guided dense sampling and multiple initial hypotheses, followed by iterative refinement. The approach is augmented with occlusion-focused training and an occlusion-aware evaluation protocol, yielding improved accuracy and faster inference on BOP-Core datasets, notably under heavy occlusion. Practically, WALDO offers a scalable, robust solution for real-world robotics and AR, where occlusion and unseen objects are common, and its unbiased average recall (UAR) metric gives a fairer evaluation across visibility levels.

Abstract

Accurate 6D object pose estimation is vital for robotics, augmented reality, and scene understanding. For seen objects, high accuracy is often attainable via per-object fine-tuning, but generalizing to unseen objects remains a challenge. To address this problem, prior work assumes access to CAD models at test time and typically follows a multi-stage pipeline to estimate poses: detect and segment the object, propose an initial pose, and then refine it. Under occlusion, however, the early stages of such pipelines are prone to errors, which can propagate through the sequential processing and consequently degrade performance. To remedy this shortcoming, we propose four novel extensions to model-based 6D pose estimation methods: (i) a dynamic non-uniform dense sampling strategy that focuses computation on visible regions, reducing occlusion-induced errors; (ii) a multi-hypothesis inference mechanism that retains several confidence-ranked pose candidates, mitigating brittle single-path failures; (iii) iterative refinement to progressively improve pose accuracy; and (iv) a series of occlusion-focused training augmentations that strengthen robustness and generalization. Furthermore, we propose a new visibility-weighted metric for evaluation under occlusion to minimize the bias in existing protocols. Through extensive empirical evaluations, we show that our proposed approach achieves more than 5% improvement in accuracy on ICBIN and more than 2% on BOP dataset benchmarks, while achieving approximately 3 times faster inference.
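
The abstract does not spell out how the visibility-weighted metric is computed. As a rough illustration only, the sketch below assumes one plausible reading: instances are grouped into equal-width visibility bins and recall is averaged per bin, so heavily occluded instances are not swamped by the far more numerous highly visible ones. The function name `visibility_balanced_recall`, the bin count, and the equal-weight averaging are all assumptions, not the paper's definition.

```python
def visibility_balanced_recall(visibility, correct, num_bins=10):
    """Average per-bin recall over visibility bins (hypothetical helper).

    visibility: per-instance visible fraction in [0, 1].
    correct:    per-instance bool, True if the pose counts as correct.
    Assumption: every non-empty bin contributes equally to the final score.
    """
    hits = [0] * num_bins
    totals = [0] * num_bins
    for v, ok in zip(visibility, correct):
        # Map visibility 1.0 into the last bin instead of overflowing.
        b = min(int(v * num_bins), num_bins - 1)
        totals[b] += 1
        hits[b] += int(ok)
    per_bin = [h / t for h, t in zip(hits, totals) if t > 0]
    if not per_bin:
        raise ValueError("no instances to evaluate")
    return sum(per_bin) / len(per_bin)


# Eight easy (mostly visible) hits and two heavily occluded misses.
visibility = [0.95] * 8 + [0.15] * 2
correct = [True] * 8 + [False] * 2
plain_recall = sum(correct) / len(correct)
balanced = visibility_balanced_recall(visibility, correct)
```

In this toy case the plain recall is dominated by the eight visible instances, while the balanced score gives the occluded bin the same weight as the visible one, which is the kind of bias correction the abstract alludes to.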

Paper Structure

This paper contains 22 sections, 10 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Scatter plot comparing unbiased average recall (UAR) versus inference time for multiple methods on the BOP-Core* dataset. WALDO outperforms prior methods on both metrics.
  • Figure 2: Overview of our 6D pose estimation framework. The observed RGB–D image and the rendered RGB–D template are first sent to MUSE to obtain bounding boxes and masks. Using these, the inputs are processed and passed through a feature extractor to compute embeddings and their corresponding point clouds. A uniform coarse sample of the point clouds goes to a coarse point-matching module, which produces multiple initial pose hypotheses and per-point occlusion probabilities. Guided by these probabilities, we perform dynamic non-uniform dense sampling and iteratively refine the final pose.
  • Figure 3: (Top) Estimated occlusion/background probabilities visualized by color (red = higher probability). Comparing the observation with the coarse sample indicates accurate probability estimates; interpolation extends these estimates to the full point cloud. (Bottom) Probability-guided dynamic non-uniform sampling allocates more points to visible regions of the object compared to conventional static uniform sampling.
  • Figure 4: Qualitative examples of our proposed (Top) depth augmentation, (Middle) mask augmentation, and (Bottom) object templates rendered from different viewpoints.
  • Figure 5: Ratio of instances in the BOP-Core datasets, grouped into deciles by visibility fraction.
  • ...and 3 more figures
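
The probability-guided sampling described in the Figure 3 caption can be sketched as weighted sampling without replacement, where each point's weight is its estimated probability of being visible. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `probability_guided_sample`, the weight `1 - p`, and the use of Efraimidis-Spirakis keys are illustrative choices, and the per-point probabilities are assumed to have already been interpolated to the full point cloud.

```python
import random


def probability_guided_sample(occlusion_prob, num_samples, seed=0):
    """Pick point indices without replacement, favoring likely-visible points.

    occlusion_prob: per-point probability of occlusion/background in [0, 1]
                    (assumed already interpolated to the dense point cloud).
    Returns a sorted list of selected indices (hypothetical helper).
    """
    if num_samples > len(occlusion_prob):
        raise ValueError("cannot sample more points than are available")
    rng = random.Random(seed)
    eps = 1e-6  # keeps every weight positive so no point is impossible
    weights = [max(1.0 - p, 0.0) + eps for p in occlusion_prob]
    # Efraimidis-Spirakis: key = u ** (1 / w); the largest keys are selected,
    # which yields weighted sampling without replacement.
    keys = [(rng.random() ** (1.0 / w), i) for i, w in enumerate(weights)]
    keys.sort(reverse=True)
    return sorted(i for _, i in keys[:num_samples])


# Toy cloud: the first 50 points are visible, the last 50 are occluded.
occlusion_prob = [0.0] * 50 + [1.0] * 50
selected = probability_guided_sample(occlusion_prob, num_samples=20)
```

With a uniform sampler roughly half of the 20 points would land on the occluded half; here almost all of the budget goes to the visible half, mirroring the bottom row of Figure 3.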