GFreeDet: Exploiting Gaussian Splatting and Foundation Models for Model-free Unseen Object Detection in the BOP Challenge 2024

Xingyu Liu, Gu Wang, Chengxi Li, Yingyue Li, Chenyangguang Zhang, Ziqin Huang, Xiangyang Ji

TL;DR

The paper tackles model-free unseen object detection for open-world mixed reality (MR): unseen objects are learned from short onboarding videos, without CAD models. It introduces GFreeDet, which reconstructs a Gaussian object from the onboarding frames via Gaussian splatting and renders $N_\text{T}=162$ Gaussian-based templates. At test time it uses SAM and DINOv2 for zero-shot, instance segmentation-based matching of test image proposals against these templates, keeping the top $K=5$ global matches and then applying a local descriptor comparison. Key contributions include a unified Gaussian-based object reconstruction and template-rendering pipeline, plus a descriptor-based template matching framework that combines global and local features to yield amodal 2D detections, evaluated on the BOP-H3 benchmark. On HOT3D, HOPEv2, and HANDAL, GFreeDet achieves a competitive $AP_{H3}$ (≈31.9%), with a fast variant (FastSAM) that delivers superior speed. It won the best overall method and best fast method awards in the model-free 2D detection track, demonstrating the viability of model-free detection for mixed reality applications.
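To make the template count concrete: 162 is exactly the number of vertices of an icosahedron subdivided twice (12 → 42 → 162), which is a common way to sample roughly uniform viewpoints on a sphere. The summary does not say how GFreeDet chooses its rendering viewpoints, so the sketch below is only an assumption about where such a number can come from, and `icosphere_vertices` is a hypothetical helper name.

```python
import numpy as np


def icosphere_vertices(subdivisions: int) -> np.ndarray:
    """Return unit-length vertices of an icosahedron subdivided `subdivisions` times.

    Assumption: viewpoints are icosphere vertices. This is not confirmed by the
    summary; it only illustrates one way to obtain 162 viewing directions.
    """
    phi = (1.0 + 5.0 ** 0.5) / 2.0
    verts = [(-1, phi, 0), (1, phi, 0), (-1, -phi, 0), (1, -phi, 0),
             (0, -1, phi), (0, 1, phi), (0, -1, -phi), (0, 1, -phi),
             (phi, 0, -1), (phi, 0, 1), (-phi, 0, -1), (-phi, 0, 1)]
    verts = [np.asarray(v, dtype=float) / np.linalg.norm(v) for v in verts]
    faces = [(0, 11, 5), (0, 5, 1), (0, 1, 7), (0, 7, 10), (0, 10, 11),
             (1, 5, 9), (5, 11, 4), (11, 10, 2), (10, 7, 6), (7, 1, 8),
             (3, 9, 4), (3, 4, 2), (3, 2, 6), (3, 6, 8), (3, 8, 9),
             (4, 9, 5), (2, 4, 11), (6, 2, 10), (8, 6, 7), (9, 8, 1)]

    for _ in range(subdivisions):
        midpoint_index = {}  # edge (i, j) -> index of its midpoint vertex

        def midpoint(a: int, b: int) -> int:
            key = (min(a, b), max(a, b))
            if key not in midpoint_index:
                m = verts[a] + verts[b]
                verts.append(m / np.linalg.norm(m))  # project back onto the sphere
                midpoint_index[key] = len(verts) - 1
            return midpoint_index[key]

        new_faces = []
        for a, b, c in faces:
            ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
            # Split each triangle into four smaller triangles.
            new_faces += [(a, ab, ca), (b, bc, ab), (c, ca, bc), (ab, bc, ca)]
        faces = new_faces

    return np.stack(verts)


# Two subdivisions give one viewing direction per template.
viewpoints = icosphere_vertices(2)
```

Each row of `viewpoints` is a unit vector; a renderer would place the camera along that direction, looking at the object center.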

Abstract

We present GFreeDet, an unseen object detection approach that leverages Gaussian splatting and vision foundation models in a model-free setting. Unlike existing methods that rely on predefined CAD templates, GFreeDet reconstructs objects directly from reference videos using Gaussian splatting, enabling robust detection of novel objects without prior 3D models. Evaluated on the BOP-H3 benchmark, GFreeDet achieves performance comparable to CAD-based methods, demonstrating the viability of model-free detection for mixed reality (MR) applications. Notably, GFreeDet won the best overall method and the best fast method awards in the model-free 2D detection track at BOP Challenge 2024.

Paper Structure

This paper contains 15 sections, 3 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: Overview of the pipeline of GFreeDet for model-free unseen object detection. For an unseen object, we first reconstruct the Gaussian object given calibrated onboarding images, object poses, and the estimated object masks. The Gaussian object is then used to render templates. During inference, we leverage DINOv2 to match the masked regions obtained by SAM against the rendered templates. The final object masks are obtained after filtering the matched results.
  • Figure 2: Visualization of reconstructed templates rendered by Gaussian Splatting. The objects are selected from HOPEv2, HANDAL and HOT3D from top to bottom.
  • Figure 3: Qualitative results on BOP-H3 datasets. Objects are colored by predicted masks. Best viewed by zooming in.
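
The matching step in Figure 1 can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each SAM proposal and each rendered template has already been reduced to one global descriptor vector (for example a DINOv2 embedding), and it assumes the per-object score is the mean of the top-$K$ cosine similarities over that object's templates. The function name `match_proposals`, the averaging rule, and the synthetic descriptors are all hypothetical; the local descriptor comparison and the final filtering are omitted.

```python
import numpy as np


def match_proposals(proposal_desc: np.ndarray, template_desc: np.ndarray, k: int = 5):
    """Assign each proposal to the object whose templates match it best.

    proposal_desc: (P, D) global descriptors of segmented proposals.
    template_desc: (O, T, D) global descriptors of T templates for each of O objects.
    Returns (object_ids, scores), both of shape (P,).

    Assumption: the object score is the mean of the top-k cosine similarities.
    """
    # L2-normalize so that dot products are cosine similarities.
    p = proposal_desc / np.linalg.norm(proposal_desc, axis=-1, keepdims=True)
    t = template_desc / np.linalg.norm(template_desc, axis=-1, keepdims=True)

    sim = np.einsum("pd,otd->pot", p, t)      # (P, O, T) similarities
    topk = np.sort(sim, axis=-1)[..., -k:]    # k best templates per (proposal, object)
    scores = topk.mean(axis=-1)               # (P, O) aggregated scores

    object_ids = scores.argmax(axis=1)
    best_scores = scores[np.arange(len(object_ids)), object_ids]
    return object_ids, best_scores


# Synthetic stand-ins for real descriptors: 3 objects, 162 templates each, D = 32.
rng = np.random.default_rng(0)
centers = rng.normal(size=(3, 32))
templates = centers[:, None, :] + 0.1 * rng.normal(size=(3, 162, 32))
# Two proposals that look like object 2 and object 0, respectively.
proposals = centers[[2, 0]] + 0.1 * rng.normal(size=(2, 32))

ids, scores = match_proposals(proposals, templates, k=5)
```

In a full pipeline, proposals whose best score falls below a threshold would typically be discarded; the threshold and any further filtering rules are not given in this summary.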