Zoo3D: Zero-Shot 3D Object Detection at Scene Level

Andrey Lemeshko, Bulat Gabdullin, Nikita Drozdov, Anton Konushin, Danila Rukhovich, Maksim Kolodiazhnyi

TL;DR

Zoo3D is the first training-free, open-vocabulary 3D object detection framework that operates at the scene level. It splits detection into two parts: a class-agnostic 3D box predictor derived from MaskClustering, and an open-vocabulary label assignment module built on CLIP and SAM. This yields two variants, the zero-shot Zoo3D$_0$ and the self-supervised Zoo3D$_1$. The approach also extends to posed and unposed multi-view images by using DUSt3R to bridge 2D and 3D representations, which allows image-only inference without ground-truth poses. On ScanNet-based benchmarks and ARKitScenes, Zoo3D achieves state-of-the-art open-vocabulary performance in both the zero-shot and self-supervised settings, often surpassing training-based methods, which supports the viability of training-free spatial understanding with off-the-shelf foundation models. The work opens avenues for faster, annotation-free 3D perception and motivates future improvements in reconstruction speed, segmentation quality, and label assignment efficiency.

Abstract

3D object detection is fundamental for spatial understanding. Real-world environments demand models capable of recognizing diverse, previously unseen objects, which remains a major limitation of closed-set methods. Existing open-vocabulary 3D detectors relax annotation requirements but still depend on training scenes, either as point clouds or images. We take this a step further by introducing Zoo3D, the first training-free 3D object detection framework. Our method constructs 3D bounding boxes via graph clustering of 2D instance masks, then assigns semantic labels using a novel open-vocabulary module with best-view selection and view-consensus mask generation. Zoo3D operates in two modes: the zero-shot Zoo3D$_0$, which requires no training at all, and the self-supervised Zoo3D$_1$, which refines 3D box prediction by training a class-agnostic detector on Zoo3D$_0$-generated pseudo labels. Furthermore, we extend Zoo3D beyond point clouds to work directly with posed and even unposed images. Across ScanNet200 and ARKitScenes benchmarks, both Zoo3D$_0$ and Zoo3D$_1$ achieve state-of-the-art results in open-vocabulary 3D object detection. Remarkably, our zero-shot Zoo3D$_0$ outperforms all existing self-supervised methods, hence demonstrating the power and adaptability of training-free, off-the-shelf approaches for real-world 3D understanding. Code is available at https://github.com/col14m/zoo3d .
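
To make the box-construction idea concrete, the following is a minimal sketch of grouping 2D instance masks through a graph and fitting a box to each group. It is not the MaskClustering procedure used by the authors. It assumes each mask has already been lifted to a set of 3D point indices, links two masks when their point sets overlap strongly, and takes connected components as objects. The names `cluster_masks`, `axis_aligned_box`, and the `min_iou` threshold are hypothetical.

```python
from itertools import combinations


def cluster_masks(mask_points, min_iou=0.5):
    """Merge masks (sets of 3D point indices) that overlap strongly.

    Masks are graph nodes; an edge joins two masks whose point-index
    sets have IoU >= min_iou. Each connected component becomes one
    object, returned as the union of its point indices.
    NOTE: illustrative stand-in, not the paper's clustering criterion.
    """
    parent = list(range(len(mask_points)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(mask_points)), 2):
        a, b = mask_points[i], mask_points[j]
        union = len(a | b)
        if union and len(a & b) / union >= min_iou:
            parent[find(i)] = find(j)

    clusters = {}
    for i, pts in enumerate(mask_points):
        clusters.setdefault(find(i), set()).update(pts)
    return list(clusters.values())


def axis_aligned_box(points, indices):
    """Return (min_corner, max_corner) of the selected 3D points."""
    xs, ys, zs = zip(*(points[i] for i in indices))
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))


if __name__ == "__main__":
    points = [(0, 0, 0), (1, 0, 0), (1, 1, 1), (5, 5, 5), (6, 5, 5)]
    masks = [{0, 1, 2}, {1, 2}, {3, 4}]  # two views of one object, plus another
    for cluster in cluster_masks(masks):
        print(sorted(cluster), axis_aligned_box(points, cluster))
```

Connected components keep the sketch short. A real system would need a stricter consistency test, since a single spurious edge merges two objects.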

Paper Structure

This paper contains 45 sections, 8 equations, 8 figures, 18 tables.

Figures (8)

  • Figure 1: Open-vocabulary 3D object detection aims to localize 3D bounding boxes given a textual description. We demonstrate that this task can be solved in a zero-shot mode (Zoo3D$_0$). Our self-supervised image-based approach Zoo3D$_1$ performs on par with point cloud-based methods that are trained with 3D bounding box supervision.
  • Figure 1: Data used for training Zoo3D models.
  • Figure 2: Inference pipeline of Zoo3D given point cloud inputs. Zoo3D$_0$ leverages MaskClustering to predict class-agnostic 3D bounding boxes from a point cloud and images (top-left), while Zoo3D$_1$ infers 3D bounding boxes from point clouds with TR3D (bottom-left). Both Zoo3D$_0$ and Zoo3D$_1$ assign semantic labels to 3D bounding boxes with the same Open-vocabulary Module (right). Given images, a full point cloud, and a 3D bounding box of an object, the module crops the point cloud using the 3D bounding box, selects the top $k$ views based on visibility, and projects the visible points of the object onto these views. Object masks are obtained with SAM, CLIP embeddings are aggregated across views, and the text label with the most similar embedding is assigned.
  • Figure 3: Three operation modes of Zoo3D: with posed images and point clouds as inputs (a), with posed images only (b), and with unposed images only (c). Ground-truth input modalities are marked in blue. In the latter two scenarios, missing modalities are derived using DUSt3R.
  • Figure 4: Qualitative results of Zoo3D$_1$ on ScanNet200.
  • ...and 3 more figures
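
The Open-vocabulary Module steps listed in the Figure 2 caption (crop, select views by visibility, aggregate embeddings, pick the closest text label) can be sketched roughly as below. This is an illustration under strong assumptions, not the authors' implementation: visibility is approximated by counting points that project inside the image of a pinhole camera, with no occlusion test; SAM masking is skipped; and the per-view and text embeddings are plain lists standing in for CLIP features. All function names and the camera dictionary layout are hypothetical.

```python
import math


def crop_to_box(points, box_min, box_max):
    """Keep the points that lie inside an axis-aligned 3D box."""
    return [p for p in points
            if all(lo <= c <= hi for c, lo, hi in zip(p, box_min, box_max))]


def count_visible(points, cam):
    """Count points in front of the camera that project inside the image.

    cam is a dict with world-to-camera rotation "R" (3 rows), translation
    "t", focal length "f", principal point "cx"/"cy", and "width"/"height".
    Occlusion is ignored in this sketch.
    """
    visible = 0
    for p in points:
        x, y, z = (sum(r * c for r, c in zip(row, p)) + t
                   for row, t in zip(cam["R"], cam["t"]))
        if z <= 0:
            continue  # behind the camera
        u = cam["f"] * x / z + cam["cx"]
        v = cam["f"] * y / z + cam["cy"]
        if 0 <= u < cam["width"] and 0 <= v < cam["height"]:
            visible += 1
    return visible


def top_k_views(points, cams, k):
    """Indices of the k cameras that see the most object points."""
    scores = [count_visible(points, cam) for cam in cams]
    order = sorted(range(len(cams)), key=lambda i: scores[i], reverse=True)
    return [i for i in order[:k] if scores[i] > 0]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def assign_label(view_embeddings, text_embeddings):
    """Average the per-view embeddings and return the closest text label."""
    dim = len(view_embeddings[0])
    mean = [sum(e[d] for e in view_embeddings) / len(view_embeddings)
            for d in range(dim)]
    return max(text_embeddings, key=lambda name: cosine(mean, text_embeddings[name]))
```

Mean pooling is only one way to aggregate embeddings across views. The caption does not specify the aggregation rule, so that choice is an assumption of this sketch.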