Zoo3D: Zero-Shot 3D Object Detection at Scene Level
Andrey Lemeshko, Bulat Gabdullin, Nikita Drozdov, Anton Konushin, Danila Rukhovich, Maksim Kolodiazhnyi
TL;DR
Zoo3D is the first training-free, open-vocabulary 3D object detection framework operating at the scene level. It splits detection into a class-agnostic 3D box predictor, derived from MaskClustering, and an open-vocabulary label assignment module that leverages CLIP and SAM. This yields two variants: the zero-shot Zoo3D$_0$ and the self-supervised Zoo3D$_1$. The approach extends to posed and unposed multi-view images by incorporating DUSt3R to bridge 2D and 3D representations, allowing image-only inference without ground-truth poses. Across ScanNet-based benchmarks and ARKitScenes, Zoo3D achieves state-of-the-art open-vocabulary performance, often surpassing training-based methods in both zero-shot and self-supervised settings, which shows that training-free spatial understanding with off-the-shelf foundation models is viable. The work opens avenues for faster, annotation-free 3D perception and motivates future improvements in reconstruction speed, segmentation quality, and label assignment efficiency.
Abstract
3D object detection is fundamental for spatial understanding. Real-world environments demand models capable of recognizing diverse, previously unseen objects, which remains a major limitation of closed-set methods. Existing open-vocabulary 3D detectors relax annotation requirements but still depend on training scenes, either as point clouds or images. We take this a step further by introducing Zoo3D, the first training-free 3D object detection framework. Our method constructs 3D bounding boxes via graph clustering of 2D instance masks, then assigns semantic labels using a novel open-vocabulary module with best-view selection and view-consensus mask generation. Zoo3D operates in two modes: the zero-shot Zoo3D$_0$, which requires no training at all, and the self-supervised Zoo3D$_1$, which refines 3D box prediction by training a class-agnostic detector on Zoo3D$_0$-generated pseudo labels. Furthermore, we extend Zoo3D beyond point clouds to work directly with posed and even unposed images. On the ScanNet200 and ARKitScenes benchmarks, both Zoo3D$_0$ and Zoo3D$_1$ achieve state-of-the-art results in open-vocabulary 3D object detection. Remarkably, our zero-shot Zoo3D$_0$ outperforms all existing self-supervised methods, demonstrating the power and adaptability of training-free, off-the-shelf approaches for real-world 3D understanding. Code is available at https://github.com/col14m/zoo3d .
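The two-stage idea in the abstract (cluster 2D instance masks into 3D objects, then assign open-vocabulary labels) can be illustrated with a toy sketch. This is not the authors' implementation: the function names `cluster_masks`, `axis_aligned_box`, and `assign_label`, the 0.5 overlap threshold, and the toy data are all hypothetical. It assumes each 2D mask has already been lifted to a set of 3D point ids, substitutes plain connected components for the paper's graph clustering, and uses hand-written vectors in place of real CLIP embeddings.

```python
from itertools import combinations
from math import sqrt


def cluster_masks(mask_points, min_overlap=0.5):
    """Group masks whose 3D point-id sets overlap (connected components).

    mask_points: list of sets of point ids, one set per lifted 2D mask.
    Returns a list of sets of mask indices, ordered by smallest index.
    """
    parent = list(range(len(mask_points)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(mask_points)), 2):
        a, b = mask_points[i], mask_points[j]
        smaller = min(len(a), len(b))
        # Link two masks if they share enough of the smaller mask's points.
        if smaller and len(a & b) / smaller >= min_overlap:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(mask_points)):
        groups.setdefault(find(i), set()).add(i)
    return sorted(groups.values(), key=min)


def axis_aligned_box(points):
    """Return (min_corner, max_corner) of a list of (x, y, z) points."""
    xs, ys, zs = zip(*points)
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))


def assign_label(object_embedding, text_embeddings):
    """Pick the class name whose embedding has the highest cosine similarity.

    Assumes all embeddings are non-zero vectors of equal length.
    """
    def cosine(u, v):
        dot = sum(x * y for x, y in zip(u, v))
        return dot / (sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v)))

    return max(text_embeddings, key=lambda name: cosine(object_embedding, text_embeddings[name]))


# Toy scene: five 3D points and three lifted masks (hypothetical data).
cloud = {0: (0, 0, 0), 1: (1, 0, 0), 2: (1, 1, 1), 3: (5, 5, 0), 4: (6, 5, 1)}
masks = [{0, 1}, {1, 2}, {3, 4}]

clusters = cluster_masks(masks)
boxes = []
for cluster in clusters:
    point_ids = set().union(*(masks[m] for m in cluster))
    boxes.append(axis_aligned_box([cloud[p] for p in sorted(point_ids)]))

# Stand-in vectors; a real pipeline would use image and text encoders here.
label = assign_label([0.9, 0.1], {"chair": [1.0, 0.0], "table": [0.0, 1.0]})
```

In this toy scene the first two masks share a point and merge into one object, while the third mask forms its own object, so two class-agnostic boxes are produced before any label is attached. Keeping box construction separate from labeling mirrors the split the abstract describes between class-agnostic box prediction and open-vocabulary label assignment.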
