Detect Anything 3D in the Wild

Hanxue Zhang, Haoran Jiang, Qingsong Yao, Yanan Sun, Renrui Zhang, Hao Zhao, Hongyang Li, Hongzi Zhu, Zetong Yang

TL;DR

This work tackles zero-shot generalization in 3D object detection from monocular imagery by building a promptable 3D foundation model, DetAny3D. It leverages strong 2D priors from SAM and DINO via a 2D Aggregator and introduces a 3D Interpreter with Zero-Embedding Mapping to safely transfer 2D knowledge into 3D, guided by depth and intrinsic cues. Training on the diverse DA3D dataset enables open-world 3D detection across unseen categories and novel camera configurations, with significant gains over prior baselines in zero-shot settings and competitive in-domain performance. The approach opens pathways for robust, open-world 3D perception in real-world applications like autonomous driving and embodied AI, while highlighting areas for future work such as temporal modeling and real-time efficiency.

Abstract

Despite the success of deep learning in closed-set 3D object detection, existing approaches struggle with zero-shot generalization to novel objects and camera configurations. We introduce DetAny3D, a promptable 3D detection foundation model capable of detecting any novel object under arbitrary camera configurations using only monocular inputs. Training a foundation model for 3D detection is fundamentally constrained by the limited availability of annotated 3D data, which motivates DetAny3D to leverage the rich prior knowledge embedded in extensively pre-trained 2D foundation models to compensate for this scarcity. To effectively transfer 2D knowledge to 3D, DetAny3D incorporates two core modules: the 2D Aggregator, which aligns features from different 2D foundation models, and the 3D Interpreter with Zero-Embedding Mapping, which stabilizes early training in 2D-to-3D knowledge transfer. Experimental results validate the strong generalization of our DetAny3D, which not only achieves state-of-the-art performance on unseen categories and novel camera configurations, but also surpasses most competitors on in-domain data. DetAny3D sheds light on the potential of the 3D foundation model for diverse applications in real-world scenarios, e.g., rare object detection in autonomous driving, and demonstrates promise for further exploration of 3D-centric tasks in open-world settings. More visualization results can be found at our code repository.
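
To illustrate why zero-initialized layers (which the Figure 2 caption names as the basis of Zero-Embedding Mapping) can stabilize early training, the following is a minimal NumPy sketch, not the paper's implementation. The helper names `zero_init_linear` and `inject`, the tensor sizes, the toy loss, and the learning rate are all assumptions made for illustration. The point it shows: at initialization the injected term is exactly zero, so the pre-trained features pass through untouched, yet the gradient with respect to the zero weights is generally non-zero, so the mapping can still learn.

```python
import numpy as np

rng = np.random.default_rng(0)

def zero_init_linear(in_dim, out_dim):
    """Hypothetical helper: a linear map whose weights and bias start at zero."""
    return {"W": np.zeros((out_dim, in_dim)), "b": np.zeros(out_dim)}

def inject(features, prior, layer):
    """Residual injection: features plus a linearly mapped prior."""
    return features + prior @ layer["W"].T + layer["b"]

# Stand-ins (assumed shapes): 4 tokens of pre-trained 2D features and a
# 3-dimensional geometric prior per token.
features = rng.normal(size=(4, 8))
prior = rng.normal(size=(4, 3))

layer = zero_init_linear(in_dim=3, out_dim=8)
out_at_init = inject(features, prior, layer)

# At initialization the injected term is zero, so the features are unchanged.
unchanged_at_init = np.allclose(out_at_init, features)

# One gradient-descent step on a toy loss L = 0.5 * ||out - target||^2.
target = rng.normal(size=(4, 8))
grad_out = out_at_init - target             # dL/d(out)
layer["W"] -= 0.1 * (grad_out.T @ prior)    # dL/dW does not depend on W being zero
layer["b"] -= 0.1 * grad_out.sum(axis=0)    # dL/db

out_after_step = inject(features, prior, layer)
changed_after_step = not np.allclose(out_after_step, features)

print(unchanged_at_init, changed_after_step)
```

The design choice this sketch highlights is that the prior enters through an additive branch: the pre-trained pathway is left intact at step zero, and the new signal is blended in only as the zero-initialized weights move away from zero.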

Paper Structure

This paper contains 34 sections, 15 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: Introducing DetAny3D, a promptable 3D detection foundation model capable of detecting any 3D object with arbitrary monocular images in diverse scenes. Our framework enables multi-prompt interaction (e.g., box, point, and text) to deliver open-world 3D detection results ($w \times h \times l$, in centimeters) for novel objects across various domains. It achieves significant zero-shot generalization, outperforming SOTA by up to 21.02 and 5.68 ${\rm AP_{3D}}$ on novel categories and novel datasets with new camera configurations.
  • Figure 2: Overview of DetAny3D. It supports arbitrary monocular images as input and performs 3D object detection driven by prompts: box, point, and text prompts specify target objects, and optional camera calibration calibrates geometric projections. DetAny3D comprises two key modules: (b) 2D Aggregator, which employs a hierarchical cross-attention mechanism to dynamically fuse knowledge from SAM and DINO, with a learnable gate controlling each component’s contribution to the geometric embedding; (c) 3D Interpreter, which introduces a Zero-Embedding Mapping (ZEM) strategy based on zero-initialized layers to gradually inject geometric priors, thereby enabling zero-shot 3D grounding and avoiding catastrophic forgetting during knowledge transfer.
  • Figure 3: Zero-Shot Transfer Video Generation via Sora. We provide Sora with Internet-sourced images. As shown, when controlled with a 3D bounding box, Sora can better capture the scene’s geometric relationships. In contrast, when controlled only by a 2D bounding box prompt, Sora respects pixel-level spatial cues but fails to generate accurate geometric offsets.
  • Figure 4: Qualitative Results. We present qualitative examples from open-world detection. In each pair of images, the top row is produced by OVMono3D, and the bottom row by DetAny3D. For each example, the left sub-figure overlays the projected 3D bounding boxes, while the right sub-figure shows the corresponding bird’s-eye view with 1 m $\times$ 1 m grids as the background.
  • Figure 5: The composition of the DA3D dataset.
  • ...and 4 more figures
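
The Figure 2 caption describes the 2D Aggregator as fusing two feature sources through cross-attention with a learnable gate. The following NumPy sketch shows one simple way such a gate could work; it is an assumption-laden illustration rather than the paper's architecture. It uses a single attention level with no learned projections (the caption describes a hierarchical mechanism), a scalar gate, and hypothetical names (`cross_attend`, `gated_fusion`) and tensor sizes.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def cross_attend(queries, context):
    """Single-head scaled dot-product cross-attention (no learned projections)."""
    scale = np.sqrt(queries.shape[-1])
    weights = softmax(queries @ context.T / scale, axis=-1)
    return weights @ context

def gated_fusion(queries, feats_a, feats_b, gate_logit):
    """Blend what the queries read from two feature sources with a scalar gate."""
    gate = 1.0 / (1.0 + np.exp(-gate_logit))  # sigmoid keeps the gate in (0, 1)
    from_a = cross_attend(queries, feats_a)
    from_b = cross_attend(queries, feats_b)
    return gate * from_a + (1.0 - gate) * from_b

rng = np.random.default_rng(0)
queries = rng.normal(size=(5, 16))   # 5 query tokens, dim 16 (arbitrary sizes)
feats_a = rng.normal(size=(10, 16))  # stand-in for features from one 2D model
feats_b = rng.normal(size=(12, 16))  # stand-in for features from another 2D model

# gate_logit = 0 weights both sources equally; a large logit favors source A.
fused_even = gated_fusion(queries, feats_a, feats_b, gate_logit=0.0)
fused_a = gated_fusion(queries, feats_a, feats_b, gate_logit=50.0)
print(fused_even.shape)
```

In a trained model the gate logit would be a learned parameter (possibly per channel or per level), letting optimization decide how much each source contributes; here it is set by hand to show the two extremes.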