Open Vocabulary Monocular 3D Object Detection

Jin Yao, Hao Gu, Xuweiyi Chen, Jiayun Wang, Zezhou Cheng

TL;DR

This work defines Open Vocabulary Monocular 3D Object Detection (OVMono3D) to detect objects from any category in metric 3D using a single image. It decouples 2D recognition/localization from 3D bounding-box estimation via two routes: a training-free GEO geometric unprojection and a data-driven LIFT lifting that fuses depth cues with vision features, enabling zero-shot generalization to novel categories. To address evaluation challenges, it introduces Target-Aware Metrics that mitigate missing-label and naming-ambiguity issues, and demonstrates state-of-the-art zero-shot and solid in-domain performance on Omni3D, highlighting depth estimation as a key bottleneck. The approach provides strong baselines and a practical benchmark for open-vocabulary monocular 3D detection, suggesting future enhancements through depth perception improvements and unsupervised learning integrated with foundation-model priors.
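
The geometric unprojection behind the GEO route can be illustrated with a minimal sketch. It assumes a standard pinhole camera with intrinsics `fx`, `fy`, `cx`, `cy`, a per-pixel metric depth map, and a binary object mask. The function names and the axis-aligned box fit are illustrative assumptions, not the authors' implementation, which may fit oriented boxes and filter outliers.

```python
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]


def unproject_pixel(u: float, v: float, depth: float,
                    fx: float, fy: float, cx: float, cy: float) -> Point3D:
    """Back-project pixel (u, v) with metric depth into camera coordinates."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def unproject_mask(depth_map: Sequence[Sequence[float]],
                   mask: Sequence[Sequence[int]],
                   fx: float, fy: float, cx: float, cy: float) -> List[Point3D]:
    """Unproject every masked pixel that has a valid (positive) depth."""
    points = []
    for v, (depth_row, mask_row) in enumerate(zip(depth_map, mask)):
        for u, (d, m) in enumerate(zip(depth_row, mask_row)):
            if m and d > 0:
                points.append(unproject_pixel(u, v, d, fx, fy, cx, cy))
    return points


def axis_aligned_box(points: List[Point3D]) -> Tuple[Point3D, Point3D]:
    """Return (min corner, max corner) of the points; an assumed simplification."""
    if not points:
        raise ValueError("no points to fit a box to")
    xs, ys, zs = zip(*points)
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))


# Toy 2x2 example: two masked pixels at 2 m depth, unit focal length.
depth_map = [[2.0, 2.0], [2.0, 2.0]]
mask = [[1, 0], [0, 1]]
points = unproject_mask(depth_map, mask, fx=1.0, fy=1.0, cx=0.0, cy=0.0)
box_min, box_max = axis_aligned_box(points)
```

Because each 3D coordinate scales linearly with the estimated depth, any depth error carries over directly to the box position and size, which is consistent with the summary's remark that depth estimation is a key bottleneck.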

Abstract

We propose and study open-vocabulary monocular 3D detection, a novel task that aims to detect objects of any category in metric 3D space from a single RGB image. Existing 3D object detectors either rely on costly sensors such as LiDAR or multi-view setups, or remain confined to closed-vocabulary settings with limited categories, restricting their applicability. We identify two key challenges in this new setting. First, the scarcity of 3D bounding box annotations limits the ability to train generalizable models. To reduce dependence on 3D supervision, we propose a framework that effectively integrates pretrained 2D and 3D vision foundation models. Second, missing labels and semantic ambiguities (e.g., table vs. desk) in existing datasets hinder reliable evaluation. To address this, we design a novel metric that captures model performance while mitigating annotation issues. Our approach achieves state-of-the-art results in zero-shot 3D detection of novel categories as well as in-domain detection on seen classes. We hope our method provides a strong baseline and our evaluation protocol establishes a reliable benchmark for future research.

Paper Structure

This paper contains 37 sections, 3 equations, 12 figures, 7 tables.

Figures (12)

  • Figure 1: Given (a) a single image, we illustrate examples of (b) open-vocabulary 2D detection, which localizes objects of any category within the 2D image plane, covering both seen categories and novel categories not seen during training; (c) closed-vocabulary 3D detection, which detects objects from a predefined set of categories in 3D space; and (d) open-vocabulary 3D object detection, which identifies objects of any category in 3D.
  • Figure 2: 2D vs. 3D Detection Datasets in terms of #Images. Publicly available 3D datasets with 3D annotations are significantly smaller than 2D detection datasets.
  • Figure 3: Proposed Methods. (a) OVMono3D-GEO is a training-free method that predicts 3D detections from 2D via geometric unprojection using off-the-shelf depth estimation (i.e., UniDepthv2 [piccinelli2025unidepthv2]), segmentation (i.e., SAM [kirillov2023segment]), and an open-vocabulary (OV) 2D detector [liu2023grounding]. (b) OVMono3D-LIFT is a learning-based approach that trains a class-agnostic neural network to lift 2D detections and geometric information to 3D. Both approaches decouple 2D recognition and localization from 3D bounding box estimation.
  • Figure 4: By prompting only categories that exist in the annotations, our target-aware evaluation mitigates the negative impact of missing annotations (e.g., "book" in (a)) and naming ambiguity (e.g., "vase" vs. "potted plant" and "chair" vs. "sofa").
  • Figure 5: Qualitative Visualizations on the Omni3D Test Set. For each example, we present the ground truth annotations and the predictions of Cube R-CNN and OVMono3D-LIFT, with the 3D predictions overlaid on the image. For OVMono3D-LIFT, we also present a top-down view with a base grid of $1\,\text{m} \times 1\,\text{m}$ tiles. Base categories are depicted with brown cubes, while novel categories are represented in other colors. Zoom in for best viewing. See the additional qualitative results section (sec:more_qualitative) for more visualizations.
  • ...and 7 more figures
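
The target-aware idea in the Figure 4 caption, prompting the detector only with categories that appear in an image's annotations, can be sketched as below. This is a minimal illustration that assumes each annotation is a dictionary with a `category` key; that field name and the function name are hypothetical and not taken from the paper or the Omni3D toolkit.

```python
from typing import Dict, List


def target_aware_prompts(annotations: List[Dict[str, str]]) -> List[str]:
    """Return the unique category names annotated in one image, in first-seen order."""
    prompts: List[str] = []
    for ann in annotations:
        name = ann["category"]  # assumed field name
        if name not in prompts:
            prompts.append(name)
    return prompts


# Toy image: a "book" is visible but unlabeled, so it never becomes a prompt
# and the detector is not penalized for finding (or missing) it.
image_annotations = [
    {"category": "chair"},
    {"category": "table"},
    {"category": "chair"},
]
prompts = target_aware_prompts(image_annotations)
```

Building the prompt list per image, rather than from the full dataset vocabulary, is what keeps unlabeled objects and near-synonym class names from being counted as false positives under this scheme.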