Open Vocabulary Monocular 3D Object Detection
Jin Yao, Hao Gu, Xuweiyi Chen, Jiayun Wang, Zezhou Cheng
TL;DR
This work defines Open Vocabulary Monocular 3D Object Detection (OVMono3D): detecting objects of any category in metric 3D from a single image. It decouples 2D recognition and localization from 3D bounding-box estimation via two routes: GEO, a training-free geometric unprojection, and LIFT, a data-driven lifting approach that fuses depth cues with vision features. This decoupling enables zero-shot generalization to novel categories. To address evaluation challenges, the work introduces Target-Aware Metrics that mitigate missing-label and naming-ambiguity issues. It demonstrates state-of-the-art zero-shot performance and solid in-domain performance on Omni3D, and identifies depth estimation as a key bottleneck. The approach provides strong baselines and a practical benchmark for open-vocabulary monocular 3D detection, and points to future gains from improved depth perception and from unsupervised learning combined with foundation-model priors.
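To make the idea of geometric unprojection concrete, the sketch below back-projects a 2D box into camera coordinates with a standard pinhole model. It is an illustration only, not the paper's GEO procedure: the function names, the use of a single depth value for the whole box, and the intrinsics (`fx`, `fy`, `cx`, `cy`) are all assumptions made for this example.

```python
def unproject_pixel(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) at a given metric depth into camera coordinates.

    Standard pinhole model; fx, fy, cx, cy are assumed known intrinsics.
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def box_center_and_extent(box_2d, depth, fx, fy, cx, cy):
    """Estimate a 3D center and metric width/height from a 2D box.

    Hypothetical helper: assumes one representative depth for the whole box
    (e.g. a median taken from a predicted depth map), which is a simplification.
    """
    u_min, v_min, u_max, v_max = box_2d
    x0, y0, _ = unproject_pixel(u_min, v_min, depth, fx, fy, cx, cy)
    x1, y1, _ = unproject_pixel(u_max, v_max, depth, fx, fy, cx, cy)
    center = ((x0 + x1) / 2.0, (y0 + y1) / 2.0, depth)
    extent = (x1 - x0, y1 - y0)  # metric width and height at that depth
    return center, extent


# Example: a 100x100 pixel box centered on the principal point, 2 m away.
center, extent = box_center_and_extent(
    (270, 190, 370, 290), depth=2.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0
)
```

Because the metric extent scales linearly with depth, any error in the depth estimate carries over directly to the recovered box position and size, which is consistent with the observation above that depth estimation is a key bottleneck.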
Abstract
We propose and study open-vocabulary monocular 3D detection, a novel task that aims to detect objects of any category in metric 3D space from a single RGB image. Existing 3D object detectors either rely on costly sensors such as LiDAR or multi-view setups, or remain confined to closed-vocabulary settings with limited categories, restricting their applicability. We identify two key challenges in this new setting. First, the scarcity of 3D bounding box annotations limits the ability to train generalizable models. To reduce dependence on 3D supervision, we propose a framework that effectively integrates pretrained 2D and 3D vision foundation models. Second, missing labels and semantic ambiguities (e.g., table vs. desk) in existing datasets hinder reliable evaluation. To address this, we design a novel metric that captures model performance while mitigating annotation issues. Our approach achieves state-of-the-art results in zero-shot 3D detection of novel categories as well as in-domain detection on seen classes. We hope our method provides a strong baseline and our evaluation protocol establishes a reliable benchmark for future research.
