MoCA3D: Monocular 3D Bounding Box Prediction in the Image Plane

Changwoo Jeon, Rishi Upadhyay, Achuta Kadambi

Abstract

Monocular 3D object understanding has largely been cast as a 2D RoI-to-3D box lifting problem. However, emerging downstream applications require image-plane geometry (e.g., projected 3D box corners) which cannot be easily obtained without known intrinsics, a problem for object detection in the wild. We introduce MoCA3D, a Monocular, Class-Agnostic 3D model that predicts projected 3D bounding box corners and per-corner depths without requiring camera intrinsics at inference time. MoCA3D formulates pixel-space localization and depth assignment as dense prediction via corner heatmaps and depth maps. To evaluate image-plane geometric fidelity, we propose Pixel-Aligned Geometry (PAG), which directly measures image-plane corner and depth consistency. Extensive experiments demonstrate that MoCA3D achieves state-of-the-art performance, improving image-plane corner PAG by 22.8% while remaining comparable on 3D IoU, using up to 57 times fewer trainable parameters. Finally, we apply MoCA3D to downstream tasks which were previously impractical under unknown intrinsics, highlighting its utility beyond standard baseline models.
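
To make the dense-prediction formulation concrete, the following is a minimal sketch of how one corner heatmap and its depth map could be decoded into a pixel location and a depth. It is not the authors' implementation: the function names `soft_argmax_2d` and `decode_corner`, the `temperature` parameter, and the probability-weighted depth readout are all assumptions made for illustration. Note that nothing in the decoding step uses camera intrinsics.

```python
import math


def soft_argmax_2d(logits, temperature=1.0):
    """Expected (u, v) pixel location under a softmax over a 2D logit map.

    `logits` is a list of rows (H x W); u indexes columns, v indexes rows.
    Also returns the flattened softmax probabilities (row-major order).
    """
    flat = [value / temperature for row in logits for value in row]
    peak = max(flat)  # subtract the max for numerical stability
    weights = [math.exp(value - peak) for value in flat]
    total = sum(weights)
    width = len(logits[0])
    u = sum(w * (i % width) for i, w in enumerate(weights)) / total
    v = sum(w * (i // width) for i, w in enumerate(weights)) / total
    return u, v, [w / total for w in weights]


def decode_corner(logits, depth_map, temperature=1.0):
    """Decode one corner: soft-argmax location plus a depth value.

    Assumption: depth is read out as the heatmap-weighted average of the
    per-corner depth map. The paper may use a different readout.
    """
    u, v, probs = soft_argmax_2d(logits, temperature)
    flat_depth = [d for row in depth_map for d in row]
    depth = sum(p * d for p, d in zip(probs, flat_depth))
    return u, v, depth
```

Running `decode_corner` once per corner on eight heatmap and depth-map pairs would yield eight `(u, v, depth)` triples, which is the image-plane output described above.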

Paper Structure (40 sections, 24 equations, 9 figures, 4 tables)

Figures (9)

  • Figure 1: MoCA3D architecture. Given an RGB image and a tight oracle 2D bounding box, MoCA3D uses a frozen DINOv3 backbone and a box-conditioned 3D Geometry Transformer with dense modules to predict eight corner heatmaps and per-corner depth maps, yielding pixel-aligned projected 3D box corners and depths.
  • Figure 2: Heatmap Comparison. Predicted corner heatmaps with (a) peak weight $=50.0$ and (b) $=1.0$ (uniform). Larger peak weight sharpens and localizes responses near GT corners, improving soft-argmax stability, while uniform weighting yields flatter heatmaps.
  • Figure 3: Qualitative Results. MoCA3D vs. DetAny3D predictions under oracle 2D boxes on samples from the KITTI, Omni3D, and Hypersim datasets. Detections in green have a 3D IoU of less than 0.1, making them low-quality detections.
  • Figure 4: Efficiency of MoCA3D. We compare (a) trainable parameters and (b) the trade-off between efficiency and performance (PAG$_{uv}$). Efficiency is defined as the inverse of end-to-end inference time per example on CV-Bench [tong2024cambrian1].
  • Figure 5: Driving scene variation guided by MoCA3D. MoCA3D recovers image-plane vehicle geometry (projected 3D box corners) from a single image and uses it to condition diffusion-based generation for diverse prompt-driven edits.
  • ...and 4 more figures
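
The peak-weight comparison in Figure 2 can be illustrated with a small sketch. The caption does not give the loss itself, so the code below assumes a Gaussian target heatmap and a weighted mean-squared error whose per-pixel weight grows with the target value. `gaussian_target`, `peak_weighted_mse`, `sigma`, and the weighting formula are hypothetical choices, not the paper's definition. Under this assumption, `peak_weight=1.0` reduces to a uniform (plain MSE) loss, matching panel (b) of the caption.

```python
import math


def gaussian_target(height, width, corner_u, corner_v, sigma=1.0):
    """Target heatmap with value 1.0 at the ground-truth corner (assumed form)."""
    return [
        [
            math.exp(-((x - corner_u) ** 2 + (y - corner_v) ** 2) / (2.0 * sigma ** 2))
            for x in range(width)
        ]
        for y in range(height)
    ]


def peak_weighted_mse(pred, target, peak_weight=50.0):
    """Weighted MSE; pixels near the target peak get weight up to `peak_weight`.

    Assumed weighting: w = 1 + (peak_weight - 1) * target, so the weight is
    1 far from the corner and `peak_weight` exactly at it.
    """
    total = 0.0
    weight_sum = 0.0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            w = 1.0 + (peak_weight - 1.0) * t
            total += w * (p - t) ** 2
            weight_sum += w
    return total / weight_sum
```

With this weighting, an all-zero (flat) prediction is penalized more heavily at `peak_weight=50.0` than at `peak_weight=1.0`, which is one plausible reason larger peak weights would push the network toward sharper responses near ground-truth corners.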