Weak Cube R-CNN: Weakly Supervised 3D Detection using only 2D Bounding Boxes

Andreas Lau Hansen, Lukas Wanzeck, Dim P. Papadopoulos

TL;DR

This work tackles the high annotation cost of 3D object detection by proposing Weak Cube R-CNN, a monocular, weakly supervised method that learns to predict 3D cubes using only 2D bounding boxes during training. The model builds on Cube R-CNN but replaces 3D ground-truth supervision with pseudo-ground truths derived from frozen foundation models: metric depth from Depth-Anything, ground segmentation from GroundingDINO and SAM, and a depth-based ground plane estimate. A suite of weak losses—GIoU for 2D placement, depth alignment, size priors, normal alignment to the ground, and pose consistency across objects—drives the 3D head to produce coherent cube parameters (image coordinates, depth, dimensions, rotation, and uncertainty) without true 3D labels. Experiments on SUN RGB-D and KITTI show that Weak Cube R-CNN achieves higher accuracy than a time-equalized fully supervised baseline on several common categories and demonstrates the viability of using foundation-model-derived pseudo-ground truths, although centimetre-level precision remains out of reach. Overall, the approach offers a practical pathway to scalable 3D perception by leveraging 2D annotations and pretrained models, forming a strong foundation for future enhancements in weak supervision for 3D detection.
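As an illustration of the 2D placement term mentioned above, the following is a minimal sketch of a generalized IoU (GIoU) loss between two axis-aligned boxes. The function names `giou` and `giou_loss`, the `(x1, y1, x2, y2)` box format, and the `1 - GIoU` loss form are assumptions made for this example; the summary does not state which boxes the paper compares or how the term is weighted.

```python
def giou(box_a, box_b):
    """Generalized IoU of two boxes given as (x1, y1, x2, y2) with x1 < x2, y1 < y2."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Overlap rectangle (zero width/height when the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter

    # Smallest axis-aligned box enclosing both inputs.
    hull = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))

    # IoU minus the fraction of the enclosing box not covered by the union.
    return inter / union - (hull - union) / hull


def giou_loss(box_a, box_b):
    """Loss in [0, 2]: 0 for identical boxes, growing as the boxes move apart."""
    return 1.0 - giou(box_a, box_b)
```

Unlike plain IoU, this loss still varies when the two boxes do not overlap, so it can provide a training signal even for a badly placed prediction.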

Abstract

Monocular 3D object detection is an essential task in computer vision, and it has several applications in robotics and virtual reality. However, 3D object detectors are typically trained in a fully supervised way, relying extensively on 3D labeled data, which is labor-intensive and costly to annotate. This work focuses on weakly-supervised 3D detection to reduce data needs using a monocular method that leverages a single-camera system over expensive LiDAR sensors or multi-camera setups. We propose a general model, Weak Cube R-CNN, which can predict objects in 3D at inference time, requiring only 2D box annotations for training by exploiting the relationship between 2D projections of 3D cubes. Our proposed method utilizes pre-trained frozen foundation 2D models to estimate depth and orientation information on a training set. We use these estimated values as pseudo-ground truths during training. We design loss functions that avoid 3D labels by incorporating information from the external models into the loss. In this way, we aim to implicitly transfer knowledge from these large foundation 2D models without having access to 3D bounding box annotations. Experimental results on the SUN RGB-D dataset show increased accuracy compared to an annotation-time-equalized Cube R-CNN baseline. While not precise enough for centimetre-level measurements, this method provides a strong foundation for further research.

Paper Structure

This paper contains 10 sections, 15 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Weak Cube R-CNN. In contrast to standard 3D object detectors that require 3D ground truths, our proposed method is trained using only 2D bounding boxes but can predict 3D cubes at test time. Weak Cube R-CNN significantly reduces the annotation time since 3D ground-truths require 11$\times$ more time than annotating 2D boxes. More importantly, it does not require access to LiDAR or multi-camera setups.
  • Figure 2: Overview of Weak Cube R-CNN. The model extracts features from an image and predicts objects in 2D and their cubes in 3D. We split the cube into its individual attributes and optimise each attribute with respect to pseudo ground truth information. During training, instead of the single 3D ground truth provided in the fully supervised setting, we must combine several sources of information provided by frozen models to emulate the same ground truth annotation.
  • Figure 3: Ground estimation pipeline showing the point cloud obtained through the depth map. The 2nd step selects the region in the depth map corresponding to the ground in the color image. The depth map is interpreted as a point cloud where plane-RANSAC obtains a normal vector to the ground.
  • Figure 4: Qualitative examples of Weak Cube R-CNN predictions on the SUN RGB-D test set. Images are selected to showcase behaviour in various scenarios. To avoid clutter, ground truths (red) are shown alongside the predictions (green) only in the last row. Each image is shown side-by-side with its corresponding top-view image, where each square is 1x1 m.
  • Figure 5: Qualitative examples of Weak Cube R-CNN predictions on the KITTI test set. KITTI predictions are shown in green and ground truth in red. Each image is shown with its corresponding top-view image, where each square is 1x1 m.
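
The ground estimation pipeline described in the Figure 3 caption (depth map, ground region, point cloud, plane-RANSAC normal) can be sketched roughly as follows. This is a hedged illustration rather than the authors' implementation: the function names `backproject` and `ransac_plane_normal`, the pinhole intrinsics `fx, fy, cx, cy`, the iteration count, and the inlier threshold are all assumptions chosen for the example.

```python
import numpy as np


def backproject(depth, fx, fy, cx, cy, mask=None):
    """Turn a metric depth map (H, W) into an (N, 3) point cloud (pinhole model).

    `mask` is an optional boolean (H, W) array, e.g. a ground segmentation.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    points = np.stack([x, y, depth], axis=-1)
    if mask is None:
        mask = np.ones_like(depth, dtype=bool)
    keep = mask & (depth > 0)  # drop pixels without a valid depth
    return points[keep]


def ransac_plane_normal(points, n_iters=200, threshold=0.02, seed=0):
    """Estimate a unit plane normal with a basic RANSAC loop.

    `threshold` is the inlier distance in the same units as `points`.
    The sign of the returned normal is arbitrary.
    """
    rng = np.random.default_rng(seed)
    best_normal, best_count = None, -1
    for _ in range(n_iters):
        sample = points[rng.choice(len(points), size=3, replace=False)]
        normal = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(normal)
        if norm < 1e-9:  # (near-)collinear sample, no unique plane
            continue
        normal = normal / norm
        distances = np.abs((points - sample[0]) @ normal)
        count = int((distances < threshold).sum())
        if count > best_count:
            best_normal, best_count = normal, count
    return best_normal
```

In use, `mask` would be the ground region of the depth map, so that `ransac_plane_normal(backproject(depth, fx, fy, cx, cy, mask))` returns a candidate ground normal. A refinement step, such as a least-squares fit over the inliers, is omitted here for brevity.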