Weakly Supervised 3D Object Detection via Multi-Level Visual Guidance

Kuan-Chih Huang; Yi-Hsuan Tsai; Ming-Hsuan Yang

TL;DR

This work tackles the high annotation cost of 3D object detection by proposing VG-W3D, a multi-level visual guidance framework that learns a 3D detector from 2D annotations alone. It integrates three visual cues (feature-level objectness alignment, output-level 2D–3D box overlap via a GIoU-based loss, and training-level image-guided pseudo-label refinement) alongside a frustum-based proposal generator and a frozen 2D detector. The method achieves competitive results on KITTI without any 3D labels, outperforming several weakly supervised baselines and rivaling a method that uses 3D annotations for 500 frames. By removing the need for 3D labels, the approach lowers annotation effort for 3D perception and offers a path for integrating 2D visual signals into 3D understanding. The authors state that code will be released publicly.

Abstract

Weakly supervised 3D object detection aims to learn a 3D detector with lower annotation cost, e.g., from 2D labels. Unlike prior work, which still relies on a few accurate 3D annotations, we propose a framework that leverages constraints between the 2D and 3D domains without requiring any 3D labels. Specifically, we employ visual data from three perspectives to establish connections between the 2D and 3D domains. First, we design a feature-level constraint to align LiDAR and image features based on object-aware regions. Second, an output-level constraint enforces the overlap between 2D box estimations and projected 3D box estimations. Finally, a training-level constraint produces accurate and consistent 3D pseudo-labels that align with the visual data. We conduct extensive experiments on the KITTI dataset to validate the effectiveness of the three proposed constraints. Without using any 3D labels, our method achieves favorable performance against state-of-the-art approaches and is competitive with a method that uses 3D annotations for 500 frames. Code will be made publicly available at https://github.com/kuanchihhuang/VG-W3D.
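
The output-level constraint described above scores the overlap between a 2D box and the image-plane projection of a 3D box, and the TL;DR names a GIoU-based loss for this purpose. The following is a minimal sketch of the standard GIoU computation for two axis-aligned boxes, not the authors' implementation. The function names `giou` and `giou_loss` and the `(x1, y1, x2, y2)` box format are assumptions made for illustration.

```python
def giou(box_a, box_b):
    """Generalized IoU of two axis-aligned boxes in (x1, y1, x2, y2) format.

    The box format is an assumption of this sketch. Returns a value in (-1, 1].
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    if ax2 <= ax1 or ay2 <= ay1 or bx2 <= bx1 or by2 <= by1:
        raise ValueError("boxes must have positive width and height")

    # Intersection rectangle (zero area when the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = area_a + area_b - inter

    # Smallest axis-aligned box that encloses both inputs.
    enclose = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))

    iou = inter / union
    # GIoU subtracts the fraction of the enclosing box not covered by the union.
    return iou - (enclose - union) / enclose


def giou_loss(pred_box, target_box):
    """Loss form commonly used with GIoU: 1 - GIoU, so a perfect match gives 0."""
    return 1.0 - giou(pred_box, target_box)


if __name__ == "__main__":
    projected = (0.0, 0.0, 2.0, 2.0)   # hypothetical projected 3D box
    annotated = (1.0, 1.0, 3.0, 3.0)   # hypothetical 2D annotation
    print(giou_loss(projected, annotated))
```

Identical boxes give a loss of 0, while boxes separated by a gap give a loss above 1. Unlike plain IoU, the loss therefore still changes with distance when the boxes do not overlap, which is the usual motivation for choosing GIoU as a regression signal.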

Paper Structure (14 sections, 8 equations, 5 figures, 9 tables, 1 algorithm)

This paper contains 14 sections, 8 equations, 5 figures, 9 tables, 1 algorithm.

Introduction
Related Work
Proposed Approach
Framework Overview
Feature-Level Visual Guidance
Output-Level Visual Guidance
Training-Level Visual Guidance
Training Objectives
Experiments
Experimental Setup
Main Results
Ablation Study and Analysis
Qualitative Results
Conclusions

Figures (5)

Figure 1: Multi-level visual guidance for weakly supervised 3D object detection. We propose a framework to learn a 3D object detector from weak labels, e.g., 2D bounding boxes on the image plane, using three different perspectives: feature-, output-, and training-level constraints. Feature-level guidance provides object-aware signals for point feature learning. Output-level guidance imposes 2D-3D box constraints that push the model to generate reasonable box predictions. Training-level guidance incorporates the confidence of 2D boxes into the pseudo-labeling procedure to ensure score consistency between the 2D and 3D domains.
Figure 2: Overall framework of the proposed VG-W3D. We utilize a non-learning method, FGR [wei2021fgr], to identify frustum point clouds of objects, followed by a heuristic algorithm to estimate the initial noisy bounding boxes (top right). In the image branch, we train an object detector based on 2D annotations to predict image features $\mathbf{F}_{\mathcal{I}}$ and 2D bounding boxes $\mathbf{B}_{\mathcal{I}}$ along with their confidence scores $\sigma_{\mathcal{I}}$, which serve as visual guidance for training the 3D detector. Then, a PointNet-based 3D object detector is employed to extract point features $\mathbf{F}_{\mathcal{P}}$ and output 3D bounding boxes $\mathbf{B}_{\mathcal{P}}$ along with confidence scores $\sigma_{\mathcal{P}}$. Our approach incorporates three levels of visual guidance for 3D training, namely feature-level (see the section "Feature-Level Visual Guidance"), output-level (see "Output-Level Visual Guidance"), and training-level (see "Training-Level Visual Guidance"). Note that the image branch is frozen during the training stage and is discarded during the inference stage.
Figure 3: Feature-level visual guidance. Once we acquire the projected point features $\mathbf{F}_{\mathcal{P'}}$ and the image features $\mathbf{F}_{\mathcal{I}}$, we use the object foreground map $\mathbf{S}$ to supervise objectness. To generate $\mathbf{S}$, the pretrained unsupervised instance segmentation module $\mathcal{M}_{\mathcal{S}}$ is applied to extract the object foreground for each annotated 2D bounding box. In addition, an image-guided KL divergence loss is applied to learn the distribution from the image features.
Figure 4: Output-level visual guidance. Because the projected 3D bounding box should overlap with the corresponding 2D bounding box, 2D boxes can serve as supervision signals without 3D annotations. We use a GIoU loss to constrain the projection of the learned 3D box onto the image plane to match the ground-truth 2D object box.
Figure 5: Qualitative visualizations on the KITTI validation set. Each result shows the predictions in the LiDAR view (left) and the bird's-eye view (BEV, right). The purple boxes in the LiDAR data and on the BEV plane indicate the predictions from our VG-W3D. The green and pink boxes on the BEV plane are the ground truth and the predictions from the FGR [wei2021fgr] baseline (without the proposed visual guidance), respectively. Our predictions align more closely with the ground truth, while FGR often fails to enclose the point cloud tightly.
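
The output-level guidance in Figure 4 compares a 2D box with the image-plane projection of a predicted 3D box. One common way to obtain such a projection is to project the eight corners of the 3D box with a pinhole camera model and take their enclosing rectangle. The sketch below illustrates that idea under stated assumptions: a camera frame with x right, y down, z forward, a box parameterized by its geometric center, size `(length, height, width)`, and a yaw angle about the y axis, and hypothetical intrinsics `fx`, `fy`, `u0`, `v0`. The function names are illustrative and are not taken from the paper or its code.

```python
import math


def box3d_corners(center, size, yaw):
    """Return the eight corners of a 3D box in camera coordinates.

    Assumptions of this sketch: `center` is the geometric center (x, y, z),
    `size` is (length, height, width), and `yaw` rotates about the y axis.
    """
    cx, cy, cz = center
    length, height, width = size
    cos_y, sin_y = math.cos(yaw), math.sin(yaw)
    corners = []
    for dx in (-length / 2.0, length / 2.0):
        for dy in (-height / 2.0, height / 2.0):
            for dz in (-width / 2.0, width / 2.0):
                # Rotate the offset about the y axis, then translate.
                rx = cos_y * dx + sin_y * dz
                rz = -sin_y * dx + cos_y * dz
                corners.append((cx + rx, cy + dy, cz + rz))
    return corners


def project_to_box2d(corners, fx, fy, u0, v0):
    """Project 3D corners with a pinhole model and return (x1, y1, x2, y2)."""
    us, vs = [], []
    for x, y, z in corners:
        if z <= 0.0:
            raise ValueError("corner lies behind the camera")
        us.append(fx * x / z + u0)
        vs.append(fy * y / z + v0)
    # The enclosing rectangle of the projected corners is the projected box.
    return (min(us), min(vs), max(us), max(vs))


if __name__ == "__main__":
    # Hypothetical box 10 m in front of the camera and hypothetical intrinsics.
    corners = box3d_corners(center=(0.0, 0.0, 10.0), size=(2.0, 2.0, 2.0), yaw=0.0)
    print(project_to_box2d(corners, fx=100.0, fy=100.0, u0=0.0, v0=0.0))
```

The resulting rectangle can then be compared with an annotated 2D box using an overlap measure such as GIoU. Real datasets such as KITTI supply a full projection matrix and their own box conventions, so the parameterization here would need to be adapted before use.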
