VSRD: Instance-Aware Volumetric Silhouette Rendering for Weakly Supervised 3D Object Detection

Zihua Liu; Hiroki Sakuma; Masatoshi Okutomi

TL;DR

VSRD introduces a fully 2D-supervised pipeline for weakly supervised 3D object detection by learning from multi-view silhouette cues. It represents object surfaces as signed distance fields, decomposing each instance into a cuboid SDF plus a residual distance field learned via a hypernetwork, and renders instance-aware silhouettes for end-to-end optimization of 3D bounding boxes. A bipartite Hungarian matching scheme aligns 3D pseudo labels with 2D ground-truth signals, and per-instance confidence weights are used to improve downstream detector training. Experiments on KITTI-360 show that VSRD outperforms existing weakly supervised methods in both pseudo-label quality and monocular 3D detection accuracy, with strong performance in semi-supervised transfer scenarios, highlighting its practical potential for scalable 3D perception without explicit 3D supervision.

Abstract

Monocular 3D object detection is a significant challenge in 3D scene understanding because monocular depth estimation is inherently ill-posed. Existing methods rely heavily on supervised learning with abundant 3D labels, typically obtained through expensive and labor-intensive annotation of LiDAR point clouds. To tackle this problem, we propose a novel weakly supervised 3D object detection framework named VSRD (Volumetric Silhouette Rendering for Detection) that trains 3D object detectors using only weak 2D supervision, without any 3D supervision. VSRD consists of multi-view 3D auto-labeling and subsequent training of monocular 3D object detectors using the pseudo labels generated in the auto-labeling stage. In the auto-labeling stage, we represent the surface of each instance as a signed distance field (SDF) and render its silhouette as an instance mask through our proposed instance-aware volumetric silhouette rendering. To directly optimize the 3D bounding boxes through rendering, we decompose the SDF of each instance into the SDF of a cuboid and the residual distance field (RDF) that represents the residual from the cuboid. This mechanism enables us to optimize the 3D bounding boxes in an end-to-end manner by comparing the rendered instance masks with the ground truth instance masks. The optimized 3D bounding boxes serve as effective training data for 3D object detection. We conduct extensive experiments on the KITTI-360 dataset, demonstrating that our method outperforms existing weakly supervised 3D object detection methods. The code is available at https://github.com/skmhrk1209/VSRD.
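The decomposition described in the abstract can be illustrated with a minimal sketch. It is not taken from the VSRD code base: the function names `cuboid_sdf` and `instance_sdf` are hypothetical, the cuboid is assumed to be axis-aligned (the box rotation is ignored), the RDF is assumed to be added to the cuboid SDF, and `residual_fn` is a placeholder for the residual that the paper learns with a hypernetwork.

```python
import math


def cuboid_sdf(point, center, half_size):
    """Signed distance from `point` to an axis-aligned cuboid (negative inside).

    Assumption: the box is axis-aligned; a rotated box would first transform
    `point` into the box frame.
    """
    # Per-axis offset from the box faces, measured in the box-centered frame.
    q = [abs(p - c) - h for p, c, h in zip(point, center, half_size)]
    # Euclidean distance to the surface for points outside the box.
    outside = math.sqrt(sum(max(v, 0.0) ** 2 for v in q))
    # Negative distance to the nearest face for points inside the box.
    inside = min(max(q), 0.0)
    return outside + inside


def instance_sdf(point, center, half_size, residual_fn):
    """Cuboid SDF plus a residual term.

    `residual_fn` is a hypothetical stand-in for the learned residual
    distance field (RDF); here it is any callable mapping a point to a float.
    """
    return cuboid_sdf(point, center, half_size) + residual_fn(point)


# Usage: a unit-half-size box at the origin with a constant illustrative residual.
box_center, box_half_size = (0.0, 0.0, 0.0), (1.0, 1.0, 1.0)
d_box = cuboid_sdf((2.0, 0.0, 0.0), box_center, box_half_size)
d_inst = instance_sdf((2.0, 0.0, 0.0), box_center, box_half_size, lambda p: 0.25)
```

Because the box parameters enter the instance SDF directly, a loss on rendered silhouettes can in principle propagate gradients back to the box, which is the property the abstract relies on for end-to-end optimization.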

Paper Structure (51 sections, 23 equations, 16 figures, 7 tables)

Introduction
Related Work
Monocular 3D Object Detection
Weakly Supervised 3D Object Detection
3D Object Detection with Neural Fields
Method
Multi-View 3D Auto-Labeling
Preliminaries
SDF-based Volumetric Rendering
Problem Definition
3D Bounding Box Represented as an SDF
Residual Distance Field
Instance-Aware Volumetric Silhouette Rendering
Loss Functions
Multi-View Projection Loss
...and 36 more sections

Figures (16)

Figure 1: Illustration of our proposed weakly supervised 3D object detection framework, which consists of multi-view 3D auto-labeling and subsequent training of monocular 3D object detectors using the pseudo labels generated in the auto-labeling stage.
Figure 2: Illustration of the pipeline of our proposed multi-view 3D auto-labeling. We represent the surface of each instance as an SDF and decompose it into the SDF of a 3D bounding box and the residual distance field (RDF), which is learned via a hypernetwork. The composed instance SDF is used to render the silhouette of the instance through our proposed instance-aware volumetric silhouette rendering. All the 3D bounding boxes are optimized based on the loss between the rendered and ground truth instance masks.
Figure 3: Illustration of our proposed instance-aware volumetric silhouette rendering. The instance labels are averaged for each sampled point along a ray based on the signed distance to each instance. The averaged instance labels are then integrated along the ray using the SDF-based volume rendering formulation of NeuS.
Figure 4: Static Scene
Figure 5: Dynamic Scene
...and 11 more figures
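
The rendering step summarized in the Figure 3 caption can be sketched as follows. This is an illustrative approximation under stated assumptions, not the authors' implementation: the function names, the softmin weighting with a `temperature`, the use of the minimum over instance SDFs as the scene SDF, and the `sharpness` value are all hypothetical choices. The per-segment opacity follows the discrete NeuS form, in which alpha is the relative drop of a sigmoid of the SDF between consecutive samples.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def soft_instance_labels(sdfs, temperature=0.1):
    """Softmin over per-instance signed distances (assumed weighting scheme).

    The instance with the smallest signed distance gets the largest weight.
    """
    m = min(sdfs)
    exps = [math.exp(-(d - m) / temperature) for d in sdfs]
    total = sum(exps)
    return [e / total for e in exps]


def render_silhouettes(sdf_samples, sharpness=20.0, temperature=0.1):
    """Render one silhouette value per instance along a single ray.

    sdf_samples[i][k] is the signed distance of the i-th ray sample
    (ordered near to far) to instance k.
    """
    num_instances = len(sdf_samples[0])
    silhouettes = [0.0] * num_instances
    transmittance = 1.0
    for cur, nxt in zip(sdf_samples[:-1], sdf_samples[1:]):
        # Assumption: the scene SDF is the minimum over instance SDFs.
        phi_cur = sigmoid(sharpness * min(cur))
        phi_nxt = sigmoid(sharpness * min(nxt))
        # NeuS-style discrete opacity for the segment between two samples.
        alpha = max((phi_cur - phi_nxt) / max(phi_cur, 1e-12), 0.0)
        weight = transmittance * alpha
        labels = soft_instance_labels(cur, temperature)
        for k in range(num_instances):
            silhouettes[k] += weight * labels[k]
        transmittance *= 1.0 - alpha
    return silhouettes


# Usage: a ray that enters instance 0 and stays far from instance 1.
hit_ray = [[1.0, 3.0], [0.5, 3.0], [0.0, 3.0], [-0.5, 3.0], [-1.0, 3.0]]
hit = render_silhouettes(hit_ray)

# Usage: a ray that stays outside both instances.
miss_ray = [[2.0, 3.0]] * 5
miss = render_silhouettes(miss_ray)
```

In this sketch the silhouette of instance 0 on the first ray approaches 1 while that of instance 1 stays near 0, and both stay at 0 on the second ray. Comparing such rendered values with ground truth instance masks is what would drive the box optimization.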
