VSRD++: Autolabeling for 3D Object Detection via Instance-Aware Volumetric Silhouette Rendering

Zihua Liu; Hiroki Sakuma; Masatoshi Okutomi

TL;DR

VSRD++ addresses the high annotation cost of monocular 3D object detection with a two-stage weakly supervised pipeline that autolabels 3D bounding boxes from multi-view 2D supervision. It introduces an instance-aware volumetric silhouette rendering framework built on signed distance fields (SDFs): each instance's SDF is decomposed into a cuboid SDF plus a residual distance field, and velocity is modeled to handle dynamic objects. A 3D attribute initialization pipeline and a confidence-based loss weighting scheme improve pseudo-label quality, enabling effective training of monocular detectors. Experiments on KITTI-360 show substantial improvements over prior weakly supervised methods, especially in dynamic scenes, indicating practical potential for scalable 3D perception without 3D ground truth.

Abstract

Monocular 3D object detection is a fundamental yet challenging task in 3D scene understanding. Existing approaches depend heavily on supervised learning with extensive 3D annotations, which are often acquired from LiDAR point clouds through labor-intensive labeling. To tackle this problem, we propose VSRD++, a novel weakly supervised framework for monocular 3D object detection that eliminates the reliance on 3D annotations and leverages neural-field-based volumetric rendering with weak 2D supervision. VSRD++ consists of a two-stage pipeline: multi-view 3D autolabeling followed by monocular 3D detector training. In the multi-view autolabeling stage, object surfaces are represented as signed distance fields (SDFs) and rendered as instance masks via the proposed instance-aware volumetric silhouette rendering. To optimize 3D bounding boxes, we decompose each instance's SDF into a cuboid SDF and a residual distance field (RDF) that captures deviations from the cuboid. To address the geometric inconsistency commonly observed when volume rendering methods are applied to dynamic objects, we model dynamic objects by incorporating velocity into the bounding box attributes and by assigning a confidence to each pseudo-label. We also employ a 3D attribute initialization module to initialize the dynamic bounding box parameters. In the monocular 3D object detection stage, the optimized 3D bounding boxes serve as pseudo-labels for training monocular 3D object detectors. Extensive experiments on the KITTI-360 dataset demonstrate that VSRD++ significantly outperforms existing weakly supervised approaches for monocular 3D object detection on both static and dynamic scenes. Code is available at https://github.com/Magicboomliu/VSRD_plus_plus
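
The decomposition described above can be illustrated with a minimal sketch. It is not taken from the released code: the function names (`cuboid_sdf`, `instance_sdf`), the dictionary layout of a box, the choice of the y-axis as the vertical rotation axis, the constant-velocity motion model, and the form of the residual callable are all assumptions made for illustration. The cuboid term uses the standard closed-form signed distance to a box.

```python
import math

def cuboid_sdf(point, center, half_size, yaw=0.0):
    """Signed distance from `point` to a cuboid (negative inside).

    Assumption: the box is rotated by `yaw` about the y-axis, taken here
    as the vertical axis (as in camera coordinates).
    """
    dx, dy, dz = (p - c for p, c in zip(point, center))
    # Rotate the offset into the box frame (inverse yaw rotation).
    c, s = math.cos(yaw), math.sin(yaw)
    local = (c * dx - s * dz, dy, s * dx + c * dz)
    # Standard box SDF: per-axis distance beyond the half extents.
    q = [abs(v) - h for v, h in zip(local, half_size)]
    outside = math.sqrt(sum(max(v, 0.0) ** 2 for v in q))
    inside = min(max(q), 0.0)
    return outside + inside

def instance_sdf(point, t, box, residual=None):
    """Cuboid SDF plus an optional residual distance field (RDF).

    Assumptions: `box` is a hypothetical dict with keys "center",
    "half_size", "yaw", and "velocity"; the box centre moves with
    constant velocity over time `t`; `residual` is any callable mapping
    a 3D point to a scalar correction (a learned network in practice).
    """
    center = [c + v * t for c, v in zip(box["center"], box["velocity"])]
    d = cuboid_sdf(point, center, box["half_size"], box["yaw"])
    if residual is not None:
        d += residual(point)
    return d

# Hypothetical box moving along x at 1 unit per time step.
box = {"center": (0.0, 0.0, 0.0), "half_size": (1.0, 1.0, 1.0),
       "yaw": 0.0, "velocity": (1.0, 0.0, 0.0)}
d_static = instance_sdf((0.0, 0.0, 0.0), 0.0, box)
d_moved = instance_sdf((2.0, 0.0, 0.0), 2.0, box)
```

In this sketch the box parameters (centre, size, yaw, velocity) carry the 3D bounding box being optimized, while the residual only accounts for how the true surface deviates from the cuboid; how the residual is parameterized and how the SDF is converted into rendered silhouettes are left out, since the abstract does not specify them.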
