
VSRD++: Autolabeling for 3D Object Detection via Instance-Aware Volumetric Silhouette Rendering

Zihua Liu, Hiroki Sakuma, Masatoshi Okutomi

TL;DR

VSRD++ addresses the high annotation cost of monocular 3D object detection with a two-stage weakly supervised pipeline that autolabels 3D bounding boxes from multi-view 2D supervision. It introduces an instance-aware volumetric silhouette rendering framework built on signed distance fields (SDFs): each instance SDF is decomposed into a cuboid SDF plus a residual distance field, and velocity is modeled to handle dynamic objects. A 3D attribute initialization pipeline and a confidence-based loss weighting scheme improve pseudo-label quality, enabling effective training of monocular detectors. Experiments on KITTI-360 show substantial improvements over prior weakly supervised methods, especially in dynamic scenes, without using 3D ground truth.

Abstract

Monocular 3D object detection is a fundamental yet challenging task in 3D scene understanding. Existing approaches depend heavily on supervised learning with extensive 3D annotations, which are often acquired from LiDAR point clouds through labor-intensive labeling processes. To tackle this problem, we propose VSRD++, a novel weakly supervised framework for monocular 3D object detection that eliminates the reliance on 3D annotations and leverages neural-field-based volumetric rendering with weak 2D supervision. VSRD++ consists of a two-stage pipeline: multi-view 3D autolabeling and subsequent monocular 3D detector training. In the multi-view autolabeling stage, object surfaces are represented as signed distance fields (SDFs) and rendered as instance masks via the proposed instance-aware volumetric silhouette rendering. To optimize 3D bounding boxes, we decompose each instance's SDF into a cuboid SDF and a residual distance field (RDF) that captures deviations from the cuboid. To address the geometry inconsistency commonly observed when volume rendering methods are applied to dynamic objects, we model dynamic objects by incorporating velocity into the bounding box attributes and by assigning a confidence to each pseudo-label. We also employ a 3D attribute initialization module to initialize the dynamic bounding box parameters. In the monocular 3D object detection stage, the optimized 3D bounding boxes serve as pseudo-labels for training monocular 3D object detectors. Extensive experiments on the KITTI-360 dataset demonstrate that VSRD++ significantly outperforms existing weakly supervised approaches for monocular 3D object detection on both static and dynamic scenes. Code is available at https://github.com/Magicboomliu/VSRD_plus_plus
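To make the decomposition described in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It evaluates the signed distance to a yaw-rotated cuboid, shifts the box center by velocity times elapsed time, and adds a residual term. The function names, the choice of the y-axis as the yaw axis, and `residual_fn` (a stand-in for the learned residual distance field) are all assumptions made for illustration.

```python
import math

def cuboid_sdf(point, center, half_dims, yaw):
    """Signed distance from a 3D point to a box rotated by `yaw` about the y-axis."""
    dx, dy, dz = (p - c for p, c in zip(point, center))
    cos_y, sin_y = math.cos(yaw), math.sin(yaw)
    # Rotate the offset into the box frame (inverse of the box-to-world yaw rotation).
    local = (cos_y * dx - sin_y * dz, dy, sin_y * dx + cos_y * dz)
    # Per-axis distance to the box faces; negative means inside along that axis.
    q = [abs(l) - h for l, h in zip(local, half_dims)]
    outside = math.sqrt(sum(max(v, 0.0) ** 2 for v in q))
    inside = min(max(q), 0.0)
    return outside + inside

def instance_sdf(point, center, half_dims, yaw, velocity, dt, residual_fn):
    """Box SDF translated by velocity * dt, plus a residual (hypothetical callable)."""
    moved = tuple(c + v * dt for c, v in zip(center, velocity))
    offset = tuple(p - m for p, m in zip(point, moved))
    return cuboid_sdf(point, moved, half_dims, yaw) + residual_fn(offset)
```

With a residual that always returns zero, `instance_sdf` reduces to the SDF of a moving box, which is why the box parameters (and the velocity) stay directly readable from the optimized field in a setup like this.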

Paper Structure

This paper contains 38 sections, 17 equations, 12 figures, 8 tables, 1 algorithm.

Figures (12)

  • Figure 1: Illustration of our proposed two-stage weakly supervised 3D object detection framework, which consists of multi-view 3D auto-labeling and subsequent training of monocular 3D object detectors using the pseudo labels generated in the auto-labeling stage.
  • Figure 2: Illustration of the pipeline for our proposed multi-view 3D auto-labeling framework, VSRD++. Each instance surface is represented as a signed distance field (SDF) and optimized through a Deformable SDF Learner. To initialize deformable 3D bounding box attributes (e.g., location, orientation, velocity), we employ the 3D attribute initialization module. The time-variant SDF modeling incorporates velocity to decouple instance SDF dynamics. The composed instance SDF enables silhouette rendering via Instance-Aware Volumetric Silhouette Rendering. All 3D bounding boxes are optimized by minimizing the loss between the rendered and ground truth instance masks.
  • Figure 3: Illustration of the instance SDF decomposition, where we decouple the surface of each car into the cuboid box SDF, represented by the 3D bounding box, and the spatial residual from the cuboid. The blue arrow denotes the cuboid box SDF parameterized by the 3D bounding box, the red arrow denotes the residual distance field (RDF), and the purple arrow denotes the instance SDF.
  • Figure 4: Illustration of our proposed instance-aware volumetric silhouette rendering. The instance labels are averaged for each sampled point along a ray based on the signed distance to each instance. The averaged instance labels are then integrated along the ray following the SDF-based volume rendering formulation of NeuS.
  • Figure 5: Pipeline of Time-Variant SDF Based on Velocity-Incorporated Bounding Boxes. We represent the surface of each instance as an SDF and decompose it into the SDF of a 3D bounding box and the residual distance field (RDF), which is learned via a hypernetwork. For each time $t$, we use the relative time duration $\Delta t$ and the velocity $v_{n}$ to obtain the time-dependent box residual, which adaptively adjusts the box SDF using the dynamic mask $\tilde{M}_{n}$.
  • ...and 7 more figures
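
The rendering step summarized in the Figure 4 caption can be sketched for a single ray as follows. This is an illustrative approximation under stated assumptions, not the paper's code. The scene SDF is taken as the minimum over instance SDFs, the per-sample soft instance labels use a softmin over signed distances with a hypothetical `temperature`, and the opacity uses the discrete NeuS form with a logistic function of hypothetical `sharpness`.

```python
import math

def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def render_silhouettes(instance_sdfs, sharpness=20.0, temperature=0.1):
    """instance_sdfs[n][i]: signed distance of sample i (near to far) to instance n.
    Returns one soft silhouette value per instance for this ray."""
    num_instances = len(instance_sdfs)
    num_samples = len(instance_sdfs[0])
    silhouettes = [0.0] * num_instances
    transmittance = 1.0
    for i in range(num_samples - 1):
        dists = [sdf[i] for sdf in instance_sdfs]
        d_min = min(dists)
        # Scene SDF as the union (minimum) of the instance SDFs (assumption).
        cdf_here = sigmoid(sharpness * d_min)
        cdf_next = sigmoid(sharpness * min(sdf[i + 1] for sdf in instance_sdfs))
        # NeuS-style discrete opacity between consecutive samples.
        alpha = min(max((cdf_here - cdf_next) / (cdf_here + 1e-8), 0.0), 1.0)
        # Soft instance labels: softmin over signed distances (assumed weighting).
        scores = [math.exp(-(d - d_min) / temperature) for d in dists]
        total = sum(scores)
        weight = transmittance * alpha
        for n in range(num_instances):
            silhouettes[n] += weight * scores[n] / total
        transmittance *= 1.0 - alpha
    return silhouettes
```

For a ray that passes through one object before another, the nearer instance receives nearly all of the rendered mass and the occluded instance receives almost none. A ray that stays outside every instance renders zero for all of them.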