Segment, Lift and Fit: Automatic 3D Shape Labeling from 2D Prompts

Jianhao Li, Tianyu Sun, Zhongdao Wang, Enze Xie, Bailan Feng, Hongbo Zhang, Ze Yuan, Ke Xu, Jiaheng Liu, Ping Luo

TL;DR

The paper tackles automatic 3D labeling from 2D prompts in autonomous driving by introducing SLF, a training-free pipeline with three stages: it segments each 2D prompt into an instance mask using SAM, lifts the mask to 3D using a PCA-based shape prior encoded as a signed distance function (SDF), and refines pose and shape by gradient descent through a differentiable renderer. The optimization minimizes a composite energy over mask alignment, LiDAR point alignment, and ground alignment to recover a 3D shape and heading for each object, parameterized by a pose $\boldsymbol{p}=(x,y,z,\theta)\in\mathbb{R}^4$ and a shape latent code $\boldsymbol{s}\in\mathbb{R}^d$ ($d=5$). The PCA-based shape prior, built from Apollo-Car3D models, provides a compact, controllable latent space for 3D vehicle shapes, enabling detailed occupancy and shape prediction without dataset-specific training. Experiments on KITTI show SLF reaching nearly 90% AP at an IoU threshold of 0.5 for BEV/3D detection and producing pseudo-labels that train detectors almost as well as ground truth. Cross-dataset evaluation on nuScenes confirms strong generalization, and qualitative occupancy results indicate potential for dynamic-object occupancy annotation. The main limitation is that the shape prior is category-specific, so new classes require their own CAD models; broadening category coverage and shape priors is left to future work.
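
As a rough sketch of the composite energy described above, the three alignment terms can be written as a weighted sum that is minimized over pose and shape. The weights $\lambda_m$, $\lambda_l$, $\lambda_g$ and the term names are notation assumed here for illustration; this summary does not give the paper's exact formulation.

$$
E(\boldsymbol{p},\boldsymbol{s}) = \lambda_m\,E_{\text{mask}}(\boldsymbol{p},\boldsymbol{s}) + \lambda_l\,E_{\text{lidar}}(\boldsymbol{p},\boldsymbol{s}) + \lambda_g\,E_{\text{ground}}(\boldsymbol{p},\boldsymbol{s}),
\qquad
(\hat{\boldsymbol{p}},\hat{\boldsymbol{s}}) = \arg\min_{\boldsymbol{p},\,\boldsymbol{s}} E(\boldsymbol{p},\boldsymbol{s})
$$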

Abstract

This paper proposes an algorithm for automatically labeling 3D objects from 2D point or box prompts, with a focus on applications in autonomous driving. Unlike prior work, our auto-labeler predicts 3D shapes instead of bounding boxes and does not require training on a specific dataset. We propose a Segment, Lift, and Fit (SLF) paradigm to achieve this goal. First, we segment high-quality instance masks from the prompts using the Segment Anything Model (SAM), which reduces the remaining problem to predicting 3D shapes from given 2D masks. This problem is ill-posed, since multiple 3D shapes can project to an identical mask. To tackle this issue, we then lift 2D masks to 3D forms and employ gradient descent to adjust their poses and shapes until the projections fit the masks and the surfaces conform to surrounding LiDAR points. Notably, since we do not train on a specific dataset, the SLF auto-labeler does not overfit to biased annotation patterns in the training set as other methods do, which improves generalization across datasets. Experimental results on the KITTI dataset demonstrate that the SLF auto-labeler produces high-quality bounding box annotations, achieving an AP@0.5 IoU of nearly 90%. Detectors trained with the generated pseudo-labels perform nearly as well as those trained with actual ground-truth annotations. Furthermore, the SLF auto-labeler shows promising results in detailed shape predictions, providing a potential alternative for the occupancy annotation of dynamic objects.
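
The "fit" step can be illustrated with a minimal sketch of the LiDAR alignment idea alone: adjust a pose by gradient descent until observed points lie on the zero level set of a signed distance function. Everything below is an assumption made for illustration: a 2D bird's-eye-view box SDF stands in for the learned shape prior, finite differences stand in for the differentiable renderer and automatic differentiation, and the function names (`box_sdf`, `fit_pose`, and so on) are hypothetical rather than taken from the paper.

```python
import numpy as np

def box_sdf(points, half_extents):
    """Signed distance from 2D object-frame points to an axis-aligned box."""
    q = np.abs(points) - half_extents
    outside = np.linalg.norm(np.maximum(q, 0.0), axis=1)
    inside = np.minimum(np.max(q, axis=1), 0.0)
    return outside + inside

def to_object_frame(points, pose):
    """Map world-frame BEV points into the object frame for pose (x, y, theta)."""
    x, y, theta = pose
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s], [s, c]])  # object -> world rotation
    # Row-vector form of R^T (w - t).
    return (points - np.array([x, y])) @ rot

def point_energy(pose, points, half_extents):
    """Mean squared SDF value at the points (zero when all lie on the surface)."""
    d = box_sdf(to_object_frame(points, pose), half_extents)
    return float(np.mean(d ** 2))

def numerical_grad(f, x, eps=1e-5):
    """Central finite-difference gradient (stand-in for autodiff)."""
    g = np.zeros_like(x)
    for i in range(x.size):
        delta = np.zeros_like(x)
        delta[i] = eps
        g[i] = (f(x + delta) - f(x - delta)) / (2 * eps)
    return g

def fit_pose(points, half_extents, init_pose, iters=200, lr=0.5):
    """Gradient descent with step halving; only energy-decreasing steps are accepted."""
    pose = np.asarray(init_pose, dtype=float)
    f = lambda p: point_energy(p, points, half_extents)
    energy = f(pose)
    for _ in range(iters):
        grad = numerical_grad(f, pose)
        step = lr
        while step > 1e-8:
            candidate = pose - step * grad
            cand_energy = f(candidate)
            if cand_energy < energy:
                pose, energy = candidate, cand_energy
                break
            step *= 0.5
    return pose, energy

# Synthetic "LiDAR" points on two visible faces of a car-sized box (assumed sizes).
half_extents = np.array([2.0, 0.9])
true_pose = np.array([10.0, 3.0, 0.3])
t = np.linspace(-1.0, 1.0, 20)
face_a = np.stack([-half_extents[0] * np.ones_like(t), t * half_extents[1]], axis=1)
face_b = np.stack([t * half_extents[0], -half_extents[1] * np.ones_like(t)], axis=1)
obj_pts = np.vstack([face_a, face_b])
c, s = np.cos(true_pose[2]), np.sin(true_pose[2])
world_pts = obj_pts @ np.array([[c, -s], [s, c]]).T + true_pose[:2]

init_pose = true_pose + np.array([0.5, -0.4, 0.15])  # perturbed initial guess
initial_energy = point_energy(init_pose, world_pts, half_extents)
fitted_pose, final_energy = fit_pose(world_pts, half_extents, init_pose)
print("initial energy:", initial_energy)
print("final energy:  ", final_energy)
print("fitted pose:   ", fitted_pose)
```

Because the points cover only two faces, this term alone can leave some pose ambiguity, which is consistent with the abstract's point that the projection must also fit the 2D mask.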

Paper Structure

This paper contains 15 sections, 4 equations, 13 figures, and 3 tables.

Figures (13)

  • Figure 1: Comparison between human annotation and auto-labeler annotation using SLF.
  • Figure 2: (a) We aim to recover the 3D shape and pose of an object of interest, given a 2D box or point as the prompt. This problem is highly ill-posed. (b) In SLF, we first segment the 2D mask of the target object, then lift the mask to a 3D form and optimize its shape and pose by gradient descent until the 3D object fits the mask and the LiDAR points.
  • Figure 3: We perform PCA on a diverse collection of 3D shapes to obtain the shape latent code. Left: Example of the basis 3D shapes in the collection. Right: Interpolation along the latent space of the shape code (each row) shows smooth shape variations.
  • Figure 4: Generation of occlusion map $O$ for the mask alignment objective.
  • Figure 5: Qualitative results of estimated shapes. From left to right: projected mask of the initial 3D model; projected mask of the optimized 3D model; mean shape and surrounding point cloud; optimized 3D shape and surrounding point cloud.
  • ...and 8 more figures
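
The PCA shape latent code described in the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the shape collection is a set of synthetic box SDFs sampled on a coarse grid (standing in for SDFs of the CAD models), PCA is computed with a plain SVD, and the helper names (`fit_pca`, `encode`, `decode`) are hypothetical. Only the latent dimension $d=5$ comes from the summary above.

```python
import numpy as np

def box_sdf_grid(half_extents, grid):
    """Signed distance from 3D grid points to an axis-aligned box."""
    q = np.abs(grid) - half_extents
    outside = np.linalg.norm(np.maximum(q, 0.0), axis=1)
    inside = np.minimum(np.max(q, axis=1), 0.0)
    return outside + inside

# Coarse 8x8x8 sampling grid, flattened to (512, 3).
axis = np.linspace(-3.0, 3.0, 8)
grid = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1).reshape(-1, 3)

# Hypothetical shape collection: 20 boxes with car-like half extents.
rng = np.random.default_rng(0)
extents = rng.uniform([1.8, 0.8, 0.6], [2.6, 1.1, 1.0], size=(20, 3))
sdfs = np.stack([box_sdf_grid(e, grid) for e in extents])  # shape (20, 512)

def fit_pca(data, d):
    """Return the mean shape and the top-d principal directions (rows)."""
    mean = data.mean(axis=0)
    _, _, vt = np.linalg.svd(data - mean, full_matrices=False)
    return mean, vt[:d]

def encode(sdf, mean, basis):
    """Project SDF values onto the PCA basis to obtain latent codes."""
    return (sdf - mean) @ basis.T

def decode(code, mean, basis):
    """Reconstruct SDF values from latent codes."""
    return mean + code @ basis

mean_sdf, basis = fit_pca(sdfs, d=5)
codes = encode(sdfs, mean_sdf, basis)   # shape (20, 5)
recon = decode(codes, mean_sdf, basis)  # shape (20, 512)
recon_error = float(np.mean((recon - sdfs) ** 2))

# Interpolate along the first latent dimension, holding the others at zero.
sweep_values = np.linspace(codes[:, 0].min(), codes[:, 0].max(), 5)
sweep = np.stack([
    decode(np.array([a, 0.0, 0.0, 0.0, 0.0]), mean_sdf, basis)
    for a in sweep_values
])  # shape (5, 512), one decoded SDF per interpolation step

print("reconstruction MSE with d=5:", recon_error)
```

Decoding the all-zero code returns the mean shape, which matches the use of a mean shape as the starting point in the Figure 5 caption. Since `decode` is linear in the code, varying one latent dimension changes the SDF smoothly, in line with the smooth variations the Figure 3 caption reports.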