SurroundSDF: Implicit 3D Scene Understanding Based on Signed Distance Field

Lizhe Liu, Bohua Wang, Hongwei Xie, Daqi Liu, Li Liu, Zhiqiang Tian, Kuiyuan Yang, Bing Wang

TL;DR

This paper proposes SurroundSDF, which implicitly predicts the signed distance field (SDF) and semantic field for continuous perception from surround images. It also introduces a weakly supervised paradigm for SDF, the Sandwich Eikonal formulation, which improves the perceptual accuracy of surfaces.

Abstract

Vision-centric 3D environment understanding is both vital and challenging for autonomous driving systems. Recently, object-free methods have attracted considerable attention. Such methods perceive the world by predicting the semantics of discrete voxel grids but fail to construct continuous and accurate obstacle surfaces. To this end, in this paper, we propose SurroundSDF to implicitly predict the signed distance field (SDF) and semantic field for the continuous perception from surround images. Specifically, we introduce a query-based approach and utilize SDF constrained by the Eikonal formulation to accurately describe the surfaces of obstacles. Furthermore, considering the absence of precise SDF ground truth, we propose a novel weakly supervised paradigm for SDF, referred to as the Sandwich Eikonal formulation, which emphasizes applying correct and dense constraints on both sides of the surface, thereby enhancing the perceptual accuracy of the surface. Experiments suggest that our method achieves SOTA for both occupancy prediction and 3D scene reconstruction tasks on the nuScenes dataset.
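
To make the Eikonal constraint mentioned in the abstract concrete: a true signed distance field f satisfies |∇f| = 1 almost everywhere, so the deviation of the gradient norm from 1 can serve as a regulariser. The sketch below checks this numerically on an analytic sphere SDF. It illustrates the plain Eikonal property only, not the paper's Sandwich Eikonal formulation; the function names, the finite-difference gradient, the squared penalty, and the negative-inside sign convention are all assumptions made for illustration.

```python
import math

def sphere_sdf(p, radius=1.0):
    """Signed distance from p to a sphere at the origin (assumed convention: negative inside)."""
    return math.sqrt(sum(c * c for c in p)) - radius

def grad_norm(f, p, h=1e-4):
    """Norm of the central finite-difference gradient of f at point p."""
    sq = 0.0
    for i in range(len(p)):
        hi, lo = list(p), list(p)
        hi[i] += h
        lo[i] -= h
        d = (f(hi) - f(lo)) / (2 * h)
        sq += d * d
    return math.sqrt(sq)

def eikonal_residual(f, points):
    """Mean squared deviation of |grad f| from 1 over the sample points."""
    return sum((grad_norm(f, p) - 1.0) ** 2 for p in points) / len(points)

# Hypothetical sample points, one inside and two outside the unit sphere.
points = [(0.3, 0.2, 0.1), (1.5, -0.5, 0.7), (-2.0, 1.0, 0.5)]

# A valid SDF has (numerically) zero residual.
residual_true = eikonal_residual(sphere_sdf, points)

# Scaling the field by 2 doubles the gradient norm, so it is penalised.
residual_scaled = eikonal_residual(lambda p: 2.0 * sphere_sdf(p), points)
```

In a learned setting the gradient would normally come from automatic differentiation rather than finite differences; the residual plays the same role either way.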

Paper Structure (24 sections, 13 equations, 8 figures, 7 tables)

Figures (8)

  • Figure 1: The scene perspective results from surround images input. a, b. The result of SIREN and LODE supervision with camera-only input (the original methods are point cloud-based). c. Our result. d. Our result with semantics.
  • Figure 2: a. Ideal supervision for SDF where the SDF GT is provided. b. SIREN supervision, which uses LiDAR points to supervise the surface. c. LODE supervision based on occupancy GT. d. Our supervision paradigm, which combines the LiDAR GT and occupancy GT and is closer to the ideal supervision.
  • Figure 3: The architecture of our SurroundSDF. Given the surround images as input, an encoder composed of a 2D backbone, LSS module, and BEV backbone is employed to extract voxel features. We adopt a query-based approach to sample features from the voxel features. Specifically, first, a set of query coordinates in the region of interest is selected. Subsequently, using trilinear interpolation, semantic features are queried from the voxel features. Finally, after concatenation with the positional embeddings from the query coordinates, the features pass through the SDF head and semantic head respectively, yielding SDF and semantic fields. For training, the query coordinates are sampled according to the GT, and the SDF and semantic field are supervised by the losses introduced in Section \ref{section:Loss}. In the inference phase, based on appropriate sampling and post-processing, continuous and accurate scene perception results are obtained (see Section \ref{Sec.inference}).
  • Figure 4: a. SDF constraints with Sandwich Eikonal formulation in continuous form. b. SDF sampling in discrete form.
  • Figure 5: Variation of occupancy IoU and semantic mIoU with SDF Threshold. Note that the peak values of these two indicators correspond to different SDF thresholds.
  • ...and 3 more figures
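
The query step described in the Figure 3 caption (pick a continuous coordinate, trilinearly interpolate the voxel features, concatenate a positional embedding) can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the authors' implementation: the nested-list grid layout, the sinusoidal embedding, its frequency count, and all function names are hypothetical, and the SDF and semantic heads that would consume the result are omitted.

```python
import math

def trilinear_sample(grid, x, y, z):
    """Interpolate a feature vector at continuous voxel-index coordinates.

    grid[i][j][k] is a list of C feature values. Coordinates are in index
    units and are clamped to the grid bounds.
    """
    dims = (len(grid), len(grid[0]), len(grid[0][0]))
    lo, hi, frac = [], [], []
    for v, n in zip((x, y, z), dims):
        v = min(max(v, 0.0), n - 1.0)
        v0 = int(math.floor(v))
        lo.append(v0)
        hi.append(min(v0 + 1, n - 1))
        frac.append(v - v0)
    channels = len(grid[0][0][0])
    out = [0.0] * channels
    # Blend the 8 surrounding voxels with their trilinear weights.
    for i, wx in ((lo[0], 1 - frac[0]), (hi[0], frac[0])):
        for j, wy in ((lo[1], 1 - frac[1]), (hi[1], frac[1])):
            for k, wz in ((lo[2], 1 - frac[2]), (hi[2], frac[2])):
                w = wx * wy * wz
                for c in range(channels):
                    out[c] += w * grid[i][j][k][c]
    return out

def positional_embedding(coord, num_freqs=2):
    """Sinusoidal embedding of a 3D coordinate (an assumed choice of embedding)."""
    emb = []
    for v in coord:
        for f in range(num_freqs):
            emb.append(math.sin((2 ** f) * v))
            emb.append(math.cos((2 ** f) * v))
    return emb

def query_features(grid, coord):
    """Concatenate interpolated voxel features with the positional embedding."""
    return trilinear_sample(grid, *coord) + positional_embedding(coord)

# Toy 2x2x2 grid with one channel equal to i + 2j + 4k. Trilinear
# interpolation reproduces such a linear function exactly.
grid = [[[[float(i + 2 * j + 4 * k)] for k in range(2)]
         for j in range(2)] for i in range(2)]
feat = query_features(grid, (0.5, 0.25, 0.75))
```

Because the query coordinate is continuous, the same voxel features can be probed at any resolution, which is what lets a query-based head produce a continuous field rather than a fixed voxel grid.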