Robust 3D Semantic Occupancy Prediction with Calibration-free Spatial Transformation

Zhuangwei Zhuang, Ziyin Wang, Sitao Chen, Lizhao Liu, Hui Luo, Mingkui Tan

TL;DR

This work proposes a calibration-free spatial transformation based on vanilla attention to implicitly model the spatial correspondence, and introduces 2D and 3D auxiliary training tasks to enhance the discrimination power of 2D backbones on spatial, semantic, and texture features.

Abstract

3D semantic occupancy prediction, which seeks to provide accurate and comprehensive representations of environment scenes, is important to autonomous driving systems. For autonomous cars equipped with multiple cameras and LiDAR, it is critical to aggregate multi-sensor information into a unified 3D space for accurate and robust predictions. Recent methods are mainly built on a 2D-to-3D transformation that relies on sensor calibration to project the 2D image information into the 3D space. These methods, however, suffer from two major limitations. First, they rely on accurate sensor calibration and are sensitive to calibration noise, which limits their application in complex real-world environments. Second, the spatial transformation layers are computationally expensive, which limits their deployment on an autonomous vehicle. In this work, we aim to develop a Robust and Efficient 3D semantic Occupancy (REO) prediction scheme. To this end, we propose a calibration-free spatial transformation based on vanilla attention to implicitly model the spatial correspondence. In this way, we robustly project the 2D features to a predefined BEV plane without using sensor calibration as input. Then, we introduce 2D and 3D auxiliary training tasks to enhance the discrimination power of 2D backbones on spatial, semantic, and texture features. Last, we propose a query-based prediction scheme to efficiently generate large-scale fine-grained occupancy predictions. By fusing point clouds that provide complementary spatial information, our REO surpasses the existing methods by a large margin on three benchmarks, including OpenOccupancy, Occ3D-nuScenes, and SemanticKITTI Scene Completion. For instance, our REO achieves a 19.8$\times$ speedup compared to Co-Occ, with a 1.1 improvement in geometry IoU on OpenOccupancy. Our code will be available at https://github.com/ICEORY/REO.

Paper Structure

This paper contains 24 sections, 17 equations, 11 figures, 18 tables, 1 algorithm.

Figures (11)

  • Figure 1: Comparison of existing methods [li2022bevformer, li2023voxformer, huang2023tri, cao2022monoscene] and our REO. Unlike existing methods that rely on sensor calibration to compute the reference points of the 3D voxels, our REO directly models the 2D-to-3D spatial correspondence with an attention scheme, without using sensor calibration. We simplify the architecture diagram for better illustration.
  • Figure 2: Comparison of the efficiency and performance of different methods on OpenOccupancy.
  • Figure 3: Overview of the proposed Robust and Efficient Occupancy (REO) prediction scheme. First, we extract image features from multiple cameras using a pre-trained 2D encoder and aggregate them with a feature aggregation module. Second, both multi-camera and LiDAR features are efficiently transformed to the BEV plane by the calibration-free spatial transformation modules. Third, we introduce 2D/3D auxiliary training tasks to ease the spatial projection and improve the model performance. Last, we use a query-based prediction scheme to efficiently generate predictions for queried voxels sampled from 3D space.
  • Figure 4: Illustration of the calibration-free spatial transformation for multiple cameras. PE indicates the positional embeddings. The features from multi-view images are projected to a predefined BEV plane with the vanilla attention scheme. For multi-sensor fusion, the learnable BEV queries are replaced by the projected BEV LiDAR features.
  • Figure 5: Illustration of the calibration-free spatial transformation for LiDAR. We assume that the LiDAR coordinate system differs from the ground-truth coordinate system. Therefore, we use REO to project the LiDAR features to the predefined BEV plane with an attention scheme.
  • ...and 6 more figures
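
The calibration-free spatial transformation described in the abstract and in the Figure 4 caption can be illustrated with a minimal NumPy sketch of vanilla cross-attention, in which BEV queries attend to image tokens from all cameras and no camera intrinsics or extrinsics are used. This is not the authors' implementation: the function name `calibration_free_bev`, the single-head layout, the tensor sizes, and the random stand-ins for the learned queries, positional embeddings (PE), and projection weights are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def calibration_free_bev(image_feats, image_pe, bev_queries, w_q, w_k, w_v):
    """Single-head cross-attention from BEV queries to image tokens.

    image_feats: (n_cams, n_tokens, d) features from the 2D encoder
    image_pe:    (n_cams, n_tokens, d) positional embeddings (hypothetical)
    bev_queries: (bev_h * bev_w, d) one query per BEV cell
    No sensor calibration is passed in: the 2D-to-BEV correspondence is
    expressed only through the attention weights.
    """
    d = bev_queries.shape[-1]
    # Flatten the tokens of all cameras into one sequence.
    tokens = (image_feats + image_pe).reshape(-1, d)
    q = bev_queries @ w_q
    k = tokens @ w_k
    v = tokens @ w_v
    # Scaled dot-product attention: each BEV cell weights every image token.
    attn = softmax(q @ k.T / np.sqrt(d), axis=-1)
    return attn @ v, attn

# Toy sizes, chosen only to keep the example small.
rng = np.random.default_rng(0)
n_cams, n_tokens, d = 6, 20, 16
bev_h, bev_w = 8, 8

image_feats = rng.normal(size=(n_cams, n_tokens, d))
image_pe = rng.normal(size=(n_cams, n_tokens, d))
# Random stand-in for learnable BEV queries.
bev_queries = rng.normal(size=(bev_h * bev_w, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

bev_feats, attn = calibration_free_bev(
    image_feats, image_pe, bev_queries, w_q, w_k, w_v
)
bev_map = bev_feats.reshape(bev_h, bev_w, d)
print(bev_map.shape)  # (8, 8, 16)
```

For multi-sensor fusion, the Figure 4 caption states that the learnable BEV queries are replaced by the projected BEV LiDAR features; in this sketch that would correspond to passing a LiDAR-derived array of the same shape as `bev_queries`.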