Tri-Perspective View for Vision-Based 3D Semantic Occupancy Prediction

Yuanhui Huang, Wenzhao Zheng, Yunpeng Zhang, Jie Zhou, Jiwen Lu

TL;DR

This work introduces a tri-perspective view (TPV) to represent 3D scenes with three orthogonal planes (top, side, front) and a transformer-based TPVFormer to lift multi-view images into this TPV space. By projecting points onto the three planes and summing sampled features, TPV preserves fine-grained 3D structure with lower complexity than voxel grids. Trained with sparse LiDAR supervision, TPVFormer achieves competitive vision-only performance on LiDAR segmentation and excels in 3D semantic occupancy and semantic scene completion tasks, even at arbitrary test-time resolutions. The approach demonstrates that multi-view TPV representations can effectively model outdoor 3D scenes for autonomous driving with improved efficiency and detail.

Abstract

Modern methods for vision-centric autonomous driving perception widely adopt the bird's-eye-view (BEV) representation to describe a 3D scene. Despite its better efficiency than voxel representation, it has difficulty describing the fine-grained 3D structure of a scene with a single plane. To address this, we propose a tri-perspective view (TPV) representation which accompanies BEV with two additional perpendicular planes. We model each point in the 3D space by summing its projected features on the three planes. To lift image features to the 3D TPV space, we further propose a transformer-based TPV encoder (TPVFormer) to obtain the TPV features effectively. We employ the attention mechanism to aggregate the image features corresponding to each query in each TPV plane. Experiments show that our model trained with sparse supervision effectively predicts the semantic occupancy for all voxels. We demonstrate for the first time that using only camera inputs can achieve comparable performance with LiDAR-based methods on the LiDAR segmentation task on nuScenes. Code: https://github.com/wzzheng/TPVFormer.
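The point-modeling step described in the abstract (project a point onto the three planes, then sum the features found there) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the plane layout (top plane indexed by `(h, w)`, side plane by `(d, h)`, front plane by `(w, d)`), and the use of nearest-cell lookup instead of interpolated sampling are all assumptions made for this example.

```python
def make_plane(rows, cols, channels, fill):
    """Build a rows x cols grid whose cells are `channels`-long feature vectors."""
    return [[[fill] * channels for _ in range(cols)] for _ in range(rows)]


def tpv_point_feature(top_hw, side_dh, front_wd, h, w, d):
    """Sum the features found at a point's three plane projections.

    (h, w, d) are integer grid indices of the point. Nearest-cell lookup is a
    simplification assumed here; a real model would likely interpolate.
    """
    t_hw = top_hw[h][w]    # projecting onto the top (BEV) plane drops d
    t_dh = side_dh[d][h]   # projecting onto the side plane drops w
    t_wd = front_wd[w][d]  # projecting onto the front plane drops h
    return [a + b + c for a, b, c in zip(t_hw, t_dh, t_wd)]


# Hypothetical toy sizes: H x W x D grid with C feature channels.
H, W, D, C = 4, 3, 2, 2
top = make_plane(H, W, C, 1.0)
side = make_plane(D, H, C, 10.0)
front = make_plane(W, D, C, 100.0)
top[2][1] = [5.0, 6.0]  # make one top-plane cell distinctive

feat = tpv_point_feature(top, side, front, h=2, w=1, d=0)
print(feat)
```

Points that differ only in `d` share the same top-plane cell, so with a single BEV plane they would be indistinguishable; the side and front planes are what let the summed feature vary along the height axis.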

Paper Structure (19 sections, 12 equations, 8 figures, 7 tables)

This paper contains 19 sections, 12 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Given only surround-camera RGB images as inputs, our model (trained using only sparse LiDAR point supervision) can predict the semantic occupancy for all volumes in the 3D space. This task is challenging as it requires both geometric and semantic understanding of the 3D scene. We observe that our model can produce even more comprehensive and consistent volume occupancy than the ground truth on the validation set (not seen during training) of nuScenes. Despite the lack of geometric inputs like LiDAR, our model can accurately identify the 3D positions and sizes of close and distant objects. In particular, our model even successfully identifies the partially occluded bicycle captured by only two LiDAR points, demonstrating the potential advantage of vision-based 3D semantic occupancy prediction.
  • Figure 2: An overview of our method for 3D semantic occupancy prediction. Taking camera images as inputs, the proposed TPVFormer only uses sparse LiDAR semantic labels for training but can effectively predict the semantic occupancy for all voxels.
  • Figure 3: Comparisons of the proposed TPV representation with voxel and BEV representation. While BEV is more efficient than the voxel representation, it discards the height information and cannot comprehensively describe a 3D scene.
  • Figure 4: Framework of the proposed TPVFormer for 3D semantic occupancy prediction. We employ an image backbone network to extract multi-scale features for multi-camera images. We then perform cross-attention to adaptively lift 2D features to the TPV space and use cross-view hybrid attention to enable the interactions between TPV planes. To predict the semantic occupancy of a point in the 3D space, we apply a lightweight prediction head on the sum of projected features on the three TPV planes.
  • Figure 5: Visualization results on 3D semantic occupancy prediction and nuScenes LiDAR segmentation. Our method can generate more comprehensive prediction results than the LiDAR segmentation ground truth.
  • ...and 3 more figures
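
The efficiency comparison in the Figure 3 caption can be made concrete by counting how many feature cells each representation stores. The sketch below assumes a grid of size H x W x D, so a voxel grid stores H·W·D cells, a BEV plane H·W, and the three TPV planes H·W + D·H + W·D. The function name and the grid sizes are illustrative choices, not values taken from this page.

```python
def cell_counts(h, w, d):
    """Return the number of feature cells stored by each scene representation."""
    voxel = h * w * d            # one cell per 3D voxel
    bev = h * w                  # a single top-down plane
    tpv = h * w + d * h + w * d  # top + side + front planes
    return {"voxel": voxel, "bev": bev, "tpv": tpv}


# Illustrative grid size (an assumption for this example).
counts = cell_counts(100, 100, 8)
for name, n in counts.items():
    print(f"{name:>5}: {n}")
```

With these sizes the voxel grid needs 80,000 cells, BEV 10,000, and TPV 11,600, so TPV adds height information at a cost close to BEV rather than to the full voxel grid.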