TEOcc: Radar-camera Multi-modal Occupancy Prediction via Temporal Enhancement

Zhiwei Lin, Hongbo Jin, Yongtao Wang, Yufei Wei, Nan Dong

TL;DR

TEOcc is a radar-camera multi-modal occupancy prediction network with temporal enhancement. Inspired by the success of temporal information in 3D object detection, it achieves state-of-the-art occupancy prediction on nuScenes benchmarks.

Abstract

As a novel 3D scene representation, semantic occupancy has gained much attention in autonomous driving. However, existing occupancy prediction methods mainly focus on designing better occupancy representations, such as tri-perspective view or neural radiance fields, while ignoring the advantages of long-term temporal information. In this paper, we propose a radar-camera multi-modal temporal enhanced occupancy prediction network, dubbed TEOcc. Our method is inspired by the success of utilizing temporal information in 3D object detection. Specifically, we introduce a temporal enhancement branch to learn temporal occupancy prediction. In this branch, we randomly discard the multi-view camera input frame at time t-k and predict its 3D occupancy with separate long-term and short-term temporal decoders, using information from the other adjacent frames and the multi-modal inputs. To reduce computational cost and incorporate multi-modal inputs, we design dedicated 3D convolutional layers for the long-term and short-term temporal decoders. Furthermore, since the lightweight occupancy prediction head is a dense classification head, we share a single occupancy prediction head between the temporal enhancement branch and the main branch. The temporal enhancement branch is used only during training and is discarded at inference. Experimental results demonstrate that TEOcc achieves state-of-the-art occupancy prediction on nuScenes benchmarks. In addition, the proposed temporal enhancement branch is a plug-and-play module that can be easily integrated into existing occupancy prediction methods to improve their performance. The code and models will be released at https://github.com/VDIGPKU/TEOcc.
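The frame-discarding step of the temporal enhancement branch can be illustrated with a small sketch. It is not the released TEOcc code: the function name `plan_temporal_enhancement`, the restriction to interior frames, and the choice of which frames feed each decoder (immediate neighbours for the short-term decoder, all remaining frames for the long-term decoder) are assumptions made for illustration. Only the random discard of frame t-k and the training-only behaviour come from the abstract.

```python
import random


def plan_temporal_enhancement(num_frames, training, rng=random):
    """Return which history frame to discard and what each decoder sees.

    Frames are indexed by their offset k from the current time t:
    0 is frame t and num_frames - 1 is the oldest frame.
    Returns None at inference time, because the branch is training-only.
    """
    if not training:
        return None
    if num_frames < 3:
        raise ValueError("need at least 3 frames so the dropped frame has two neighbours")
    # Assumption: only interior frames are dropped, so both neighbours exist.
    k = rng.randint(1, num_frames - 2)
    kept = [i for i in range(num_frames) if i != k]
    return {
        "dropped": k,
        "short_term_inputs": [k - 1, k + 1],  # assumed: immediate neighbours only
        "long_term_inputs": kept,             # assumed: every remaining frame
    }
```

In a training loop, the features of the frames listed under each key would be passed to the corresponding decoder, and both decoder outputs would go through the same occupancy head as the main branch.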

Paper Structure

This paper contains 15 sections, 1 equation, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Differences between HoP and the proposed TEOcc. TEOcc uses independent long-term and short-term temporal decoders for 3D voxel feature generation and a shared head for occupancy prediction. Besides, TEOcc can incorporate radar-camera multi-modal inputs.
  • Figure 2: Overall pipeline of TEOcc. First, multi-frame multi-view camera features are extracted with an image encoder. The extracted 2D image features are transformed into 3D image voxel features with a 2D-3D view transformation module. In parallel, a radar encoder and a voxel encoder extract radar voxel features. In the main branch, all temporal image voxel features and radar voxel features are kept to predict the final occupancy results. In the temporal enhancement branch, we discard one image voxel feature and use the long-term and short-term decoders to generate the corresponding pseudo features. Finally, a shared occupancy head predicts occupancy from the generated pseudo voxel features.
  • Figure 3: Architecture of the temporal enhancement module. The long-term temporal decoder consists of a ResNet-3D backbone and an FPN-3D neck to process multi-scale 3D voxel features. The short-term decoder is composed of two 3D convolution layers.
  • Figure 4: Architecture of ResNet-3D. ResNet-3D has three stages. Each stage consists of several 3D BasicBlocks.
  • Figure 5: Architecture of FPN-3D. We upsample multi-scale voxel features into one scale and fuse them with a 3D convolution layer.
  • ...and 2 more figures
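
The FPN-3D fusion described in the Figure 5 caption (upsample multi-scale voxel features to one scale, then fuse with a 3D convolution) can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: nearest-neighbour upsampling, a 1x1x1 kernel for the fusing convolution, isotropic scale factors, and the function names `upsample_nearest` and `fuse_multiscale` are all choices made here, and NumPy stands in for a deep learning framework.

```python
import numpy as np


def upsample_nearest(x, factor):
    """Nearest-neighbour upsampling of a (C, D, H, W) voxel grid."""
    for axis in (1, 2, 3):
        x = np.repeat(x, factor, axis=axis)
    return x


def fuse_multiscale(features, weight, bias):
    """Bring every level to the finest resolution and fuse the channels.

    features: list of (C_i, D_i, H_i, W_i) arrays, finest level first.
    weight:   (C_out, sum(C_i)) matrix, i.e. a 1x1x1 3D convolution kernel.
    bias:     (C_out,) vector.
    """
    target = features[0].shape[1:]
    upsampled = []
    for f in features:
        # Assumption: each level is an integer, isotropic downscale of the finest one.
        factor = target[0] // f.shape[1]
        up = upsample_nearest(f, factor)
        if up.shape[1:] != target:
            raise ValueError("level cannot be upsampled to the finest resolution")
        upsampled.append(up)
    stacked = np.concatenate(upsampled, axis=0)  # concatenate along channels
    # A 1x1x1 convolution is a per-voxel linear map over the channel axis.
    return np.einsum("oc,cdhw->odhw", weight, stacked) + bias[:, None, None, None]
```

For example, three levels with 4, 8, and 16 channels at resolutions 8, 4, and 2 per side are fused into a single grid at resolution 8, with the output channel count set by the number of rows in `weight`.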