Inverse++: Vision-Centric 3D Semantic Occupancy Prediction Assisted with 3D Object Detection

Zhenxing Ming, Julie Stephany Berrio, Mao Shan, Stewart Worrall

TL;DR

This paper tackles 3D semantic occupancy prediction for autonomous driving using surround-view cameras. It introduces Inverse++, a vision-centric framework that adds a 3D object detection auxiliary branch to provide a second 3D supervision signal, implemented through a query-based sampling and multi-scale cross-attention mechanism that refines intermediate features. The approach achieves state-of-the-art results on nuScenes, with IoU $=31.73\%$ and mIoU $=20.91\%$, and notably excels at detecting vulnerable road users, including under challenging rain and night conditions, while maintaining competitive efficiency. The work demonstrates that dual 3D supervision signals and targeted feature refinement yield robust 3D occupancy maps that better capture the small, dynamic objects critical for driving safety.
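To make the two headline metrics concrete, here is a minimal sketch of how IoU and mIoU are commonly computed for voxel occupancy. It assumes the usual benchmark convention that IoU is class-agnostic (occupied versus free) and that mIoU averages per-class IoU over the non-empty semantic classes. The label `EMPTY = 0`, the function names, and the flattened-list input are illustrative assumptions, not taken from the paper's code.

```python
from typing import List, Sequence

EMPTY = 0  # assumed label for free space (hypothetical convention)


def occupancy_iou(pred: Sequence[int], gt: Sequence[int]) -> float:
    """Class-agnostic IoU: a voxel counts as occupied if its label is not EMPTY."""
    inter = sum(1 for p, g in zip(pred, gt) if p != EMPTY and g != EMPTY)
    union = sum(1 for p, g in zip(pred, gt) if p != EMPTY or g != EMPTY)
    return inter / union if union else 0.0


def semantic_miou(pred: Sequence[int], gt: Sequence[int], num_classes: int) -> float:
    """Mean of per-class IoU over semantic classes 1..num_classes-1.

    Simplification: classes absent from both prediction and ground truth
    are skipped rather than counted as zero.
    """
    ious: List[float] = []
    for c in range(1, num_classes):
        inter = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        union = sum(1 for p, g in zip(pred, gt) if p == c or g == c)
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: six voxels, labels 0 (empty), 1, and 2.
pred = [0, 1, 1, 2, 0, 2]
gt = [0, 1, 2, 2, 1, 2]
print(occupancy_iou(pred, gt))
print(semantic_miou(pred, gt, num_classes=3))
```

In practice, benchmark scores are usually accumulated as intersection and union counts over the whole validation set before dividing, rather than averaged per sample as in this toy example.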

Abstract

3D semantic occupancy prediction aims to forecast detailed geometric and semantic information of the surrounding environment for autonomous vehicles (AVs) using onboard surround-view cameras. Existing methods primarily focus on intricate internal module designs to improve model performance, such as efficient feature sampling and aggregation processes or intermediate feature representation formats. In this paper, we explore multitask learning by incorporating a 3D object detection auxiliary branch, which introduces an additional 3D supervision signal. This extra signal enhances the model's overall performance by strengthening the capability of the intermediate features to capture small dynamic objects in the scene. These objects often include vulnerable road users (VRUs), such as bicycles, motorcycles, and pedestrians, whose detection is crucial for driving safety. Extensive experiments conducted on the nuScenes dataset, including challenging rainy and nighttime scenarios, show that our approach attains state-of-the-art results, achieving an IoU of 31.73% and an mIoU of 20.91%, and excels at detecting VRUs. The code will be made available at: https://github.com/DanielMing123/Inverse++
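As a rough illustration of the multitask idea described above, a generic auxiliary-branch training objective can be written as a weighted sum of the main occupancy loss and the detection loss. The symbols below are assumptions for illustration only: $\lambda$ is a hypothetical weighting factor, and the paper's own loss terms (defined in its equations, not reproduced here) may differ in form.

$$
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{occ}} + \lambda \, \mathcal{L}_{\text{det}}
$$

Under this reading, $\mathcal{L}_{\text{det}}$ acts as the second 3D supervision signal: its gradients flow into the shared intermediate features and push them to represent small dynamic objects, while the occupancy prediction remains the main output.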

Paper Structure

This paper contains 29 sections, 19 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: The pipeline of four approaches: the single supervision signal-based, the dual supervision signal-based, and our proposed approach. To introduce an additional 3D supervision signal during training, we incorporate a 3D object detection auxiliary branch.
  • Figure 2: Overall architecture of Inverse++. The pipeline comprises two branches: the main branch includes an image encoder for extracting multi-scale visual features, global and local view transformations to produce intermediate multi-scale global BEV features and 3D feature volumes, global-local attention fusion to yield merged multi-scale 3D feature volumes, and a UNet-like Encoder-Decoder structure for further feature refinement, culminating in the final multi-scale 3D volume logits. The 3D object detection auxiliary branch introduces an extra 3D supervision signal that applies to visual features, multi-scale global BEV features, and multi-scale 3D volume logits. This auxiliary branch enhances the model's capability to effectively capture small dynamic objects.
  • Figure 3: Performance variation trends for the 3D semantic occupancy prediction task. (a) mIoU on the whole SurroundOcc-nuScenes validation set, (b) mIoU on the rainy-scenario subset, and (c) mIoU on the night-scenario subset. (d) IoU on the whole SurroundOcc-nuScenes validation set, (e) IoU on the rainy-scenario subset, and (f) IoU on the night-scenario subset. Better viewed when zoomed in.
  • Figure 4: Qualitative results for daytime, rainy, and nighttime scenarios, displayed in the upper, middle, and bottom sections, respectively. Better viewed when zoomed in. Modality notation: Camera (C), LiDAR (L), Radar (R).