Inverse++: Vision-Centric 3D Semantic Occupancy Prediction Assisted with 3D Object Detection

Zhenxing Ming; Julie Stephany Berrio; Mao Shan; Stewart Worrall

TL;DR

This paper tackles 3D semantic occupancy prediction for autonomous driving from surround-view cameras. It introduces Inverse++, a vision-centric framework that adds an auxiliary 3D object detection branch to provide a second 3D supervision signal, implemented through query-based sampling and a multi-scale cross-attention mechanism that refine intermediate features. The approach achieves state-of-the-art results on nuScenes with IoU $=31.73\%$ and mIoU $=20.91\%$, excels at detecting vulnerable road users under challenging rain and night conditions, and maintains competitive efficiency. The work shows that dual 3D supervision signals and targeted feature refinement yield robust 3D occupancy maps that better capture the small, dynamic objects critical for driving safety.

Abstract

3D semantic occupancy prediction aims to forecast detailed geometric and semantic information about the environment surrounding autonomous vehicles (AVs) using onboard surround-view cameras. Existing methods primarily focus on intricate internal module designs to improve model performance, such as efficient feature sampling and aggregation processes or intermediate feature representation formats. In this paper, we explore multitask learning by incorporating an auxiliary 3D object detection branch that provides an additional 3D supervision signal. This signal enhances the model's overall performance by strengthening the ability of the intermediate features to capture small dynamic objects in the scene. These objects often include vulnerable road users (VRUs), i.e., bicycles, motorcycles, and pedestrians, whose detection is crucial for driving safety. Extensive experiments on the nuScenes dataset, including challenging rainy and nighttime scenarios, show that our approach attains state-of-the-art results, achieving an IoU of 31.73% and an mIoU of 20.91%, and excels at detecting VRUs. The code will be made available at: https://github.com/DanielMing123/Inverse++
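
For readers unfamiliar with the two reported metrics, the sketch below shows how IoU and mIoU are commonly computed for a voxel grid. It is not the authors' evaluation code. It assumes that label 0 marks free space, that IoU is class-agnostic (occupied versus free), and that mIoU averages per-class IoU over the semantic classes present in either grid; the function names `occupancy_iou` and `semantic_miou` and the toy labels are hypothetical. Benchmarks typically accumulate intersections and unions over the whole validation set rather than over a single grid.

```python
from typing import Sequence

FREE = 0  # assumed label for empty voxels (hypothetical convention)


def occupancy_iou(pred: Sequence[int], gt: Sequence[int],
                  free_label: int = FREE) -> float:
    """Class-agnostic IoU: a voxel counts as occupied if it is not free."""
    inter = sum(1 for p, g in zip(pred, gt)
                if p != free_label and g != free_label)
    union = sum(1 for p, g in zip(pred, gt)
                if p != free_label or g != free_label)
    return inter / union if union else 0.0


def semantic_miou(pred: Sequence[int], gt: Sequence[int],
                  num_classes: int, free_label: int = FREE) -> float:
    """Mean of per-class IoU over non-free classes that appear in pred or gt."""
    ious = []
    for c in range(num_classes):
        if c == free_label:
            continue
        inter = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        union = sum(1 for p, g in zip(pred, gt) if p == c or g == c)
        if union:  # skip classes absent from both grids
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy flattened grids with labels 0 (free), 1, and 2.
pred = [0, 1, 1, 2, 0, 2]
gt = [0, 1, 2, 2, 1, 0]
print(occupancy_iou(pred, gt))     # geometric IoU
print(semantic_miou(pred, gt, 3))  # semantic mIoU
```

In this toy case the geometric IoU is 3/5 while each semantic class scores 1/3, which illustrates why mIoU is usually the lower and stricter of the two numbers: a voxel can be correctly marked occupied yet carry the wrong class.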
