OccFusion: Depth Estimation Free Multi-sensor Fusion for 3D Occupancy Prediction

Ji Zhang; Yiran Ding; Zixin Liu

TL;DR

OccFusion addresses the robustness and efficiency gaps in 3D occupancy prediction by removing depth estimation from image feature fusion and leveraging a point-to-point LiDAR-camera fusion via deformable attention. It introduces an active coarse-to-fine decoder and a simple, transferable active training strategy to focus learning on hard examples, reducing computation without sacrificing accuracy. Experiments on nuScenes-Occupancy and nuScenes-Occ3D show state-of-the-art or competitive mIoU, with especially large gains for small objects and substantial reductions in computation. The approach offers a practical, generalizable framework for depth-free multi-modal fusion in autonomous driving perception.

Abstract

3D occupancy prediction based on multi-sensor fusion, crucial for a reliable autonomous driving system, enables fine-grained understanding of 3D scenes. Previous fusion-based 3D occupancy prediction methods relied on depth estimation for processing 2D image features. However, depth estimation is an ill-posed problem, which hinders the accuracy and robustness of these methods. Furthermore, fine-grained occupancy prediction demands extensive computational resources. To address these issues, we propose OccFusion, a depth estimation free multi-modal fusion framework. Additionally, we introduce a generalizable active training method and an active decoder that can be applied to any occupancy prediction model, with the potential to enhance their performance. Experiments conducted on nuScenes-Occupancy and nuScenes-Occ3D demonstrate our framework's superior performance. Detailed ablation studies highlight the effectiveness of each proposed method.
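The active decoder is described here only at a high level: it refines predictions for the voxels the model is least certain about. As a minimal sketch of that idea, the snippet below ranks voxels by the entropy of their coarse class probabilities and returns the most uncertain fraction. The entropy criterion, the function name `select_uncertain_voxels`, and the `ratio` parameter are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def select_uncertain_voxels(logits, ratio=0.25):
    """Pick the most uncertain voxels from coarse per-voxel class logits.

    logits: (N, C) array, one row per voxel.
    ratio:  hypothetical fraction of voxels to pass on for refinement.
    Returns (indices of selected voxels, per-voxel entropy).
    """
    # Numerically stable softmax over the class axis.
    z = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(z)
    p /= p.sum(axis=1, keepdims=True)
    # Shannon entropy as the uncertainty score (an assumption of this sketch).
    entropy = -(p * np.log(p + 1e-12)).sum(axis=1)
    k = max(1, int(round(ratio * len(entropy))))
    # Highest-entropy voxels first.
    idx = np.argsort(-entropy)[:k]
    return idx, entropy
```

Only the returned indices would be passed to a finer (and more expensive) decoding stage. The remaining voxels keep their coarse prediction, which is where the computational saving would come from.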

Paper Structure (23 sections, 5 equations, 6 figures, 6 tables, 1 algorithm)

This paper contains 23 sections, 5 equations, 6 figures, 6 tables, 1 algorithm.

Introduction
Related Work
Vision-Based 3D Occupancy Prediction
Feature Fusion of Camera and LiDAR
Active Learning and Hard Example Mining
Method
Overview
3D LiDAR Feature Extraction and LiDAR Point Sampling Algorithm
Camera Feature Extraction and OccFusion: Point-to-Point Multi-modal Feature Fusion
Active Coarse to Fine Pipeline
Active Training Method
Experiments
Experimental Setup
Dataset and Metrics
Implementation Details
...and 8 more sections

Figures (6)

Figure 1: Visualization of our coarse-grained and fine-grained prediction results. The first row shows the ground truth and prediction for two coarse-grained samples, while the second row displays the ground truth and prediction for the same two samples at a fine-grained level. Better viewed when zoomed in.
Figure 2: Comparison of our method with an existing SOTA multi-modal baseline (iccv02) on challenging samples. The first row compares M-baseline (iccv02) with our proposed OccFusion on the coarse occupancy prediction task, while the second row compares M-CONet (iccv02) with our Active M-CONet. Better viewed when zoomed in.
Figure 3: The overall architecture of our method. Raw LiDAR points are processed by a 3D encoder to extract voxelized features, which, concatenated with point coordinates, serve as queries. Multi-view image features, obtained directly through a 2D encoder from surround-view images, act as keys. Enhanced point clouds are then subjected to point-to-point fusion, resulting in multi-modal 3D voxel features. An active decoder adaptively refines predictions in challenging areas.
Figure 4: Details of the OccFusion module. After pre-sampling, 3D reference points are projected onto images as (2D) reference points. Note that synthetic point clouds (points within circles) do not contribute to LiDAR feature generation. Due to overlapping fields of view among cameras, a single 3D reference point may correspond to multiple reference points upon projection. Features corresponding to reference points are averaged to derive a feature for each 3D reference point, which are then averaged to obtain a multi-modal feature for a voxel.
Figure 5: Active coarse to fine pipeline. We refine features only for voxels with greater uncertainty.
...and 1 more figures
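
The Figure 4 caption describes three steps: project 3D reference points onto each camera image, average the image features from every camera that sees a point, then average the point features within each voxel. The sketch below illustrates that data flow under simplifying assumptions. It uses a nearest-pixel lookup in place of the deformable attention used in the paper, and the function names, the 3x4 projection-matrix convention, and the array shapes are all hypothetical.

```python
import numpy as np

def project_points(points, cam_matrix, img_hw):
    """Project (N, 3) points with a (3, 4) matrix; return pixel coords and a validity mask."""
    homo = np.concatenate([points, np.ones((len(points), 1))], axis=1)  # (N, 4)
    uvw = homo @ cam_matrix.T                                           # (N, 3)
    depth = uvw[:, 2]
    valid = depth > 1e-6            # keep only points in front of the camera
    uv = np.zeros((len(points), 2))
    uv[valid] = uvw[valid, :2] / depth[valid, None]
    h, w = img_hw
    valid &= (uv[:, 0] >= 0) & (uv[:, 0] <= w - 1)
    valid &= (uv[:, 1] >= 0) & (uv[:, 1] <= h - 1)
    return uv, valid

def fuse_point_features(points, feats, cam_matrices):
    """Average image features over all cameras that see each 3D point.

    feats: (V, H, W, C) feature maps, one per camera view.
    Returns (N, C) per-point features and a mask of points seen by any camera.
    """
    acc = np.zeros((len(points), feats.shape[-1]))
    cnt = np.zeros(len(points))
    for fmap, cam in zip(feats, cam_matrices):
        uv, valid = project_points(points, cam, fmap.shape[:2])
        cols = np.rint(uv[valid, 0]).astype(int)
        rows = np.rint(uv[valid, 1]).astype(int)
        acc[valid] += fmap[rows, cols]   # nearest-pixel lookup (simplification)
        cnt[valid] += 1
    seen = cnt > 0
    acc[seen] /= cnt[seen, None]
    return acc, seen

def voxel_average(point_feats, voxel_ids, num_voxels):
    """Average per-point features into their voxels; empty voxels stay zero."""
    out = np.zeros((num_voxels, point_feats.shape[1]))
    cnt = np.zeros(num_voxels)
    np.add.at(out, voxel_ids, point_feats)
    np.add.at(cnt, voxel_ids, 1)
    filled = cnt > 0
    out[filled] /= cnt[filled, None]
    return out
```

Because the 3D positions come directly from the point cloud, no per-pixel depth has to be predicted from the images. The projection needs only the camera calibration.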
