MS-Occ: Multi-Stage LiDAR-Camera Fusion for 3D Semantic Occupancy Prediction

Zhiqiang Wei, Lianqing Zheng, Jianan Liu, Tao Huang, Qing-Long Han, Wenwen Zhang, Fengdeng Zhang

TL;DR

MS-Occ addresses the challenge of robust 3D semantic occupancy prediction by integrating the geometric fidelity of LiDAR with rich image semantics through a novel multi-stage fusion pipeline. It combines middle-stage fusion (Gaussian-Geo for geometry-enhanced camera features and Semantic-Aware deformable attention to inject semantics into LiDAR voxels) with late-stage voxel fusion (Adaptive Fusion and High Classification Confidence Voxel Fusion) to align and refine multi-modal voxel representations. The approach delivers state-of-the-art results on nuScenes-OpenOccupancy and SemanticKITTI, with notable improvements for small, safety-critical objects such as vulnerable road users (VRUs), while maintaining parameter efficiency. These results indicate that explicit cross-modal, multi-stage fusion can substantially enhance 3D occupancy perception, which is critical for autonomous driving safety in complex environments.

Abstract

Accurate 3D semantic occupancy perception is essential for autonomous driving in complex environments with diverse and irregular objects. While vision-centric methods suffer from geometric inaccuracies, LiDAR-based approaches often lack rich semantic information. To address these limitations, we propose MS-Occ, a novel multi-stage LiDAR-camera fusion framework that combines middle-stage and late-stage fusion, integrating LiDAR's geometric fidelity with camera-based semantic richness via hierarchical cross-modal fusion. The framework introduces innovations at two critical stages: (1) In the middle-stage feature fusion, the Gaussian-Geo module leverages Gaussian kernel rendering on sparse LiDAR depth maps to enhance 2D image features with dense geometric priors, and the Semantic-Aware module enriches LiDAR voxels with semantic context via deformable cross-attention; (2) In the late-stage voxel fusion, the Adaptive Fusion (AF) module dynamically balances voxel features across modalities, while the High Classification Confidence Voxel Fusion (HCCVF) module resolves semantic inconsistencies using self-attention-based refinement. Experiments on two large-scale benchmarks demonstrate state-of-the-art performance. On nuScenes-OpenOccupancy, MS-Occ achieves an Intersection over Union (IoU) of 32.1% and a mean IoU (mIoU) of 25.3%, surpassing the previous state of the art by +0.7% IoU and +2.4% mIoU. Furthermore, on the SemanticKITTI benchmark, our method achieves a new state-of-the-art mIoU of 24.08%, which supports its generalization capability. Ablation studies further confirm the effectiveness of each individual module, highlighting substantial improvements in the perception of small objects and reinforcing the practical value of MS-Occ for safety-critical autonomous driving scenarios.
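
The abstract reports two metrics, IoU and mIoU. As a rough illustration of how such numbers are commonly computed on occupancy grids, the sketch below treats IoU as a class-agnostic occupied-versus-empty score and mIoU as the average of per-class IoUs over the semantic classes. The label id `EMPTY = 0`, the function names, and the choice to skip classes absent from both prediction and ground truth are assumptions made for this example; the benchmarks' official evaluation scripts may differ in these details.

```python
from typing import List, Sequence

EMPTY = 0  # hypothetical label id for free space (assumption, not from the paper)


def geometric_iou(pred: Sequence[int], gt: Sequence[int], empty: int = EMPTY) -> float:
    """Class-agnostic IoU: a voxel counts as occupied if its label is not `empty`."""
    inter = sum(1 for p, g in zip(pred, gt) if p != empty and g != empty)
    union = sum(1 for p, g in zip(pred, gt) if p != empty or g != empty)
    return inter / union if union else float("nan")


def mean_iou(pred: Sequence[int], gt: Sequence[int], num_classes: int,
             empty: int = EMPTY) -> float:
    """Average of per-class IoUs over all non-empty classes.

    Classes that appear in neither `pred` nor `gt` are skipped (an assumption).
    """
    ious: List[float] = []
    for c in range(num_classes):
        if c == empty:
            continue
        inter = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        union = sum(1 for p, g in zip(pred, gt) if p == c or g == c)
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else float("nan")


if __name__ == "__main__":
    # Six flattened voxels with labels in {0: empty, 1, 2}.
    pred = [0, 1, 1, 2, 2, 0]
    gt = [0, 1, 2, 2, 2, 1]
    print(geometric_iou(pred, gt))  # 4 shared occupied voxels out of 5 -> 0.8
    print(mean_iou(pred, gt, 3))    # mean of 1/3 and 2/3, about 0.5
```

The toy example shows why the two numbers can diverge: the prediction finds most occupied voxels (high IoU) while confusing their classes (lower mIoU).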

Paper Structure

This paper contains 21 sections, 9 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Comparison of occupancy fusion stages in encoder design. (a) Middle-stage fusion (e.g., OccLoff), where 2D image features are directly fused with 3D LiDAR voxel features. (b) Late-stage fusion, including methods such as OpenOccupancy [wang2023openoccupancy], Co-Occ, OccGen, and EFFOcc, which explicitly lift 2D image features to 3D voxel features (e.g., using LSS), followed by 3D fusion in voxel space. (c) Multi-stage fusion, as proposed in our MS-Occ, integrates both middle-stage and late-stage fusion within a hybrid architecture.
  • Figure 2: The overall framework of MS-Occ, with the four proposed modules highlighted in light yellow. The pipeline consists of two main stages: middle-stage fusion and late-stage fusion. In the middle-stage fusion, camera and LiDAR features are fused to produce geometry-enhanced image features via the Gaussian-Geo module, and semantically enriched LiDAR features via the Semantic-Aware module. Subsequently, in the late-stage fusion, the AF module and the HCCVF module are applied in parallel to integrate the fused representations and generate the final 3D semantic occupancy grid of the scene.
  • Figure 3: Illustration of intermediate outputs from the proposed Gaussian-Geo module. The LiDAR point cloud is densified using a Gaussian kernel, and the resulting geometric information is subsequently transferred to the image to enhance its spatial representation.
  • Figure 4: Illustration of the proposed Semantic-Aware module. Semantic information from the camera modality, aligned with geometric features, is transferred to the LiDAR modality to enrich voxel-level representations.
  • Figure 5: Visualization results on the nuScenes-OpenOccupancy dataset [nuscenes2019, wang2023openoccupancy]. The leftmost column shows the surround-view images. The next three columns present 3D semantic occupancy predictions from M-CONet, MS-Occ (ours), and the ground truth. Please zoom in for finer details.
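
The densification step mentioned for Figure 3 can be pictured with a small sketch. This page does not give the exact formulation used by the Gaussian-Geo module, so the code below is only one plausible reading: each valid LiDAR depth pixel is spread onto its neighbours with a Gaussian weight, and each output pixel takes the weight-normalized average of the depths that reach it. The function name `gaussian_densify`, the parameters `sigma` and `radius`, and the convention that `0.0` marks a missing depth are all assumptions made for illustration.

```python
import math
from typing import List


def gaussian_densify(sparse: List[List[float]], sigma: float = 1.0,
                     radius: int = 2) -> List[List[float]]:
    """Spread sparse depth values with a Gaussian kernel (illustrative sketch).

    `sparse` is a 2D depth map in which 0.0 means "no LiDAR return" (assumption).
    Pixels that receive no contribution stay at 0.0.
    """
    h, w = len(sparse), len(sparse[0])
    num = [[0.0] * w for _ in range(h)]  # weighted depth sum per pixel
    den = [[0.0] * w for _ in range(h)]  # weight sum per pixel
    for y in range(h):
        for x in range(w):
            d = sparse[y][x]
            if d <= 0.0:
                continue  # skip pixels without a depth measurement
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        wgt = math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))
                        num[yy][xx] += wgt * d
                        den[yy][xx] += wgt
    return [[num[y][x] / den[y][x] if den[y][x] > 0.0 else 0.0
             for x in range(w)] for y in range(h)]


if __name__ == "__main__":
    # One image row with two LiDAR hits and a gap between them.
    row = [[10.0, 0.0, 20.0]]
    print(gaussian_densify(row, sigma=1.0, radius=1))
```

In this toy row the gap pixel lies at equal distance from both measurements, so it receives their equally weighted average, while the measured pixels keep their own depth because neither lies within the other's radius. A real pipeline would operate on projected LiDAR depth images with a tensor library rather than nested lists.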