MR-Occ: Efficient Camera-LiDAR 3D Semantic Occupancy Prediction Using Hierarchical Multi-Resolution Voxel Representation

Minjae Seong, Jisong Kim, Geonho Bang, Hawook Jeong, Jun Won Choi

TL;DR

This work addresses the efficiency-robustness trade-off in 3D semantic occupancy prediction for autonomous driving by proposing MR-Occ, a camera-LiDAR fusion framework. It introduces three key components: the Pixel to Voxel Fusion Network (PVF-Net), which densifies LiDAR features and fuses them with camera features through deformable attention; Hierarchical Voxel Feature Refinement (HVFR), which selectively refines core voxels at multiple resolutions; and the Multi-scale Occupancy Decoder (MOD), which adds an 'occluded' class to handle regions not visible to the sensors. MR-Occ achieves state-of-the-art results on the nuScenes-Occupancy dataset, improving on previous approaches by 5.2% in IoU and 5.3% in mIoU while using fewer parameters and FLOPs, and also performs strongly on SemanticKITTI, indicating that it generalizes across benchmarks. By explicitly modeling occlusion and concentrating computation on informative voxels, MR-Occ offers an efficient and robust approach to 3D semantic occupancy prediction in urban environments.

Abstract

Accurate 3D perception is essential for understanding the environment in autonomous driving. Recent advancements in 3D semantic occupancy prediction have leveraged camera-LiDAR fusion to improve robustness and accuracy. However, current methods allocate computational resources uniformly across all voxels, leading to inefficiency, and they also fail to adequately address occlusions, resulting in reduced accuracy in challenging scenarios. We propose MR-Occ, a novel approach for camera-LiDAR fusion-based 3D semantic occupancy prediction, addressing these challenges through three key components: Hierarchical Voxel Feature Refinement (HVFR), Multi-scale Occupancy Decoder (MOD), and Pixel to Voxel Fusion Network (PVF-Net). HVFR improves performance by enhancing features for critical voxels, reducing computational cost. MOD introduces an 'occluded' class to better handle regions obscured from sensor view, improving accuracy. PVF-Net leverages densified LiDAR features to effectively fuse camera and LiDAR data through a deformable attention mechanism. Extensive experiments demonstrate that MR-Occ achieves state-of-the-art performance on the nuScenes-Occupancy dataset, surpassing previous approaches by +5.2% in IoU and +5.3% in mIoU while using fewer parameters and FLOPs. Moreover, MR-Occ demonstrates superior performance on the SemanticKITTI dataset, further validating its effectiveness and generalizability across diverse 3D semantic occupancy benchmarks.
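
The abstract reports gains in IoU and mIoU. As a rough illustration of what these two numbers usually measure in occupancy benchmarks, the sketch below computes a class-agnostic IoU (occupied vs. free) and a mean IoU over semantic classes. The function names, the choice of label 0 for free space, and the rule of skipping classes absent from both prediction and ground truth are assumptions made for this example, not details taken from the paper or its evaluation code.

```python
import numpy as np

def occupancy_iou(pred, gt, free_label=0):
    """Class-agnostic IoU: every non-free voxel counts as occupied.

    free_label=0 is an assumption of this sketch.
    """
    pred_occ = pred != free_label
    gt_occ = gt != free_label
    inter = np.logical_and(pred_occ, gt_occ).sum()
    union = np.logical_or(pred_occ, gt_occ).sum()
    return float(inter / union) if union > 0 else float("nan")

def semantic_miou(pred, gt, num_classes, free_label=0):
    """Mean of per-class IoU over all non-free classes.

    Classes that appear in neither pred nor gt are skipped (an assumption).
    """
    ious = []
    for c in range(num_classes):
        if c == free_label:
            continue
        p, g = pred == c, gt == c
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious)) if ious else float("nan")

# Toy example: six voxels, labels 0 (free), 1 and 2 (semantic classes).
pred = np.array([0, 1, 1, 2, 0, 2])
gt = np.array([0, 1, 2, 2, 1, 0])
print(occupancy_iou(pred, gt))     # 3 shared occupied voxels out of 5 in the union
print(semantic_miou(pred, gt, 3))  # both classes score 1/3
```

In this toy case the geometric IoU is higher than the mIoU because a voxel can be correctly marked as occupied while still receiving the wrong semantic label.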

Paper Structure

This paper contains 18 sections, 13 equations, 5 figures, 9 tables.

Figures (5)

  • Figure 1: Accuracy vs Efficiency (Params/FLOPs) on nuScenes-Occupancy validation set. MR-Occ achieves state-of-the-art performance with less computational cost than previous methods.
  • Figure 2: Overall architecture of MR-Occ: Camera and LiDAR features are extracted from modality-specific backbone networks. The PVF-Net densifies LiDAR features and adaptively fuses them with image features using a deformable cross-attention mechanism. The HVFR module uses Resolution Importance Estimator (RIE) to identify core voxels, and then enhances the fused features through Multi-Resolution Feature Refinement using these core voxels. Finally, Multi-scale Occupancy Decoder (MOD) predicts an 'occluded' class for occluded areas and performs fine-grained occupancy prediction.
  • Figure 3: Multi-Resolution Feature Refinement module. The subdivided core voxels combine features sampled from LiDAR features and camera features at the same resolution to capture fine-grained details. The multi-resolution features are then fused using 3D sparse convolution.
  • Figure 4: Qualitative results comparing MR-Occ and M-CONet predictions: The red boxes highlight areas where MR-Occ shows improved accuracy in detecting objects, particularly in object boundary regions, occlusion scenarios and small objects.
  • Figure 5: Qualitative comparison results on SemanticKITTI validation set. The regions highlighted by orange circles indicate areas with obvious differences.
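
The captions for Figures 2 and 3 describe identifying core voxels and subdividing them for finer refinement. A minimal NumPy sketch of that idea follows, under several assumptions: importance scores are already available as a dense grid, core voxels are the top-k by score, and each core voxel splits into 2×2×2 children. In the paper the scores come from a learned Resolution Importance Estimator and the selection rule may differ; `select_core_voxels` and `subdivide` are hypothetical names used only here.

```python
import numpy as np

def select_core_voxels(scores, k):
    """Return (k, 3) integer coordinates of the k highest-scoring voxels.

    Top-k selection is an assumption of this sketch, not the paper's rule.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    flat = scores.ravel()
    k = min(k, flat.size)
    top = np.argpartition(flat, -k)[-k:]     # indices of the k largest, unordered
    top = top[np.argsort(flat[top])[::-1]]   # order them by descending score
    return np.stack(np.unravel_index(top, scores.shape), axis=1)

def subdivide(coords, factor=2):
    """Map each coarse voxel to its factor**3 children on the finer grid."""
    axes = [np.arange(factor)] * 3
    offsets = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, 3)
    children = coords[:, None, :] * factor + offsets[None, :, :]
    return children.reshape(-1, 3)

# Toy 2x2x2 score grid with two clearly important voxels.
scores = np.zeros((2, 2, 2))
scores[1, 0, 1] = 0.9
scores[0, 1, 0] = 0.5

core = select_core_voxels(scores, k=2)
children = subdivide(core)
print(core)            # coarse coordinates, highest score first
print(children.shape)  # 2 core voxels x 8 children each
```

Only the selected voxels are expanded, so the finer grid is populated sparsely rather than densely, which is the source of the compute savings the captions allude to.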