Co-Occ: Coupling Explicit Feature Fusion with Volume Rendering Regularization for Multi-Modal 3D Semantic Occupancy Prediction

Jingyi Pan, Zipeng Wang, Lin Wang

TL;DR

Co-Occ tackles multi-modal 3D semantic occupancy prediction by tightly coupling explicit LiDAR-camera feature fusion with implicit volume rendering regularization. The Geometric- and Semantic-aware Fusion (GSFusion) leverages a KNN-based neighborhood and a learnable gate to fuse camera semantics into sparse LiDAR features within a unified voxel space, producing enhanced fused representations. An auxiliary volume rendering pathway supervises color and depth in the feature space during training, bridging 3D LiDAR sweeps and 2D images and regularizing the fused features without impacting inference. Across nuScenes and SemanticKITTI, Co-Occ achieves state-of-the-art results, validating the effectiveness of combining explicit cross-modal fusion with NeRF-inspired regularization for dense, accurate 3D semantic occupancy predictions.

Abstract

3D semantic occupancy prediction is a pivotal task in autonomous driving. Recent approaches have made great advances in 3D semantic occupancy prediction from a single modality. However, multi-modal approaches struggle with the modality heterogeneity, modality misalignment, and insufficient modality interaction that arise when fusing data from different modalities, which may result in the loss of important geometric and semantic information. This letter presents a novel multi-modal (LiDAR-camera) 3D semantic occupancy prediction framework, dubbed Co-Occ, which couples explicit LiDAR-camera feature fusion with implicit volume rendering regularization. The key insight is that volume rendering in the feature space can bridge the gap between 3D LiDAR sweeps and 2D images while serving as a physical regularization that enhances the LiDAR-camera fused volumetric representation. Specifically, we first propose a Geometric- and Semantic-aware Fusion (GSFusion) module that explicitly enhances LiDAR features by incorporating neighboring camera features found through a K-nearest neighbors (KNN) search. Then, we employ volume rendering to project the fused features back to the image planes and reconstruct color and depth maps. These maps are supervised by the input camera images and by depth estimates derived from LiDAR, respectively. Extensive experiments on the popular nuScenes and SemanticKITTI benchmarks verify the effectiveness of Co-Occ for 3D semantic occupancy prediction. The project page is available at https://rorisis.github.io/Co-Occ_project-page/.
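
To make the GSFusion idea concrete, here is a minimal sketch in plain Python of a KNN search followed by a gated fusion. It is not the authors' implementation. The mean pooling over neighbors, the scalar sigmoid gate, the residual update, and the names `gated_knn_fusion`, `gate_w`, and `gate_b` are all assumptions made for illustration; the source only states that K nearest camera features are gathered, a KNN gate produces weights that boost the LiDAR features, and the features are then concatenated.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def k_nearest(query_xyz, cam_xyz, k):
    """Indices of the k camera voxels closest to query_xyz (Euclidean)."""
    order = sorted(range(len(cam_xyz)),
                   key=lambda i: math.dist(query_xyz, cam_xyz[i]))
    return order[:k]


def gated_knn_fusion(lidar_xyz, lidar_feat, cam_xyz, cam_feat,
                     gate_w, gate_b, k=2):
    """Fuse camera features into LiDAR features, one LiDAR voxel at a time.

    Assumes LiDAR and camera features share the same channel count.
    gate_w / gate_b are hypothetical learnable gate parameters.
    """
    dim = len(cam_feat[0])
    fused = []
    for xyz, f_l in zip(lidar_xyz, lidar_feat):
        idx = k_nearest(xyz, cam_xyz, k)
        # Assumption: aggregate the K neighbors by mean pooling.
        f_c = [sum(cam_feat[i][d] for i in idx) / len(idx) for d in range(dim)]
        # Assumption: a scalar gate computed from both modalities.
        z = sum(w * v for w, v in zip(gate_w, f_l + f_c)) + gate_b
        g = sigmoid(z)
        # Boost the LiDAR feature with gated camera semantics.
        enhanced = [l + g * c for l, c in zip(f_l, f_c)]
        # Final step: concatenate the enhanced LiDAR and camera features.
        fused.append(enhanced + f_c)
    return fused
```

With one LiDAR voxel at the origin and three camera voxels, `k=2` picks the two closest camera voxels and ignores the distant one, so a far-away camera feature cannot leak into the fused result. A real module would operate on batched tensors and learn the gate end to end.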

Paper Structure (17 sections, 12 equations, 10 figures, 7 tables)

Figures (10)

  • Figure 1: The pipeline of our Co-Occ. Our method uses the GSFusion module to obtain explicit fused features that retain both the semantic benefits of the cameras and the geometric benefits of LiDAR. Then, implicit volume rendering-based regularization is applied to bridge the gap between 3D LiDAR and 2D images and to enhance the fused representation.
  • Figure 2: Our Co-Occ framework. It consists of an explicit GSFusion module and implicit volume rendering regularization. The GSFusion module (Fig. 3) takes advantage of both the semantic benefits derived from camera features and the geometric benefits obtained from LiDAR. Meanwhile, the implicit volume rendering regularization (Fig. 4) guarantees the fusion of explicit LiDAR-camera features in an accurate and detailed manner, which further enhances the performance of 3D semantic prediction. Notably, implicit volume rendering regularization is only utilized during the training process.
  • Figure 3: The workflow of the GSFusion module begins with searching for $K$ nearest neighbors from camera features to supplement the semantic context of LiDAR features. A KNN gate is then used to obtain weights to boost the LiDAR features. The final step involves concatenating the features.
  • Figure 4: The implicit volume rendering-based regularization involves obtaining frustum features from rays and explicit features. Frustum features are used to create the density grid and color grid, which are then utilized to generate the depth map and color map.
  • Figure 5: Qualitative comparison results on the nuScenes validation set. The leftmost column shows the input surrounding images and LiDAR sweeps; the following three columns visualize the 3D semantic occupancy predictions from OccFormer zhang2023occformer, SurroundOcc wei2023surroundocc (these two predict results using only cameras), M-CONet wang2023openoccupancy, our Co-Occ, and the annotation from wei2023surroundocc. Better viewed when zoomed in.
  • ...and 5 more figures
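
The rendering step described for Figure 4 (a density grid and a color grid turned into a depth map and a color map) can be illustrated with the standard NeRF-style quadrature along a single ray. The sketch below is an assumption about how such compositing is commonly done, not the paper's exact formulation: each sample gets opacity 1 - exp(-sigma * delta), and color and depth are accumulated with transmittance-weighted sums. The function name `render_ray` and its argument layout are hypothetical.

```python
import math


def render_ray(sigmas, colors, depths, deltas):
    """Composite samples along one ray, front to back.

    sigmas: non-negative density per sample
    colors: (r, g, b) per sample
    depths: distance of each sample from the camera
    deltas: length of the ray segment each sample represents
    Returns (rgb, depth, weights).
    """
    transmittance = 1.0  # fraction of light not yet absorbed
    rgb = [0.0, 0.0, 0.0]
    depth = 0.0
    weights = []
    for sigma, color, t, delta in zip(sigmas, colors, depths, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        w = transmittance * alpha               # contribution of this sample
        weights.append(w)
        rgb = [acc + w * c for acc, c in zip(rgb, color)]
        depth += w * t
        transmittance *= 1.0 - alpha
    return rgb, depth, weights
```

During training, the rendered color would be compared with the camera image and the rendered depth with LiDAR-derived depth, as the abstract describes; the exact loss terms are not given in this excerpt, so none are shown here.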