LiCROcc: Teach Radar for Accurate Semantic Occupancy Prediction using LiDAR and Camera

Yukai Ma, Jianbiao Mei, Xuemeng Yang, Licheng Wen, Weihua Xu, Jiangning Zhang, Botian Shi, Yong Liu, Xingxing Zuo

TL;DR

The paper tackles semantic scene completion (SSC) under adverse weather by leveraging radar. It proposes LiCROcc, a BEV-based fusion framework with a radar-based student and a LiDAR-camera teacher. It introduces fusion-based cross-modal knowledge distillation (KD) comprising Cross-Model Residual Distillation (CMRD), BEV Relation Distillation (BRD), and Predictive Distribution Distillation (PDD) to transfer rich semantic and geometric cues from LiDAR-camera fusion to radar. The approach yields substantial gains on nuScenes-Occupancy, with radar-only and radar-camera variants approaching LiDAR-camera performance and demonstrating robustness to weather and lighting. This work advances practical, weather-resilient SSC for autonomous driving and establishes radar-centered benchmarks and KD strategies for multi-modal 3D perception.

Abstract

Semantic Scene Completion (SSC) is pivotal in autonomous driving perception, frequently confronted with the complexities of weather and illumination changes. The long-term strategy involves fusing multi-modal information to bolster the system's robustness. Radar, increasingly utilized for 3D target detection, is gradually replacing LiDAR in autonomous driving applications, offering a robust sensing alternative. In this paper, we focus on the potential of 3D radar in semantic scene completion, pioneering cross-modal refinement techniques for improved robustness against weather and illumination changes, and enhancing SSC performance. Regarding model architecture, we propose a three-stage tight fusion approach on BEV to realize a fusion framework for point clouds and images. Based on this foundation, we designed three cross-modal distillation modules: CMRD, BRD, and PDD. Our approach enhances the performance in both radar-only (R-LiCROcc) and radar-camera (RC-LiCROcc) settings by distilling to them the rich semantic and structural information of the fused features of LiDAR and camera. Finally, our LC-Fusion (teacher model), R-LiCROcc, and RC-LiCROcc achieve the best performance on the nuScenes-Occupancy dataset, with mIoU exceeding the baseline by 22.9%, 44.1%, and 15.5%, respectively. The project page is available at https://hr-zju.github.io/LiCROcc/.

Paper Structure (23 sections, 7 equations, 4 figures, 5 tables)

Figures (4)

  • Figure 1: Our motivation for proposing LiCROcc. We explore radar performance on SSC tasks with the expectation of developing a network with balanced performance and robustness. We further improve radar sensors' semantic occupancy prediction and distance measurement capabilities by cross-modal distillation while maintaining their inherent night vision capabilities and weather robustness. In the figure on the right, the blue line represents the capability of the teacher model, the dashed line represents the capability of the pre-distillation student in our two settings, and the solid line in the same color represents the equilibrium performance of the model after KD. The graphs on the bottom demonstrate the enhancement of our method after gradually adding modules and the comparison with other methods on the nuScenes-Occupancy [wang2023openoccupancy] validation set, including Co-Occ [pan2024co], RC-CoNet [wang2023openoccupancy], and PointOcc [zuo2023pointocc].
  • Figure 2: Overall framework of our LiCROcc. We designed a base framework for point cloud and image fusion, where the models serve as the teacher (LC-Fusion Network) and student (RC-Fusion Network), respectively. We unified both fusion processes under BEV space to reduce computational cost. Additionally, we designed three novel distillation losses (CMRD, BRD, and PDD) to achieve effective cross-modal KD. Inference is performed using only the RC-Fusion Network, where camera input is optional. For the detailed structure of CMRD and BRD, please refer to Fig. 3.
  • Figure 3: Detailed illustration of the use of CMRD and BRD on BEV features. The feature with the blue background is a BEV feature duplicated from the teacher model. The details of the loss computation are illustrated in the dashed box. The $\oplus$ in the figure denotes the summation of features.
  • Figure 4: Visual comparison of our LiCROcc with baseline methods on the OpenOccupancy benchmark. Our methods stand out by offering a more comprehensive representation of the scene and more accurate segmentation boundaries compared to Co-Occ and PointOcc. Surprisingly, R-LiCROcc can segment feasible regions and obstacles even with only two thousand radar points as input.
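
The captions above describe a teacher (LC-Fusion) whose outputs guide a student (RC-Fusion) through distillation losses, but this page does not give the loss formulas. As a rough illustration of what distilling a predictive distribution can look like in general, the sketch below computes a temperature-scaled KL divergence between teacher and student class distributions per voxel. This is the generic knowledge-distillation loss, not the paper's PDD definition; the function names, the tensor shape (num_voxels, num_classes), the class count, and the temperature parameter are all assumptions made for this example.

```python
import numpy as np

def softmax(logits, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def kl_distillation_loss(teacher_logits, student_logits, temperature=1.0, eps=1e-12):
    """Mean KL(teacher || student) over voxels.

    Both inputs are assumed to have shape (num_voxels, num_classes).
    This is a generic KD loss, not the formula used in the paper.
    """
    p = softmax(teacher_logits / temperature)  # teacher distribution
    q = softmax(student_logits / temperature)  # student distribution
    kl_per_voxel = (p * (np.log(p + eps) - np.log(q + eps))).sum(axis=-1)
    return float(kl_per_voxel.mean())

# Toy inputs: 4 voxels, 5 classes (arbitrary sizes for illustration).
rng = np.random.default_rng(0)
teacher = rng.normal(size=(4, 5))
student = rng.normal(size=(4, 5))

loss_mismatch = kl_distillation_loss(teacher, student)  # student differs from teacher
loss_match = kl_distillation_loss(teacher, teacher)     # identical predictions
```

In this sketch the loss is zero when the student reproduces the teacher's distribution and grows as the two diverge, which is the property a distribution-level distillation term relies on during training.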