4DRC-OCC: Robust Semantic Occupancy Prediction Through Fusion of 4D Radar and Camera

David Ninfa, Andras Palffy, Holger Caesar

TL;DR

This work presents the first study combining 4D radar and camera data for 3D semantic occupancy prediction, and shows that integrating depth cues from camera pixels enables lifting 2D images to 3D, improving scene reconstruction accuracy.

Abstract

Autonomous driving requires robust perception across diverse environmental conditions, yet 3D semantic occupancy prediction remains challenging under adverse weather and lighting. In this work, we present the first study combining 4D radar and camera data for 3D semantic occupancy prediction. Our fusion leverages the complementary strengths of both modalities: 4D radar provides reliable range, velocity, and angle measurements in challenging conditions, while cameras contribute rich semantic and texture information. We further show that integrating depth cues from camera pixels enables lifting 2D images to 3D, improving scene reconstruction accuracy. Additionally, we introduce a fully automatically labeled dataset for training semantic occupancy models, substantially reducing reliance on costly manual annotation. Experiments demonstrate the robustness of 4D radar across diverse scenarios, highlighting its potential to advance autonomous vehicle perception.
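The idea of lifting 2D image content to 3D from per-pixel depth can be illustrated with a generic pinhole back-projection. This is only a minimal sketch under stated assumptions, not the lifting mechanism used in 4DRC-OCC: the function name `lift_pixels_to_3d`, the dense depth map, and the intrinsic matrix `K` are hypothetical and chosen for illustration.

```python
import numpy as np

def lift_pixels_to_3d(depth, K):
    """Back-project a dense depth map (H, W) to camera-frame points (H*W, 3).

    Assumes an ideal pinhole camera with intrinsics K (3x3) and depth
    measured along the optical (z) axis. Both are illustrative assumptions.
    """
    h, w = depth.shape
    # u indexes columns, v indexes rows; both grids have shape (H, W).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    # Each row becomes K^-1 @ [u, v, 1], a ray with unit z component.
    rays = pix @ np.linalg.inv(K).T
    # Scaling the unit-z ray by depth gives the 3D point.
    return rays * depth.reshape(-1, 1)

# Hypothetical intrinsics: focal length 100 px, principal point (2, 1).
K = np.array([[100.0, 0.0, 2.0],
              [0.0, 100.0, 1.0],
              [0.0, 0.0, 1.0]])
depth = np.full((2, 4), 5.0)  # constant 5 m depth, purely for illustration
points = lift_pixels_to_3d(depth, K)
```

With these example values, the pixel at the principal point maps to a point straight ahead of the camera at the given depth, and pixels further from the principal point map to points further off-axis.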

Paper Structure (20 sections, 1 equation, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Adverse lighting conditions significantly impact the camera-only baseline; however, our network still detects the cyclist through radar-camera fusion.
  • Figure 2: The architecture of 4DRC-OCC: The main network, Version A, processes radar and camera separately in a multi-scale manner, reaching a higher level of abstraction before merging them in voxel space to produce the final prediction. Versions B and C build upon Version A by incorporating additional radar depth cues at different stages, enhancing the camera image and further assisting the lifting mechanism. This allows for a more refined fusion of radar and camera information, improving performance.
  • Figure 3: The process of auto-labeling dense occupancy pseudo-ground truth involves capturing dense lidar data (1), performing semantic segmentation (2), and separately extracting and accumulating dynamic (3) and static objects (4). Both point clouds are transformed into the world coordinate system (5) before being voxelized and refined to create the final occupancy labels (6).
  • Figure 4: 4DRC-OCC improves prediction accuracy in poor lighting conditions by integrating data from 4D radar and cameras.
  • Figure 5: Depth association strategies enhance predictions by improving spatial accuracy and semantic understanding in complex environments.
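The last two steps of the auto-labeling process in Figure 3 (transforming points into the world coordinate system, then voxelizing them into occupancy labels) can be sketched as follows. This is a simplified illustration under assumptions, not the authors' pipeline: the helper names `to_world` and `voxelize_labels`, the per-voxel majority vote, and the convention that class 0 means "free" are all hypothetical, and the refinement step mentioned in the caption is omitted.

```python
import numpy as np

def to_world(points, pose):
    """Apply a 4x4 rigid sensor-to-world pose to (N, 3) points."""
    return points @ pose[:3, :3].T + pose[:3, 3]

def voxelize_labels(points, labels, origin, voxel_size, grid_shape, num_classes):
    """Majority-vote semantic occupancy grid from labeled points.

    Assumption: labels are integers in [1, num_classes), and 0 marks
    voxels that contain no points ("free").
    """
    grid_shape = tuple(int(s) for s in grid_shape)
    idx = np.floor((points - np.asarray(origin)) / voxel_size).astype(np.int64)
    # Discard points that fall outside the grid.
    inside = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=1)
    idx, labels = idx[inside], labels[inside]
    flat = np.ravel_multi_index(idx.T, grid_shape)
    # Count how many points of each class land in each voxel.
    votes = np.zeros((int(np.prod(grid_shape)), num_classes), dtype=np.int64)
    np.add.at(votes, (flat, labels), 1)
    occ = np.zeros(votes.shape[0], dtype=np.int64)
    hit = votes.sum(axis=1) > 0
    occ[hit] = votes[hit].argmax(axis=1)  # ties resolve to the lowest class id
    return occ.reshape(grid_shape)

# Toy example: five labeled points in the sensor frame, shifted 1 m along x.
pose = np.eye(4)
pose[0, 3] = 1.0
pts_sensor = np.array([[-0.9, 0.1, 0.1], [-0.8, 0.1, 0.1], [-0.7, 0.2, 0.1],
                       [0.5, 0.5, 0.5], [8.0, 9.0, 9.0]])
lbl = np.array([2, 2, 1, 3, 1])
pts_world = to_world(pts_sensor, pose)
occ = voxelize_labels(pts_world, lbl, origin=np.zeros(3), voxel_size=1.0,
                      grid_shape=(2, 1, 1), num_classes=4)
```

A majority vote is only one plausible way to resolve several labels landing in a single voxel; accumulating dynamic and static points separately, as the caption describes, would happen before this step.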