Unleashing HyDRa: Hybrid Fusion, Depth Consistency and Radar for Unified 3D Perception

Philipp Wolters; Johannes Gilg; Torben Teepe; Fabian Herzog; Anouar Laouichi; Martin Hofmann; Gerhard Rigoll

TL;DR

HyDRa addresses the need for robust, low-cost 3D perception by fusing camera and radar data in two spaces: perspective view and BEV. It introduces the Height Association Transformer (HAT) to propagate radar cues into dense depth in the image frustum, and Radar-weighted Depth Consistency (RDC) to refine BEV features via radar-guided back-projection and depth alignment, formalized with depth-consistency weights such as $w_c = d_P \cdot d_Q$. The approach yields state-of-the-art camera-radar fusion on nuScenes (e.g., 64.2 NDS on test) and improves semantic occupancy predictions on Occ3D by about 3.7 mIoU over camera-only methods, with strong ablations supporting the value of early, hybrid fusion. These results demonstrate that synergistic camera-radar fusion can deliver robust depth and velocity estimation, improving detection, tracking, and scene understanding in challenging lighting and weather conditions, and offering practical impact for safer autonomous driving.
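
The depth-consistency weight $w_c = d_P \cdot d_Q$ can be illustrated with a small sketch. The summary above gives only the symbols, so everything else here is an assumption: $d_P$ and $d_Q$ are treated as categorical depth distributions over the same depth bins, the product is read as a dot product, and the function name `depth_consistency_weight` is hypothetical rather than taken from the HyDRa code base.

```python
import numpy as np


def depth_consistency_weight(d_p, d_q):
    """Dot product of two depth distributions defined over the same bins.

    Assumption: d_p and d_q are 1-D arrays of per-bin probabilities.
    The weight is large only when both place mass on the same bins.
    """
    d_p = np.asarray(d_p, dtype=float)
    d_q = np.asarray(d_q, dtype=float)
    if d_p.shape != d_q.shape:
        raise ValueError("both distributions must use the same depth bins")
    return float(np.dot(d_p, d_q))


# A prediction peaked at the middle bin.
d_p = [0.1, 0.8, 0.1]

# Second distribution agrees with the peak: high weight.
w_agree = depth_consistency_weight(d_p, [0.0, 1.0, 0.0])

# Second distribution sits in a different bin: low weight.
w_disagree = depth_consistency_weight(d_p, [1.0, 0.0, 0.0])
```

Under this reading, the weight behaves like a soft agreement score: a feature sampled at a location where the two depth estimates disagree contributes little to the refined BEV representation.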

Abstract

Low-cost, vision-centric 3D perception systems for autonomous driving have made significant progress in recent years, narrowing the gap to expensive LiDAR-based methods. The primary challenge in becoming a fully reliable alternative lies in robust depth prediction capabilities, as camera-based systems struggle with long detection ranges and adverse lighting and weather conditions. In this work, we introduce HyDRa, a novel camera-radar fusion architecture for diverse 3D perception tasks. Building upon the principles of dense BEV (Bird's Eye View)-based architectures, HyDRa introduces a hybrid fusion approach to combine the strengths of complementary camera and radar features in two distinct representation spaces. Our Height Association Transformer module leverages radar features already in the perspective view to produce more robust and accurate depth predictions. In the BEV, we refine the initial sparse representation by a Radar-weighted Depth Consistency. HyDRa achieves a new state-of-the-art for camera-radar fusion of 64.2 NDS (+1.8) and 58.4 AMOTA (+1.5) on the public nuScenes dataset. Moreover, our new semantically rich and spatially accurate BEV features can be directly converted into a powerful occupancy representation, beating all previous camera-based methods on the Occ3D benchmark by an impressive 3.7 mIoU. Code and models are available at https://github.com/phi-wol/hydra.

Paper Structure (18 sections, 3 equations, 7 figures, 9 tables)

Introduction
Related Work
Camera-based Architectures
Multi-modal Architectures
HyDRa Architecture
Vision-Centric Foundation
Height Association Transformer
Radar-Weighted Depth Consistency
Down-stream Tasks
Experiments
Dataset and Metrics
Implementation Details
Main Results
Ablation Studies
Conclusion
...and 3 more sections

Figures (7)

Figure 1: Bridging the view disparity in BEVFusion liang2022bevfusion, CRN kim2023crn, and our HyDRa. We leverage multi-modal feature fusion already for depth splatting.
Figure 2: Architecture of HyDRa: The modality-specific features are fused in two representation spaces: Perspective View and BEV-Space. 1. The radar features are associated with the image features by the Height Association Transformer. With the resulting radar-informed dense depth, the forward projection module generates a sparse BEV representation. 2. The splatted semantic BEV features and radar-BEV features are concatenated and fused. 3. A depth-aware backward projection refines this representation, guided by radar attention weights before being distributed to task-specific heads.
Figure 3: Overview of the Height Association Transformer. The radar fusion module exerts a pushing effect into the BEV.
Figure 4: Details of the Radar-weighted Backward Projection. The radar features pull RGB information for refinement.
Figure 5: Qualitative comparison of the semantic occupancy prediction in a challenging night scenario. The top row shows the front-view input cameras. We compare FB-OCC li2023fbocc with our proposed HyDRa. While the baseline struggles to distinguish different objects at a distance, HyDRa shows spatially consistent and robust detections of the cars (orange), truck (red), and pedestrian (blue).
...and 2 more figures
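
The forward projection step named in the Figure 2 caption (dense depth is used to splat image features into a sparse BEV representation) can be sketched as a generic lift-splat operation. This is a minimal illustration, not HyDRa's implementation: the function `splat_to_bev`, the precomputed `cell_index` lookup, and the flat BEV grid are all hypothetical simplifications, and camera geometry is assumed to be resolved beforehand.

```python
import numpy as np


def splat_to_bev(feats, depth_probs, cell_index, num_cells):
    """Lift per-pixel features along depth bins and sum them into BEV cells.

    feats:       (N, C) image features, one row per pixel
    depth_probs: (N, D) per-pixel depth distribution over D bins
    cell_index:  (N, D) flat BEV cell hit by each pixel/bin, -1 if off-grid
                 (assumed to be precomputed from camera calibration)
    returns:     (num_cells, C) BEV feature map
    """
    feats = np.asarray(feats, dtype=float)
    depth_probs = np.asarray(depth_probs, dtype=float)
    cell_index = np.asarray(cell_index)

    # Lift: weight each pixel feature by its probability in every depth bin.
    frustum = depth_probs[:, :, None] * feats[:, None, :]  # (N, D, C)

    # Splat: accumulate all on-grid frustum points into their BEV cells.
    bev = np.zeros((num_cells, feats.shape[1]))
    valid = cell_index >= 0
    np.add.at(bev, cell_index[valid], frustum[valid])
    return bev


feats = [[1.0, 2.0], [3.0, 4.0]]          # two pixels, two channels
depth_probs = [[0.5, 0.5], [1.0, 0.0]]    # two depth bins per pixel
cell_index = [[0, 1], [1, -1]]            # last pixel/bin falls off the grid
bev = splat_to_bev(feats, depth_probs, cell_index, num_cells=3)
```

In this toy case the third BEV cell receives no contribution and stays zero, which is the kind of sparsity the caption says the later backward projection is meant to refine.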
