RaGS: Unleashing 3D Gaussian Splatting from 4D Radar and Monocular Cues for 3D Object Detection

Xiaokai Bai, Chenxu Zhou, Lianqing Zheng, Si-Yuan Cao, Jianan Liu, Xiaohan Zhang, Yiming Li, Zhengzhuang Zhang, Hui-liang Shen

TL;DR

RaGS tackles the challenge of fusing 4D radar and monocular cues for 3D object detection by modeling the scene as a continuous field of 3D Gaussians. Through a cascaded pipeline—Frustum-based Localization Initiation (FLI), Iterative Multimodal Aggregation (IMA), and Multi-level Gaussian Fusion (MGF)—RaGS initializes, refines, and renders Gaussians into hierarchical BEV features, guided by radar velocity to emphasize foreground objects. It explicitly stores physical attributes and learned embeddings for each Gaussian, enabling 3D deformable cross-attention with image semantics and radar geometry, followed by sparse convolution-based fusion and BEV rendering. Experiments on VoD, TJ4DRadSet, and OmniHD-Scenes demonstrate state-of-the-art performance, robustness to perturbations and weather, and favorable efficiency, suggesting the approach offers a scalable and interpretable path toward real-world multi-modal 3D perception.

Abstract

4D millimeter-wave radar is a promising sensing modality for autonomous driving, yet effective 3D object detection from 4D radar and monocular images remains challenging. Existing fusion approaches either rely on instance proposals lacking global context or dense BEV grids constrained by rigid structures, lacking a flexible and adaptive representation for diverse scenes. To address this, we propose RaGS, the first framework that leverages 3D Gaussian Splatting (GS) to fuse 4D radar and monocular cues for 3D object detection. 3D GS models the scene as a continuous field of Gaussians, enabling dynamic resource allocation to foreground objects while maintaining flexibility and efficiency. Moreover, the velocity dimension of 4D radar provides motion cues that help anchor and refine the spatial distribution of Gaussians. Specifically, RaGS adopts a cascaded pipeline to construct and progressively refine the Gaussian field. It begins with Frustum-based Localization Initiation (FLI), which unprojects foreground pixels to initialize coarse Gaussian centers. Then, Iterative Multimodal Aggregation (IMA) explicitly exploits image semantics and implicitly integrates 4D radar velocity geometry to refine the Gaussians within regions of interest. Finally, Multi-level Gaussian Fusion (MGF) renders the Gaussian field into hierarchical BEV features for 3D object detection. By dynamically focusing on sparse and informative regions, RaGS achieves object-centric precision and comprehensive scene perception. Extensive experiments on View-of-Delft, TJ4DRadSet, and OmniHD-Scenes demonstrate its robustness and SOTA performance. Code will be released.
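The abstract says that FLI unprojects foreground pixels to initialize coarse Gaussian centers. The following is a minimal sketch of what such an unprojection could look like, not the authors' implementation. It assumes a pinhole camera model, a per-pixel depth estimate, and a per-pixel foreground probability map. The function name `unproject_foreground`, the `fg_threshold` parameter, and the input layout are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]


def unproject_foreground(
    depth: Sequence[Sequence[float]],
    fg_prob: Sequence[Sequence[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
    fg_threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> List[Point3D]:
    """Lift likely-foreground pixels to 3D points in the camera frame.

    Uses the standard pinhole relations
        x = (u - cx) * d / fx,  y = (v - cy) * d / fy,  z = d
    where (u, v) is the pixel column/row and d is its depth.
    """
    centers: List[Point3D] = []
    for v, (depth_row, prob_row) in enumerate(zip(depth, fg_prob)):
        for u, (d, p) in enumerate(zip(depth_row, prob_row)):
            # Skip background pixels and pixels without a valid depth.
            if p < fg_threshold or d <= 0.0:
                continue
            x = (u - cx) * d / fx
            y = (v - cy) * d / fy
            centers.append((x, y, d))
    return centers


if __name__ == "__main__":
    depth = [[10.0, 10.0], [10.0, 0.0]]
    fg_prob = [[0.9, 0.1], [0.8, 0.9]]
    print(unproject_foreground(depth, fg_prob, fx=2.0, fy=2.0, cx=0.0, cy=0.0))
```

Each returned point would serve as a candidate Gaussian center; how depth and foreground scores are actually obtained in RaGS is not specified in this summary.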

Paper Structure

This paper contains 15 sections, 13 equations, 6 figures, 12 tables.

Figures (6)

  • Figure 1: 4D radar and camera fusion pipelines. (a) Instance-based fusion relies on 2D detection, limiting scene understanding. (b) BEV-based fusion uses predefined grids, causing inefficiencies in background modeling and fixed anchor sampling. (c) Our Gaussian-based fusion attends adaptively to sparse objects while preserving scene perception.
  • Figure 2: Pipeline of RaGS. RaGS consists of a Feature Extractor & Head, Frustum-based Localization Initiation (FLI), Iterative Multimodal Aggregation (IMA), and Multi-level Gaussian Fusion (MGF). The positions of the Gaussians are initialized using the FLI module, along with learnable attributes such as rotation, scale, opacity, and implicit feature embeddings. These Gaussians are then passed into the IMA module, where they are projected onto the image plane to gather semantic information. Next, they are processed as voxels using sparse convolution with height-extended radar geometry, which implicitly utilizes radar velocity to guide the movement of the residuals. Residuals relative to regions of interest are computed iteratively, updating the positions towards sparse objects. Finally, the multi-level Gaussians are rendered into Bird’s Eye View (BEV) features and fused through MGF, followed by cross-modal fusion for 3D object detection.
  • Figure 3: Procedure of Iterative Multimodal Aggregation (IMA). IMA involves the iterative aggregation of multi-modal features, followed by the updating of Gaussian locations within the frustum.
  • Figure 4: Dynamic Object Attention of RaGS. We visualize activated Gaussians (approximately 30% of total) in the scene. RaGS focuses on sparse foreground objects while maintaining scene understanding.
  • Figure 5: Visualization results on the VoD validation set (first row) and TJ4DRadSet test set (second row). Each figure corresponds to a frame. Orange and yellow boxes represent ground truths in the perspective and bird’s-eye views, respectively. Green and blue boxes indicate predicted results. Zoom in for better view.
  • ...and 1 more figures
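
The Figure 2 caption states that Gaussians are rendered into BEV features. The sketch below shows one simple way such a rendering could be written and is only an illustration under strong assumptions: each Gaussian is isotropic in the ground plane (rotation and anisotropic scale are ignored), every Gaussian contributes to every cell, and contributions are summed without any occlusion handling. The `Gaussian` dataclass and `splat_to_bev` function are hypothetical names, not part of the RaGS code.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Gaussian:
    x: float              # center along the BEV x axis (metres)
    y: float              # center along the BEV y axis (metres)
    scale: float          # isotropic standard deviation (assumption)
    opacity: float        # weight in [0, 1]
    feature: List[float]  # learned embedding carried by the Gaussian


def splat_to_bev(
    gaussians: List[Gaussian],
    x_range: Tuple[float, float],
    y_range: Tuple[float, float],
    cell_size: float,
    feat_dim: int,
) -> List[List[List[float]]]:
    """Accumulate Gaussian features on a BEV grid indexed as bev[iy][ix][c]."""
    nx = int(round((x_range[1] - x_range[0]) / cell_size))
    ny = int(round((y_range[1] - y_range[0]) / cell_size))
    bev = [[[0.0] * feat_dim for _ in range(nx)] for _ in range(ny)]
    for g in gaussians:
        for iy in range(ny):
            cell_y = y_range[0] + (iy + 0.5) * cell_size  # cell center
            for ix in range(nx):
                cell_x = x_range[0] + (ix + 0.5) * cell_size
                d2 = (cell_x - g.x) ** 2 + (cell_y - g.y) ** 2
                # Opacity-weighted Gaussian falloff from the center.
                w = g.opacity * math.exp(-0.5 * d2 / (g.scale ** 2))
                for c in range(feat_dim):
                    bev[iy][ix][c] += w * g.feature[c]
    return bev


if __name__ == "__main__":
    g = Gaussian(x=0.5, y=0.5, scale=1.0, opacity=1.0, feature=[2.0])
    grid = splat_to_bev([g], (0.0, 2.0), (0.0, 2.0), cell_size=1.0, feat_dim=1)
    print(grid)
```

A practical renderer would restrict each Gaussian to nearby cells and run on the GPU; the dense double loop here is kept only for readability.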