RaGS: Unleashing 3D Gaussian Splatting from 4D Radar and Monocular Cues for 3D Object Detection
Xiaokai Bai, Chenxu Zhou, Lianqing Zheng, Si-Yuan Cao, Jianan Liu, Xiaohan Zhang, Yiming Li, Zhengzhuang Zhang, Hui-liang Shen
TL;DR
RaGS tackles the challenge of fusing 4D radar and monocular cues for 3D object detection by modeling the scene as a continuous field of 3D Gaussians. Through a cascaded pipeline—Frustum-based Localization Initiation (FLI), Iterative Multimodal Aggregation (IMA), and Multi-level Gaussian Fusion (MGF)—RaGS initializes, refines, and renders Gaussians into hierarchical BEV features, guided by radar velocity to emphasize foreground objects. It explicitly stores physical attributes and learned embeddings for each Gaussian, enabling 3D deformable cross-attention with image semantics and radar geometry, followed by sparse convolution-based fusion and BEV rendering. Experiments on VoD, TJ4DRadSet, and OmniHD-Scenes demonstrate state-of-the-art performance, robustness to perturbations and weather, and favorable efficiency, suggesting the approach offers a scalable and interpretable path toward real-world multi-modal 3D perception.
Abstract
4D millimeter-wave radar is a promising sensing modality for autonomous driving, yet effective 3D object detection from 4D radar and monocular images remains challenging. Existing fusion approaches either rely on instance proposals lacking global context or dense BEV grids constrained by rigid structures, lacking a flexible and adaptive representation for diverse scenes. To address this, we propose RaGS, the first framework that leverages 3D Gaussian Splatting (GS) to fuse 4D radar and monocular cues for 3D object detection. 3D GS models the scene as a continuous field of Gaussians, enabling dynamic resource allocation to foreground objects while maintaining flexibility and efficiency. Moreover, the velocity dimension of 4D radar provides motion cues that help anchor and refine the spatial distribution of Gaussians. Specifically, RaGS adopts a cascaded pipeline to construct and progressively refine the Gaussian field. It begins with Frustum-based Localization Initiation (FLI), which unprojects foreground pixels to initialize coarse Gaussian centers. Then, Iterative Multimodal Aggregation (IMA) explicitly exploits image semantics and implicitly integrates 4D radar velocity geometry to refine the Gaussians within regions of interest. Finally, Multi-level Gaussian Fusion (MGF) renders the Gaussian field into hierarchical BEV features for 3D object detection. By dynamically focusing on sparse and informative regions, RaGS achieves object-centric precision and comprehensive scene perception. Extensive experiments on View-of-Delft, TJ4DRadSet, and OmniHD-Scenes demonstrate its robustness and SOTA performance. Code will be released.
