RadarGen: Automotive Radar Point Cloud Generation from Cameras

Tomer Borreda, Fangqiang Ding, Sanja Fidler, Shengyu Huang, Or Litany

TL;DR

RadarGen tackles the lack of scalable, realistic radar data in multimodal autonomous driving simulators by introducing a diffusion-based framework that generates radar point clouds from multi-view camera images. It represents radar as bird's-eye-view (BEV) maps encoding point density, radar cross section (RCS), and Doppler, and conditions generation on BEV priors derived from foundation models for depth, semantics, and motion. A deconvolution step then recovers sparse 3D points from the generated maps. The approach yields higher geometric and attribute fidelity than a strong baseline, enables scene editing via image manipulation, and produces point clouds that downstream perception models can consume. This work advances multimodal generative simulation by bridging vision and radar sensing in a scalable, controllable manner.

Abstract

We present RadarGen, a diffusion model for synthesizing realistic automotive radar point clouds from multi-view camera imagery. RadarGen adapts efficient image-latent diffusion to the radar domain by representing radar measurements in bird's-eye-view form that encodes spatial structure together with radar cross section (RCS) and Doppler attributes. A lightweight recovery step reconstructs point clouds from the generated maps. To better align generation with the visual scene, RadarGen incorporates BEV-aligned depth, semantic, and motion cues extracted from pretrained foundation models, which guide the stochastic generation process toward physically plausible radar patterns. Conditioning on images makes the approach broadly compatible, in principle, with existing visual datasets and simulation frameworks, offering a scalable direction for multimodal generative simulation. Evaluations on large-scale driving data show that RadarGen captures characteristic radar measurement distributions and reduces the gap to perception models trained on real data, marking a step toward unified generative simulation across sensing modalities.
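
The BEV representation mentioned in the abstract is described in the Figure 3 caption as three maps: a Point Density Map $M_p$ (rasterized point locations convolved with a Gaussian kernel $K_\sigma$) and two Voronoi-style attribute maps $M_r$ and $M_d$ for RCS and Doppler. The following is a minimal sketch of that construction, not the authors' implementation. The function name `radar_to_bev_maps`, the grid parameters `extent`, `resolution`, and `sigma`, and the x-to-column, y-to-row axis convention are all assumptions made for illustration, and the Voronoi tessellation is approximated on the pixel grid with a nearest-occupied-cell lookup.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt, gaussian_filter


def radar_to_bev_maps(points_xy, rcs, doppler, extent=50.0, resolution=0.5, sigma=1.5):
    """Rasterize radar points into BEV maps (M_p, M_r, M_d).

    points_xy: (N, 2) positions in metres, ego vehicle at the origin.
    rcs, doppler: (N,) per-point attributes.
    extent, resolution, sigma: hypothetical grid half-width (m), cell size (m),
    and Gaussian kernel width (pixels); these are not values from the paper.
    """
    points_xy = np.asarray(points_xy, dtype=float).reshape(-1, 2)
    rcs = np.asarray(rcs, dtype=float)
    doppler = np.asarray(doppler, dtype=float)
    size = int(round(2 * extent / resolution))

    # Rasterize: metric x -> column, metric y -> row (an assumed convention).
    cols = np.floor((points_xy[:, 0] + extent) / resolution).astype(int)
    rows = np.floor((points_xy[:, 1] + extent) / resolution).astype(int)
    inside = (rows >= 0) & (rows < size) & (cols >= 0) & (cols < size)
    rows, cols, rcs, doppler = rows[inside], cols[inside], rcs[inside], doppler[inside]

    occupancy = np.zeros((size, size))
    np.add.at(occupancy, (rows, cols), 1.0)  # count points per cell
    if rows.size == 0:
        return occupancy, occupancy.copy(), occupancy.copy()

    # Point density map M_p: point locations convolved with a Gaussian kernel.
    m_p = gaussian_filter(occupancy, sigma=sigma, mode="constant")

    # Discrete Voronoi tessellation: every cell copies the attributes of its
    # nearest occupied cell. If several points share a cell, the last one wins.
    rcs_grid = np.zeros((size, size))
    doppler_grid = np.zeros((size, size))
    rcs_grid[rows, cols] = rcs
    doppler_grid[rows, cols] = doppler
    nearest = distance_transform_edt(
        occupancy == 0, return_distances=False, return_indices=True
    )
    m_r = rcs_grid[nearest[0], nearest[1]]
    m_d = doppler_grid[nearest[0], nearest[1]]
    return m_p, m_r, m_d
```

Storing attributes in dense Voronoi cells rather than only at the occupied pixels means a later lookup still returns a sensible RCS and Doppler value when a recovered point lands a pixel or two away from its true location.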

Paper Structure

This paper contains 37 sections, 11 equations, 12 figures, 3 tables.

Figures (12)

  • Figure 1: Controllable radar synthesis from vision. (Top) Given multi-view camera images, RadarGen generates realistic radar point clouds that align with real-world radar statistics and can be consumed by downstream perception models. (Bottom) The generation is semantically consistent: modifying the input scene with an off-the-shelf image editing tool (e.g., replacing a distant car with a closer truck) updates the radar response, removing returns from newly occluded regions and reflecting the new object geometry.
  • Figure 2: Overview of RadarGen. (Left) Multi-view posed images at times $t$ and $t+\Delta t$ are fed through foundation models for metric depth estimation [piccinelli2025unidepthv2], semantic segmentation [cheng2021mask2former], and optical flow [zhang2025ufm], enabling projection of the scene to BEV, with different information encoded through color. (Middle) The encoded BEV representation is concatenated with a modality indicator specifying which map type to generate. During inference, the map is initialized as noise; during training, noise is added to the ground-truth (GT) maps, and the Latent Encoder/Decoder are frozen while SANA's DiT [xie2024sana, peebles2023scalable] is fine-tuned. (Right) During inference, the generated Point Density Map is deconvolved using an IRL1 solver. The resulting sparse map is used to retrieve the RCS and Doppler values at the corresponding locations, yielding the final generated radar point cloud. Point color represents Doppler and point size represents RCS.
  • Figure 3: Overview of representing radar as images (\ref{sec:method-radar-as-images}). Constructing radar maps from a radar point cloud requires first rasterizing each point to BEV. The point locations are then convolved with a Gaussian kernel $K_\sigma$ to produce the Point Density Map $M_p$. A Voronoi tessellation is also constructed, where each cell inherits the RCS and Doppler attributes from its corresponding point, producing the maps $M_r$ and $M_d$ respectively. Point color represents Doppler and point size represents RCS.
  • Figure 4: Qualitative results. Our model generates point clouds with higher geometric and attribute fidelity to the ground truth compared to the baseline. RadarGen uses inputs $t$ and $t+\Delta t$, while the baseline uses only $t$. Ground truth bounding boxes are highlighted in color.
  • Figure 5: Scene editing. Modifying the input images using an off-the-shelf image editing tool updates the radar response, demonstrating object removal (left) and insertion (right).
  • ...and 7 more figures
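
The recovery step in the Figure 2 caption (deconvolve the Point Density Map, then read RCS and Doppler at the resulting sparse locations) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's solver. It reads "IRL1" as iteratively reweighted $\ell_1$ minimization, which is an assumption. The objective, the proximal-gradient inner loop, the helper names `blur`, `irl1_deconvolve`, and `read_out_points`, and every hyperparameter (`sigma`, `lam`, `eps`, `rel_threshold`, iteration counts) are hypothetical choices.

```python
import numpy as np
from scipy.ndimage import gaussian_filter


def blur(x, sigma):
    # Zero-padded Gaussian blur. The kernel is symmetric, so this operator is
    # its own adjoint, and its norm is at most 1 because the kernel sums to 1.
    return gaussian_filter(x, sigma=sigma, mode="constant")


def irl1_deconvolve(density, sigma=1.5, lam=1e-3, eps=1e-3,
                    outer_iters=5, inner_iters=100):
    """Sparse non-negative deconvolution by iteratively reweighted l1.

    Approximately minimizes 0.5 * ||K_sigma * x - density||^2 + lam * sum(w * x)
    over x >= 0, refreshing w = 1 / (x + eps) after each inner solve.
    All hyperparameters are illustrative, not taken from the paper.
    """
    density = np.asarray(density, dtype=float)
    x = np.zeros_like(density)
    weights = np.ones_like(density)
    for _ in range(outer_iters):
        for _ in range(inner_iters):
            # Gradient of the data term, followed by the proximal step for the
            # weighted l1 penalty plus non-negativity (step size 1).
            grad = blur(blur(x, sigma) - density, sigma)
            x = np.maximum(x - grad - lam * weights, 0.0)
        weights = 1.0 / (x + eps)  # small entries are penalized harder next round
    return x


def read_out_points(sparse_map, m_r, m_d, rel_threshold=0.1):
    """Return cells above a relative threshold with their RCS and Doppler values."""
    cells = np.argwhere(sparse_map > rel_threshold * sparse_map.max())
    rows, cols = cells[:, 0], cells[:, 1]
    return cells, m_r[rows, cols], m_d[rows, cols]
```

A possible usage, with `m_p`, `m_r`, and `m_d` standing in for generated maps: `cells, rcs, doppler = read_out_points(irl1_deconvolve(m_p), m_r, m_d)`. The reweighting is what drives sparsity here: cells that stay small after one inner solve receive a large penalty in the next, so mass concentrates on a few cells instead of spreading into a blurred blob.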