Scaling Foundation Models for Radar Scene Understanding
Pushkal Mishra, Kshitiz Bansal, Dinesh Bharadia
TL;DR
RadarFM addresses the fragmentation of radar perception by learning unified, spatially grounded scene representations through structured natural language supervision. It combines a CLIP-based radar encoder with a hash-aware contrastive objective and a generative captioning pathway to model continuous spatial similarity in radar scenes. The authors introduce a large-scale CARLA-based radar dataset with spatially grounded JSON annotations, plus localization-aware metrics that evaluate spatial reasoning beyond traditional detection metrics. Experimental results show attention focusing on vehicle-rich regions, superior far-field spatial reasoning with distance-stratified models, and robust caption generation, suggesting practical benefits for end-to-end radar-driven perception and sim-to-real transfer. Overall, RadarFM advances radar-centric foundation models by enabling fine-grained, semantically meaningful scene understanding in adverse conditions and across diverse driving scenarios.
Abstract
Radar sensors provide reliable perception across adverse weather, lighting, and long-range conditions. Recent advances in foundation models have transformed visual and language understanding, yet their integration with radar sensing remains largely underexplored. Existing radar approaches are fragmented and task-specific; each downstream task employs distinct architectures and training objectives, preventing transfer across tasks. In this work, we introduce RadarFM: a radar foundation model that learns unified scene-level representations through structured spatial language supervision. We make two key contributions: (1) a structured caption framework that encodes vehicle distributions in native radar coordinates, and (2) a hash-aware contrastive learning objective that quantifies continuous scene similarity rather than binary matching, enabling fine-grained spatial reasoning. Leveraging the CARLA simulator, we generate large-scale, well-annotated radar datasets across diverse driving scenarios. We also propose localization-aware metrics that assess spatial accuracy beyond traditional detection measures.
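To make the idea of continuous scene similarity concrete, here is a minimal sketch of a contrastive loss with soft targets instead of one-hot matches. It is an illustration under stated assumptions, not the authors' implementation. The abstract does not specify how the hash is built or how similarity is computed, so this sketch assumes that vehicle positions are given as (range in meters, azimuth in degrees), that they are hashed into coarse polar cells, and that two scenes are compared by the Jaccard overlap of their occupied cells. The function names, bin sizes, and the cross-entropy form are all hypothetical.

```python
import math


def occupied_cells(vehicles, range_bin_m=10.0, azimuth_bin_deg=15.0):
    """Hash (range_m, azimuth_deg) positions into a set of coarse polar cells.

    The bin sizes are illustrative assumptions, not values from the paper.
    """
    return {
        (math.floor(r / range_bin_m), math.floor(az / azimuth_bin_deg))
        for r, az in vehicles
    }


def scene_similarity(cells_a, cells_b):
    """Jaccard overlap in [0, 1]; two empty scenes are treated as identical."""
    union = cells_a | cells_b
    if not union:
        return 1.0
    return len(cells_a & cells_b) / len(union)


def soft_targets(scenes):
    """Row-normalized pairwise scene similarities, used as soft labels."""
    cells = [occupied_cells(s) for s in scenes]
    targets = []
    for a in cells:
        row = [scene_similarity(a, b) for b in cells]
        total = sum(row)  # always >= 1 because each scene matches itself
        targets.append([v / total for v in row])
    return targets


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - lse for v in row]


def soft_contrastive_loss(logits, targets):
    """Mean cross-entropy between soft target rows and softmax(logits) rows.

    logits[i][j] stands for the scaled similarity between radar embedding i
    and caption embedding j; with one-hot targets this reduces to the usual
    one-directional contrastive (InfoNCE-style) loss.
    """
    total = 0.0
    for logit_row, target_row in zip(logits, targets):
        logp = log_softmax(logit_row)
        total -= sum(t * lp for t, lp in zip(target_row, logp))
    return total / len(logits)
```

In this sketch, two scenes whose vehicles fall into the same cells share the target mass equally, so the loss does not penalize the model for scoring them as near-matches, whereas a binary objective would treat one of them as a hard negative.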
