Scaling Foundation Models for Radar Scene Understanding

Pushkal Mishra, Kshitiz Bansal, Dinesh Bharadia

TL;DR

RadarFM addresses the fragmentation of radar perception by learning unified, spatially grounded scene representations through structured natural language supervision. It combines a CLIP-based radar encoder with a hash-aware contrastive objective and a generative captioning pathway to model continuous spatial similarity in radar scenes. The authors introduce a large-scale CARLA-based radar dataset with spatially grounded JSON annotations, plus localization-aware metrics that evaluate spatial reasoning beyond traditional detection metrics. Experimental results show attention focusing on vehicle-rich regions, superior far-field spatial reasoning with distance-stratified models, and robust caption generation, suggesting practical benefits for end-to-end radar-driven perception and sim-to-real transfer. Overall, RadarFM advances radar-centric foundation models by enabling fine-grained, semantically meaningful scene understanding in adverse conditions and across diverse driving scenarios.

Abstract

Radar sensors provide reliable perception across adverse weather, lighting, and long-range conditions. Recent advances in foundation models have transformed visual and language understanding, yet their integration with radar sensing remains largely underexplored. Existing radar approaches are fragmented and task-specific; each downstream task employs distinct architectures and training objectives, preventing transfer across tasks. In this work, we introduce RadarFM: a radar foundation model that learns unified scene-level representations through structured spatial language supervision. We make two key contributions: (1) a structured caption framework that encodes vehicle distributions in native radar coordinates, and (2) a hash-aware contrastive learning objective that quantifies continuous scene similarity rather than binary matching, enabling fine-grained spatial reasoning. Leveraging the CARLA simulator, we generate large-scale, well-annotated radar datasets across diverse driving scenarios. We also propose localization-aware metrics that assess spatial accuracy beyond traditional detection measures.
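To make the contrast between binary matching and continuous scene similarity concrete, here is a minimal sketch. It is not the authors' implementation. It assumes each scene is summarized as a set of occupied (distance bin, angular sector) cells and that similarity between two scenes is the Jaccard overlap of those sets. The function names, the Jaccard choice, and the temperature value are illustrative assumptions, and only the radar-to-text direction of the loss is shown.

```python
import math


def jaccard(cells_a, cells_b):
    """Overlap of two sets of occupied cells; defined as 1.0 when both are empty."""
    if not cells_a and not cells_b:
        return 1.0
    return len(cells_a & cells_b) / len(cells_a | cells_b)


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_z for v in row]


def soft_contrastive_loss(radar_emb, text_emb, scene_cells, temperature=0.07):
    """Cross-entropy between radar-to-text logits and similarity-based soft targets.

    A standard CLIP-style loss would use one-hot targets (only the paired
    caption counts as a match). Here every caption in the batch receives a
    target weight proportional to how much its scene overlaps with scene i.
    """
    n = len(radar_emb)
    total_loss = 0.0
    for i in range(n):
        logits = [
            sum(r * t for r, t in zip(radar_emb[i], text_emb[j])) / temperature
            for j in range(n)
        ]
        log_probs = log_softmax(logits)
        sims = [jaccard(scene_cells[i], scene_cells[j]) for j in range(n)]
        norm = sum(sims)  # always > 0 because jaccard(scene, scene) == 1.0
        targets = [s / norm for s in sims]
        total_loss -= sum(t * lp for t, lp in zip(targets, log_probs))
    return total_loss / n


# Two toy scenes with disjoint occupancy: targets reduce to one-hot matching.
radar = [[1.0, 0.0], [0.0, 1.0]]
text = [[1.0, 0.0], [0.0, 1.0]]
cells = [{(0, 0)}, {(1, 1)}]
aligned_loss = soft_contrastive_loss(radar, text, cells)
swapped_loss = soft_contrastive_loss(radar, text[::-1], cells)
```

With disjoint scenes the soft targets collapse to the usual one-hot case, so correctly paired embeddings give a much lower loss than swapped ones. When two scenes share occupied cells, part of the target mass moves to the other caption, which is the sense in which similarity becomes continuous rather than binary.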

Paper Structure

This paper contains 30 sections, 12 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Lane distribution visualization showing the twelve lane-relative angular sectors used for spatial encoding of vehicles relative to the ego vehicle.
  • Figure 2: Dataset collection overview. (a) Different sensor viewpoints of a traffic example. (b) Extracted data of vehicles in the scene, with their lane classification. The dashed circles represent distance radii of 10m, 20m, 30m, and 40m. (c) Lane-wise distribution of collected data across the entire dataset represented in radial format.
  • Figure 3: The pre-training phase. Radar range-angle heatmaps are encoded via a ViT-B/16 backbone, while captions are processed through a GPT-2-like transformer encoder. The radar and text embeddings are projected into a shared 512-dimensional space, where hash-aware contrastive learning aligns semantically similar scenes based on spatial configuration overlap.
  • Figure 4: Fine-tuning phase with generative captioning. The pre-trained text encoder is frozen, and a lightweight transformer-based mapping network projects radar embeddings into GPT-2's input space. This enables post-training autoregressive generation with radar embeddings as input to the projection network.
  • Figure 5: Attention rollout visualization from the pre-trained radar encoder. Top: Camera reference image. Middle: Ground truth (left) and predicted (right) vehicle distributions in radial format across distance bins and angular sectors. Bottom: Radar range-angle heatmap (left) and attention rollout overlay (right) showing cumulative attention weights from the final three transformer layers. The attention weights concentrate on regions containing vehicles, supporting the effectiveness of hash-aware contrastive learning.
  • ...and 1 more figures
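
The captions for Figures 1 and 2 describe vehicles being placed into twelve angular sectors and distance rings at 10 m, 20 m, 30 m, and 40 m. The sketch below shows one way such a discretization could be computed. It makes a simplifying assumption: the paper's sectors are lane-relative, whereas this example uses twelve uniform 30-degree sectors around the ego vehicle. The coordinate convention (x forward, y left, ego at the origin), the function names, and the choice to drop vehicles at or beyond 40 m are also assumptions made for illustration.

```python
import bisect
import math

RING_EDGES_M = [10.0, 20.0, 30.0, 40.0]  # distance radii from the Figure 2 caption
NUM_SECTORS = 12                          # sector count from the Figure 1 caption


def cell_for_vehicle(x, y):
    """Map an ego-relative position in metres to (ring index, sector index).

    Returns None for vehicles at or beyond the outermost ring (assumed cutoff).
    Sectors are uniform 30-degree wedges here, which is a stand-in for the
    lane-relative sectors used in the paper.
    """
    distance = math.hypot(x, y)
    if distance >= RING_EDGES_M[-1]:
        return None
    ring = bisect.bisect_right(RING_EDGES_M, distance)
    angle_deg = math.degrees(math.atan2(y, x)) % 360.0
    sector = int(angle_deg // (360.0 / NUM_SECTORS)) % NUM_SECTORS
    return ring, sector


def occupied_cells(vehicles):
    """Collect the set of occupied cells for a list of (x, y) positions."""
    cells = set()
    for x, y in vehicles:
        cell = cell_for_vehicle(x, y)
        if cell is not None:
            cells.add(cell)
    return cells


scene = [(5.0, 0.0), (10.0, 10.0), (-25.0, -1.0), (50.0, 0.0)]
scene_cells = occupied_cells(scene)
```

A set of occupied cells like `scene_cells` is one compact way to compare the spatial configuration of two scenes, for example by measuring how many cells they share.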