Radar Spectra-Language Model for Automotive Scene Parsing

Mariia Pushkareva, Yuri Feldman, Csaba Domokos, Kilian Rambach, Dotan Di Castro

TL;DR

This work tackles the interpretability and utility of automotive radar spectra by introducing a radar spectra-language model (RSLM) that aligns radar spectrum embeddings with a frozen vision-language model (VLM). By training a radar encoder to match image embeddings from automotive captions without requiring labeled radar data, the approach enables free-text querying of spectra and semantic retrieval of scene elements. The study demonstrates that RSLM embeddings can boost downstream tasks, improving object detection and free-space segmentation when injected into a baseline detector, and shows improved scene retrieval compared to non-fine-tuned baselines. The results suggest a practical path toward leveraging radar spectra for robust, weather-resilient autonomous driving, with limitations tied to caption quality and the need for more diverse automotive data.

Abstract

Radar sensors are low cost, long-range, and weather-resilient. Therefore, they are widely used for driver assistance functions, and are expected to be crucial for the success of autonomous driving in the future. In many perception tasks only pre-processed radar point clouds are considered. In contrast, radar spectra are a raw form of radar measurements and contain more information than radar point clouds. However, radar spectra are rather difficult to interpret. In this work, we aim to explore the semantic information contained in spectra in the context of automated driving, thereby moving towards better interpretability of radar spectra. To this end, we create a radar spectra-language model, allowing us to query radar spectra measurements for the presence of scene elements using free text. We overcome the scarcity of radar spectra data by matching the embedding space of an existing vision-language model. Finally, we explore the benefit of the learned representation for scene retrieval using radar spectra only, and obtain improvements in free space segmentation and object detection merely by injecting the spectra embedding into a baseline model.
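
To make the free-text querying idea concrete, here is a minimal sketch of how retrieval could work once radar and text embeddings share one space. It assumes, hypothetically, that frames are ranked by cosine similarity between a text-query embedding and each radar-spectrum embedding; the abstract does not state the similarity measure. The function names and the toy 3-D vectors are illustrative and do not come from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_spectra(text_embedding, radar_embeddings):
    """Return frame indices sorted from best to worst match, plus raw scores."""
    scores = [cosine_similarity(text_embedding, r) for r in radar_embeddings]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores


# Toy 3-D embeddings standing in for real encoder outputs (hypothetical values).
query = [1.0, 0.0, 0.0]          # stands in for the text-encoder output of a query
frames = [
    [0.0, 1.0, 0.0],             # radar frame 0: orthogonal to the query
    [0.9, 0.1, 0.0],             # radar frame 1: close to the query
    [0.5, 0.5, 0.0],             # radar frame 2: partially aligned
]
order, scores = rank_spectra(query, frames)
```

In a real pipeline the query vector would come from the text branch of the vision-language model and the frame vectors from the trained radar encoder; no camera image is needed at query time.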

Paper Structure (12 sections, 1 equation, 6 figures, 3 tables)

Figures (6)

  • Figure 1: Training of a radar spectra-language model uses a frozen vision-language model for supervision. The radar spectra encoder is trained to match the image embeddings of the corresponding RGB images. In this way, the text embeddings become aligned with the radar embeddings as well.
  • Figure 2: Architecture of the radar encoder, with FPN or CNN radar backbone. The MIMO encoder is chosen according to the dataset (CRUW or RADIal).
  • Figure 3: RSLM-aided detection and segmentation architecture. Input spectra are fed concurrently into the detection backbone and into the radar encoder of the pre-trained RSLM.
  • Figure 4: Data retrieval using the trained RSLM. The corresponding images are shown for visualization only; they are not used for data retrieval. The query used appears in the caption of each image.
  • Figure 5: Detection results of FFT-RadNet (green) and our proposed network "FFT-RadNet + RSLM encoder" (blue). The bounding box predicted by FFT-RadNet is displaced with respect to the ground truth (red), whereas the predictions of our model align well with the ground truth. Confidence score equals 1.0. Left: Bounding boxes in Cartesian coordinates; radar point clouds are displayed for reference only. Note that the models operate on spectral data, so objects may be predicted even at locations where no radar points are visible. Right: Bounding boxes projected onto the image.
  • ...and 1 more figures
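
The training setup in the Figure 1 caption can be sketched as an embedding-matching objective. The snippet below is a minimal sketch under a stated assumption: it uses mean cosine distance between each radar embedding and the frozen image embedding of its paired RGB frame. The exact loss used in the paper is not given in this excerpt, so `alignment_loss` and the toy vectors are hypothetical.

```python
import math


def _normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def alignment_loss(radar_embeddings, image_embeddings):
    """Mean cosine distance between paired radar and image embeddings.

    The image embeddings come from the frozen vision-language model and act
    as fixed targets; only the radar encoder would receive gradients.
    """
    total = 0.0
    for r, i in zip(radar_embeddings, image_embeddings):
        r_hat, i_hat = _normalize(r), _normalize(i)
        total += 1.0 - sum(a * b for a, b in zip(r_hat, i_hat))
    return total / len(radar_embeddings)


# Toy 2-D targets and two candidate radar-encoder outputs (hypothetical values).
image_targets = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[2.0, 0.0], [0.0, 3.0]]       # same directions as the targets
misaligned = [[0.0, 1.0], [1.0, 0.0]]    # orthogonal to the targets

loss_aligned = alignment_loss(aligned, image_targets)
loss_misaligned = alignment_loss(misaligned, image_targets)
```

Because the text encoder of the vision-language model already shares a space with its image encoder, driving this distance toward zero is what lets text queries be compared against radar embeddings without any radar labels.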