BEVCar: Camera-Radar Fusion for BEV Map and Object Segmentation

Jonas Schramm, Niclas Vödisch, Kürsat Petek, B Ravi Kiran, Senthil Yogamani, Wolfram Burgard, Abhinav Valada

TL;DR

BEVCar addresses BEV map and object segmentation by fusing surround-view cameras with automotive radar. It introduces a learned radar point encoding and a radar-guided, attention-based image lifting pipeline, followed by cross-attention fusion and a multitask BEV segmentation head. On nuScenes, BEVCar outperforms camera-only and prior camera-radar methods, with substantial gains in vehicle IoU and map IoU and improved robustness under rain and night conditions. The work demonstrates the practical value of radar for robust BEV perception and provides public weather splits and code to accelerate future research.

Abstract

Semantic scene segmentation from a bird's-eye-view (BEV) perspective plays a crucial role in facilitating planning and decision-making for mobile robots. Although recent vision-only methods have demonstrated notable advancements in performance, they often struggle under adverse conditions such as rain or nighttime. While active sensors offer a solution to this challenge, the prohibitively high cost of LiDARs remains a limiting factor. Fusing camera data with automotive radars offers a less expensive alternative but has received less attention in prior research. In this work, we aim to advance this promising avenue by introducing BEVCar, a novel approach for joint BEV object and map segmentation. The core novelty of our approach lies in first learning a point-based encoding of raw radar data, which is then leveraged to efficiently initialize the lifting of image features into the BEV space. We perform extensive experiments on the nuScenes dataset and demonstrate that BEVCar outperforms the current state of the art. Moreover, we show that incorporating radar information significantly enhances robustness in challenging environmental conditions and improves segmentation performance for distant objects. To foster future research, we provide the weather split of the nuScenes dataset used in our experiments, along with our code and trained models at http://bevcar.cs.uni-freiburg.de.

Paper Structure

This paper contains 12 sections, 6 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: We propose a novel method for BEV Camera-radar fusion (BEVCar) for map and object segmentation. We demonstrate that BEVCar yields more accurate predictions under adverse weather conditions than camera-only baselines while outperforming prior camera-radar works [harley2023simplebev].
  • Figure 2: Overview of our proposed BEVCar approach for camera-radar fusion for BEV map and object segmentation. We utilize a frozen DINOv2 [oquab2023dinov2] with a learnable adapter to encode the surround-view images. Inspired by LiDAR-based perception [zhou2018voxelnet], we employ a learnable radar encoding instead of processing the raw metadata. We then lift the image features to the BEV space via deformable attention including the novel radar-driven query initialization scheme. Finally, we fuse the lifted image representation with the learned radar features in an attention-based manner and perform multi-class BEV segmentation for both vehicles and the map categories.
  • Figure 3: Inspired by LiDAR processing, we encode the radar data with fully connected layers (FCN) in a point-wise manner and combine point features within a voxel with max pooling. Subsequently, we employ a CNN-based height compression to obtain the overall radar features in the BEV space.
  • Figure 4: Our data-driven query initialization scheme leverages 3D radar information to guide lifting the 2D image features to the BEV space. While the image BEV features are only obtained from uniform assignment along camera rays, the final $Q_\mathit{img}^L$ considers depth from radar via deformable attention.
  • Figure 5: Qualitative results of our proposed BEVCar, the camera-only baseline, and Simple-BEV++ (ViT-B/14), for which we also show the improvement/error map. Pixels misclassified by Simple-BEV++ and correctly predicted by BEVCar are shown in green, pixels misclassified by BEVCar and correctly predicted by Simple-BEV++ in blue, and pixels misclassified by both models in red.
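
The color coding of the improvement/error map described for Figure 5 can be illustrated with a minimal sketch. This is not the authors' visualization code: the function name `improvement_error_map`, the nested-list grid format, and the `"none"` label for cells that both models classify correctly are assumptions made purely for illustration.

```python
from typing import List

# Hypothetical labels; the caption only names the three colors.
GREEN, BLUE, RED, NONE = "green", "blue", "red", "none"


def improvement_error_map(
    gt: List[List[int]],
    pred_bevcar: List[List[int]],
    pred_baseline: List[List[int]],
) -> List[List[str]]:
    """Label every BEV cell by which model misclassified it.

    green: baseline wrong, BEVCar correct
    blue:  BEVCar wrong, baseline correct
    red:   both wrong
    none:  both correct (assumed to be left uncolored)
    """
    result = []
    for gt_row, bevcar_row, base_row in zip(gt, pred_bevcar, pred_baseline):
        row = []
        for g, a, b in zip(gt_row, bevcar_row, base_row):
            bevcar_ok = a == g
            baseline_ok = b == g
            if bevcar_ok and not baseline_ok:
                row.append(GREEN)
            elif baseline_ok and not bevcar_ok:
                row.append(BLUE)
            elif not bevcar_ok and not baseline_ok:
                row.append(RED)
            else:
                row.append(NONE)
        result.append(row)
    return result


# Toy 2x2 binary vehicle mask (1 = vehicle, 0 = background).
gt = [[1, 0], [1, 1]]
bevcar = [[1, 0], [0, 0]]
baseline = [[0, 0], [1, 0]]
error_map = improvement_error_map(gt, bevcar, baseline)
```

In the toy example, the top-left cell is green (only the baseline is wrong), the bottom-left cell is blue (only BEVCar is wrong), and the bottom-right cell is red (both are wrong).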