Benchmarking and Improving Bird's Eye View Perception Robustness in Autonomous Driving

Shaoyuan Xie, Lingdong Kong, Wenwei Zhang, Jiawei Ren, Liang Pan, Kai Chen, Ziwei Liu

TL;DR

The paper introduces RoboBEV, a large-scale robustness benchmark for BEV perception in autonomous driving, designed to test resilience under eight natural corruptions across three severity levels and complete sensor failures. It evaluates 33 BEV models across detection, map segmentation, depth estimation, and occupancy prediction, revealing a strong link between in-distribution performance and robustness while noting that higher clean scores do not guarantee better robustness. Key findings show that depth-free BEV transformations, model pre-training, and rich temporal information substantially boost robustness, and that CLIP-based backbones offer additional gains with careful training strategies. The work provides practical guidance for building robust BEV systems, introduces mCE and mRR as cross-corruption robustness metrics, and offers a publicly available benchmark and model zoo to spur future research in robust BEV perception.
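The two metrics named above can be illustrated with a short sketch. The section does not reproduce the paper's equations, so the formulas below are assumptions in the style of ImageNet-C: a corruption error that normalizes a model's summed error over severity levels by a baseline model's, and a resilience rate that compares the corrupted score to the clean score. The function names, the choice of score in [0, 1], and all numbers are hypothetical.

```python
from statistics import mean

def corruption_error(scores, baseline_scores):
    """Assumed CE for one corruption: summed error (1 - score) over the
    severity levels, divided by the same quantity for a baseline model."""
    model_err = sum(1.0 - s for s in scores)
    base_err = sum(1.0 - s for s in baseline_scores)
    return model_err / base_err

def resilience_rate(scores, clean_score):
    """Assumed RR for one corruption: mean corrupted score over the
    severity levels, relative to the score on clean data."""
    return sum(scores) / (len(scores) * clean_score)

def mean_corruption_error(model, baseline):
    """Average CE over corruption types (lower is better under this definition)."""
    return mean(corruption_error(model[c], baseline[c]) for c in model)

def mean_resilience_rate(model, clean_score):
    """Average RR over corruption types (higher is better under this definition)."""
    return mean(resilience_rate(model[c], clean_score) for c in model)

# Hypothetical scores per corruption type at severity levels 1, 2, 3.
baseline = {"fog": [0.40, 0.35, 0.30], "snow": [0.30, 0.20, 0.10]}
model = {"fog": [0.45, 0.40, 0.35], "snow": [0.35, 0.25, 0.15]}

mce = mean_corruption_error(model, baseline)
mrr = mean_resilience_rate(model, clean_score=0.50)
```

Under these assumed definitions, a model identical to the baseline has an mCE of 1, and a model that loses nothing under corruption has an mRR of 1. The two can disagree: mCE rewards high absolute scores, while mRR rewards a small relative drop.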

Abstract

Recent advancements in bird's eye view (BEV) representations have shown remarkable promise for in-vehicle 3D perception. However, while these methods have achieved impressive results on standard benchmarks, their robustness in varied conditions remains insufficiently assessed. In this study, we present RoboBEV, an extensive benchmark suite designed to evaluate the resilience of BEV algorithms. This suite incorporates a diverse set of camera corruption types, each examined over three severity levels. Our benchmarks also consider the impact of complete sensor failures that occur when using multi-modal models. Through RoboBEV, we assess 33 state-of-the-art BEV-based perception models spanning tasks like detection, map segmentation, depth estimation, and occupancy prediction. Our analyses reveal a noticeable correlation between the model's performance on in-distribution datasets and its resilience to out-of-distribution challenges. Our experimental results also underline the efficacy of strategies like pre-training and depth-free BEV transformations in enhancing robustness against out-of-distribution data. Furthermore, we observe that leveraging extensive temporal information significantly improves the model's robustness. Based on our observations, we design an effective robustness enhancement strategy based on the CLIP model. The insights from this study pave the way for the development of future BEV models that seamlessly combine accuracy with real-world robustness.
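To make "corruption types, each examined over three severity levels" concrete, here is a minimal sketch that applies three of the corruption kinds to a tiny grayscale image stored as nested lists. The per-severity parameters, function names, and camera view names are hypothetical and are not the benchmark's actual settings or code; the sketch only shows the shape of such a pipeline.

```python
# Hypothetical per-severity parameters (levels 1..3); not the benchmark's values.
BRIGHTNESS_GAIN = {1: 1.2, 2: 1.5, 3: 1.8}
DARKNESS_GAIN = {1: 0.7, 2: 0.5, 3: 0.3}
QUANT_BITS = {1: 5, 2: 4, 3: 3}

def _map_pixels(image, fn):
    """Apply fn to every pixel of a row-major 8-bit grayscale image."""
    return [[fn(p) for p in row] for row in image]

def scale_intensity(image, gain):
    """Multiply pixel values by gain and clip the result to [0, 255]."""
    return _map_pixels(image, lambda p: min(255, max(0, round(p * gain))))

def color_quantize(image, bits):
    """Keep only the top `bits` bits of each 8-bit pixel value."""
    step = 1 << (8 - bits)
    return _map_pixels(image, lambda p: (p // step) * step)

def corrupt(image, kind, severity):
    """Dispatch one corruption kind at severity level 1, 2, or 3."""
    if kind == "brightness":
        return scale_intensity(image, BRIGHTNESS_GAIN[severity])
    if kind == "darkness":
        return scale_intensity(image, DARKNESS_GAIN[severity])
    if kind == "color_quant":
        return color_quantize(image, QUANT_BITS[severity])
    raise ValueError(f"unknown corruption kind: {kind}")

def camera_crash(views, crashed):
    """Black out the named camera views and leave the others untouched."""
    return {
        name: _map_pixels(img, lambda p: 0) if name in crashed else img
        for name, img in views.items()
    }

image = [[0, 100, 200]]
bright = corrupt(image, "brightness", 3)
dark = corrupt(image, "darkness", 3)
quantized = corrupt(image, "color_quant", 3)
views = camera_crash({"front": image, "back": image}, crashed={"back"})
```

Evaluating a model on every (corruption kind, severity) pair produced this way yields the per-corruption scores that a cross-corruption metric would then aggregate.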

Paper Structure (38 sections, 2 equations, 13 figures, 11 tables)

Figures (13)

  • Figure 1: RoboBEV benchmark designs. The benchmark covers four distinct BEV perception tasks (detection, segmentation, occupancy prediction, and depth estimation), four sensor type configurations across LiDAR, cameras, and joint setups (camera corruption, camera failure, and LiDAR failure), and eight natural image corruptions (Brightness, Darkness, Fog, Snow, Motion Blur, Color Quantization, Camera Crash, and Frame Lost), each at three severity levels.
  • Figure 2: Histograms of pixel distributions for different corruption types. While certain corruptions exhibit minimal shifts in pixel distribution (e.g., Motion Blur), it is noteworthy that these alterations predominantly have adverse effects on the overall performance of the BEV perception systems.
  • Figure 3: (a): The mCE metric shows a linear relationship with "clean" performance while (b): the mRR metric confronts the risk of decreasing. (c): We observe strong correlations where large depth estimation errors under Snow and Dark tend to cause drastic performance drops.
  • Figure 4: Two steps to transfer the robustness of CLIP [radford2021learning] to BEVDet [huang2021bevdet]. The first step aligns the detection head to the frozen CLIP backbone using corruption-augmented inputs. The second step fine-tunes the backbone and detection head end-to-end to enhance robustness.
  • Figure 5: Depth estimation results of BEVDepth [li2022bevdepth] under different corruption types. The model's sensitivity differs from one corruption scenario to another.
  • ...and 8 more figures