RESBev: Making BEV Perception More Robust

Lifeng Zhuo, Kefan Jin, Zhe Liu, Hesheng Wang

TL;DR

This work proposes RESBev, a resilient, plug-and-play method that can be applied to existing BEV perception models to improve their robustness to diverse disturbances. It does so by reframing perception robustness as a latent semantic prediction problem.

Abstract

Bird's-eye-view (BEV) perception has emerged as a cornerstone of autonomous driving systems, providing a structured, ego-centric representation critical for downstream planning and control. However, real-world deployment faces challenges from sensor degradation and adversarial attacks, which can cause severe perceptual anomalies and ultimately compromise the safety of autonomous driving systems. To address this, we propose a resilient and plug-and-play BEV perception method, RESBev, which can be easily applied to existing BEV perception methods to enhance their robustness to diverse disturbances. Specifically, we reframe perception robustness as a latent semantic prediction problem. A latent world model is constructed to extract spatiotemporal correlations across sequential BEV observations, thereby learning the underlying BEV state transitions to predict clean BEV features for reconstructing corrupted observations. The proposed framework operates at the semantic feature level of the Lift-Splat-Shoot pipeline, enabling recovery that generalizes across both natural disturbances and adversarial attacks without modifying the underlying backbone. Extensive experiments on the nuScenes dataset demonstrate that, with few-shot fine-tuning, RESBev significantly improves the robustness of existing BEV perception models against various external disturbances and adversarial attacks.

Paper Structure

This paper contains 28 sections, 5 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: (a) By leveraging a latent world model, our method generates a predictive prior from past clean frames and fuses it with current corrupted observations to produce the final reconstructed BEV representation. (b) Performance comparison demonstrates that RESBev significantly enhances the baseline LSS model's resilience against diverse natural disturbances and adversarial attacks.
  • Figure 2: Analysis driving architectural choices. We justify operating in the BEV space and employing a generative mechanism. Panel (a) demonstrates the spatiotemporal stability of BEV features compared to image features under environmental noise. Panel (b) reveals that standard backbones are brittle to imperceptible attacks, necessitating a generative prior for restoration rather than simple filtering or aggregation.
  • Figure 3: The overall architecture of our proposed temporal fusion model. The model consists of two core components: a Semantic Prior Predictor that predicts the current BEV state from the past, and an Anomaly Reconstructor that fuses this prediction with current fused BEV features.
  • Figure 4: The probabilistic graphical model of our training framework.
  • Figure 5: t-SNE visualization of BEV features. The visualization of ten corrupted and clean features exhibits a radial corruption geometry where severity and semantic displacement are tightly coupled.
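
The predict-then-fuse idea described in the captions for Figures 1 and 3 can be illustrated with a toy sketch. This is not the paper's implementation: the learned latent world model is replaced here by a plain linear extrapolation over the last two clean frames, and the reconstruction step by a hand-written per-element gate. The names `predict_prior`, `fuse`, and `tau` are hypothetical, and BEV features are reduced to short lists of floats for readability.

```python
import math


def predict_prior(past_frames):
    """Toy stand-in for a learned world model (an assumption, not the paper's
    predictor): linearly extrapolate the next feature vector from the two
    most recent clean frames."""
    if len(past_frames) < 2:
        raise ValueError("need at least two past frames")
    prev, last = past_frames[-2], past_frames[-1]
    return [2.0 * b - a for a, b in zip(prev, last)]


def fuse(prior, observed, tau=1.0):
    """Blend the predicted prior with the current observation.

    The gate is close to 1 where the observation agrees with the prior and
    decays toward 0 as they diverge, so strongly deviating elements are
    pulled back toward the prediction. `tau` (hypothetical) sets how quickly
    trust in the observation decays.
    """
    fused = []
    for p, o in zip(prior, observed):
        gate = math.exp(-abs(o - p) / tau)
        fused.append(gate * o + (1.0 - gate) * p)
    return fused


# Two past clean frames, then a current frame whose last element is corrupted.
past = [[0.0, 1.0, 2.0], [1.0, 2.0, 3.0]]
clean_now = [2.0, 3.0, 4.0]
corrupted_now = [2.0, 3.0, 9.0]

prior = predict_prior(past)
restored = fuse(prior, corrupted_now)
```

In this toy setting the uncorrupted elements pass through unchanged, while the corrupted one is drawn back toward the predicted value. A learned predictor and reconstructor, as the figure captions describe, would take the place of both hand-written functions.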