Predicting Safety Misbehaviours in Autonomous Driving Systems using Uncertainty Quantification

Ruben Grewal; Paolo Tonella; Andrea Stocco

TL;DR

The paper tackles the challenge of predicting safety misbehaviours in end-to-end lane-keeping autonomous driving systems by leveraging uncertainty quantification (UQ). It empirically compares Monte Carlo Dropout and Deep Ensembles, showing that Deep Ensembles deliver higher detection accuracy ($F_{3}$ up to 0.94) with manageable latency, while MC-Dropout remains competitive in resource-constrained settings. Across three benchmarks (OODextreme, OODmoderate, and Mutants) the UQ-based approach outperforms black-box SelfOracle and XAI-based ThirdEye in both effectiveness and efficiency, often predicting failures several seconds in advance with few or no false alarms. The study demonstrates that integrating uncertainty-based monitors into ADS testing can enable reliable, real-time fail-safe mechanisms and informs design choices for runtime safety supervision in deep neural network–driven autonomous vehicles.
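
The $F_{3}$ score quoted above is not defined on this page. Assuming it is the standard $F_{\beta}$ measure with $\beta = 3$, it combines precision $P$ and recall $R$ so that recall weighs more heavily than precision. In a failure-prediction setting that means a missed failure costs more than a false alarm. The symbols $P$ and $R$ are introduced here for illustration only:

```latex
F_{\beta} = \frac{(1 + \beta^{2})\, P\, R}{\beta^{2} P + R},
\qquad
F_{3} = \frac{10\, P\, R}{9\, P + R}
```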

Abstract

The automated real-time recognition of unexpected situations plays a crucial role in the safety of autonomous vehicles, especially in unsupported and unpredictable scenarios. This paper evaluates different Bayesian uncertainty quantification methods from the deep learning domain for the anticipatory testing of safety-critical misbehaviours during system-level simulation-based testing. Specifically, we compute uncertainty scores as the vehicle executes, following the intuition that high uncertainty scores are indicative of unsupported runtime conditions that can be used to distinguish safe from failure-inducing driving behaviours. In our study, we evaluate the effectiveness and computational overhead of two Bayesian uncertainty quantification methods, namely MC-Dropout and Deep Ensembles, for misbehaviour avoidance. Overall, for three benchmarks from the Udacity simulator comprising both out-of-distribution and unsafe conditions introduced via mutation testing, both methods successfully detected a high number of out-of-bounds episodes, providing early warnings several seconds in advance and outperforming two state-of-the-art misbehaviour prediction methods based on autoencoders and attention maps in terms of effectiveness and efficiency. Notably, Deep Ensembles detected most misbehaviours without any false alarms and did so even when employing a relatively small number of models, making them computationally feasible for real-time detection. Our findings suggest that incorporating uncertainty quantification methods is a viable approach for building fail-safe mechanisms in deep neural network-based autonomous vehicles.
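
The monitoring idea in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation. It treats each frame's uncertainty as the variance of the steering predictions across stochastic forward passes (MC-Dropout) or ensemble members (Deep Ensembles), and raises an alarm when that score exceeds a threshold calibrated on nominal driving. The function names, the synthetic data, and the empirical-quantile threshold are all hypothetical choices made for this sketch.

```python
import numpy as np


def uncertainty_scores(predictions: np.ndarray) -> np.ndarray:
    """Per-frame uncertainty as the variance across model samples.

    predictions has shape (n_frames, n_samples). Each column holds the
    steering angles from one stochastic forward pass (MC-Dropout) or
    from one ensemble member (Deep Ensembles).
    """
    return predictions.var(axis=1)


def calibrate_threshold(nominal_scores: np.ndarray, quantile: float = 0.95) -> float:
    """Pick a threshold from scores observed under nominal conditions.

    Assumption: an empirical quantile stands in for whatever
    calibration procedure the paper actually uses.
    """
    return float(np.quantile(nominal_scores, quantile))


def raise_alarms(scores: np.ndarray, threshold: float) -> np.ndarray:
    """Flag frames whose uncertainty exceeds the calibrated threshold."""
    return scores > threshold


# Synthetic stand-in data (hypothetical): under nominal conditions the
# samples agree closely; under unsupported conditions they disagree.
rng = np.random.default_rng(0)
nominal = rng.normal(loc=0.0, scale=0.01, size=(200, 5))
unsupported = rng.normal(loc=0.0, scale=0.30, size=(20, 5))

nominal_scores = uncertainty_scores(nominal)
threshold = calibrate_threshold(nominal_scores, quantile=0.95)
alarms = raise_alarms(uncertainty_scores(unsupported), threshold)
false_alarm_rate = float(raise_alarms(nominal_scores, threshold).mean())
```

The quantile sets the trade-off between early warnings and false alarms: a higher value silences more nominal frames but may delay or miss detections.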

Paper Structure (37 sections, 2 equations, 5 figures, 1 table)

Introduction
Background
Lane-keeping ADS
Failure Conditions for Lane-keeping ADS
Existing Unsupervised Failure Predictions Methods
Deep Neural Networks Uncertainty Quantification Methods
Monte Carlo Dropout
Deep Ensembles
Implementation
Empirical Evaluation
Research Questions
Experimental Setup
ADS Under Test
Driving Simulator
Benchmark
...and 22 more sections

Figures (5)

Figure 1: Examples of operational conditions [2022-Stocco-ASE]. Left: nominal (sunny). Center: OOD (night+snow). Right: OOD (snow).
Figure 2: Distribution approximated through MC-Dropout.
Figure 3: Distribution approximated through Deep Ensembles.
Figure 4: RQ1: $F_{3}$ scores for the best failure predictors across all confidence levels.
Figure 5: RQ4: Computational overhead (ms/iteration).
