Predicting Safety Misbehaviours in Autonomous Driving Systems using Uncertainty Quantification
Ruben Grewal, Paolo Tonella, Andrea Stocco
TL;DR
The paper tackles the challenge of predicting safety misbehaviours in end-to-end lane-keeping autonomous driving systems (ADS) by leveraging uncertainty quantification (UQ). It empirically compares Monte Carlo Dropout (MC-Dropout) and Deep Ensembles, showing that Deep Ensembles deliver higher detection accuracy (F$_3$ up to $0.94$) with manageable latency, while MC-Dropout remains competitive in resource-constrained settings. Across three benchmarks (OODextreme, OODmoderate, and Mutants), the UQ-based approach outperforms the black-box SelfOracle and the XAI-based ThirdEye in both effectiveness and efficiency, often predicting failures several seconds in advance with few or no false alarms. The study demonstrates that integrating uncertainty-based monitors into ADS testing can enable reliable, real-time fail-safe mechanisms, and it informs design choices for runtime safety supervision in deep neural network–driven autonomous vehicles.
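For reference, the F$_3$ score cited above is the F-beta measure with beta = 3, which weights recall (missed failures) more heavily than precision (false alarms). The snippet below is a minimal sketch of that standard formula; the helper name `f_beta` and the counts passed to it are illustrative assumptions, not results from the paper.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 3.0) -> float:
    """F-beta score computed from true positives, false positives, and false negatives.

    With beta > 1, false negatives (missed misbehaviours) are penalised
    more strongly than false positives (false alarms).
    """
    b2 = beta ** 2
    denominator = (1 + b2) * tp + b2 * fn + fp
    if denominator == 0:
        return 0.0  # no positives predicted or present; score is undefined, report 0
    return (1 + b2) * tp / denominator


# Hypothetical counts, chosen only to show the recall weighting.
balanced = f_beta(tp=9, fp=1, fn=1)      # precision = recall = 0.9
two_alarms = f_beta(tp=8, fp=2, fn=0)    # two false alarms, nothing missed
two_misses = f_beta(tp=8, fp=0, fn=2)    # no false alarms, two failures missed
```

With these counts, missing two failures lowers the score more than raising two false alarms, which is why a recall-weighted metric suits failure prediction.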
Abstract
The automated real-time recognition of unexpected situations plays a crucial role in the safety of autonomous vehicles, especially in unsupported and unpredictable scenarios. This paper evaluates different Bayesian uncertainty quantification methods from the deep learning domain for the anticipatory testing of safety-critical misbehaviours during system-level simulation-based testing. Specifically, we compute uncertainty scores as the vehicle executes, following the intuition that high uncertainty scores are indicative of unsupported runtime conditions and can be used to distinguish safe from failure-inducing driving behaviours. We evaluate the effectiveness and computational overhead of two Bayesian uncertainty quantification methods, namely MC-Dropout and Deep Ensembles, for misbehaviour avoidance. Overall, on three benchmarks from the Udacity simulator comprising both out-of-distribution and unsafe conditions introduced via mutation testing, both methods successfully detected a high number of out-of-bounds episodes, providing early warnings several seconds in advance and outperforming two state-of-the-art misbehaviour prediction methods based on autoencoders and attention maps in terms of effectiveness and efficiency. Notably, Deep Ensembles detected most misbehaviours without any false alarms, and did so even with a relatively small number of models, making them computationally feasible for real-time detection. Our findings suggest that incorporating uncertainty quantification methods is a viable approach for building fail-safe mechanisms in deep neural network-based autonomous vehicles.
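The intuition of scoring uncertainty frame by frame can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it takes the variance across repeated steering predictions for the same frame (stochastic forward passes for MC-Dropout, or one prediction per member for Deep Ensembles) as the uncertainty score and compares it against a fixed threshold. The function names, the toy predictions, and the threshold value are all hypothetical.

```python
import numpy as np


def uncertainty_scores(predictions: np.ndarray) -> np.ndarray:
    """Per-frame uncertainty as the variance across repeated predictions.

    predictions has shape (n_predictors, n_frames): one row per stochastic
    pass or ensemble member, one column per frame.
    """
    return predictions.var(axis=0)


def raise_alarms(scores: np.ndarray, threshold: float) -> np.ndarray:
    """Flag frames whose uncertainty exceeds the (assumed) threshold."""
    return scores > threshold


# Toy steering angles from three predictors over five frames (made-up values).
# The predictors agree on the first three frames and disagree on the last two.
predictions = np.array([
    [0.10, 0.11, 0.12, 0.40, -0.30],
    [0.10, 0.12, 0.11, -0.20, 0.50],
    [0.11, 0.11, 0.12, 0.05, 0.10],
])

scores = uncertainty_scores(predictions)
alarms = raise_alarms(scores, threshold=0.01)  # threshold is an assumption
```

In this toy run only the last two frames are flagged, because that is where the predictors disagree. In practice the threshold would be calibrated on nominal driving data rather than fixed by hand.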
