Simulator Ensembles for Trustworthy Autonomous Driving Testing
Lev Sorokin, Matteo Biagiola, Andrea Stocco
TL;DR
This work tackles the reliability challenges of simulation-based ADAS testing by introducing MultiSim, a cross-simulator, search-based testing framework that treats simulator disagreement as a first-class signal. By evaluating each test scenario on multiple simulators within a single optimization, MultiSim identifies simulator-agnostic (valid) failures and reduces the impact of simulator-specific flakiness. The approach combines a model-based road representation using Catmull–Rom interpolation, a multi-objective fitness vector spanning the simulators, and a surrogate disagreement predictor that prunes low-value evaluations. It achieves substantial gains in simulator-agnostic failures (about 70% validity on average) and efficiency improvements of up to 38% when disagreements are predicted. Results across three lane-keeping ADAS and three simulators show that the best simulator pairing depends on the system under test (SUT), with BD (BeamNG–Donkey) and BU (BeamNG–Udacity) often delivering the strongest performance. The method also generalizes beyond lane-keeping toward broader ADS testing challenges.
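To make the road representation mentioned above more concrete, the following is a minimal sketch of uniform Catmull–Rom interpolation over 2D control points. It is an illustration under stated assumptions, not the MultiSim implementation: the function names (`catmull_rom_point`, `interpolate_road`), the `samples_per_segment` parameter, and the example control points are all hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def catmull_rom_point(p0: Point, p1: Point, p2: Point, p3: Point, t: float) -> Point:
    """Evaluate a uniform Catmull-Rom segment between p1 and p2 at t in [0, 1]."""
    t2, t3 = t * t, t * t * t

    def coord(a: float, b: float, c: float, d: float) -> float:
        # Standard uniform Catmull-Rom basis (tension 0.5).
        return 0.5 * (
            2.0 * b
            + (-a + c) * t
            + (2.0 * a - 5.0 * b + 4.0 * c - d) * t2
            + (-a + 3.0 * b - 3.0 * c + d) * t3
        )

    return (coord(p0[0], p1[0], p2[0], p3[0]), coord(p0[1], p1[1], p2[1], p3[1]))


def interpolate_road(control_points: Sequence[Point], samples_per_segment: int = 10) -> List[Point]:
    """Turn sparse control points into a dense road centerline (hypothetical helper).

    The curve passes through every control point except the first and the last,
    which only shape the tangents at the ends.
    """
    if len(control_points) < 4:
        raise ValueError("Catmull-Rom interpolation needs at least 4 control points")
    if samples_per_segment < 1:
        raise ValueError("samples_per_segment must be at least 1")

    road: List[Point] = []
    for i in range(len(control_points) - 3):
        p0, p1, p2, p3 = control_points[i : i + 4]
        for s in range(samples_per_segment):
            road.append(catmull_rom_point(p0, p1, p2, p3, s / samples_per_segment))
    # Close the curve at the last interpolated control point.
    last = control_points[-2]
    road.append((float(last[0]), float(last[1])))
    return road


if __name__ == "__main__":
    # Assumed example: five control points describing a gentle S-bend.
    points = [(0.0, 0.0), (10.0, 0.0), (20.0, 5.0), (30.0, 5.0), (40.0, 0.0)]
    centerline = interpolate_road(points, samples_per_segment=5)
    print(len(centerline), centerline[0], centerline[-1])
```

A search algorithm can then mutate only the few control points while every simulator receives the same dense centerline, which is one plausible reason to prefer such a compact representation.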
Abstract
Scenario-based testing with driving simulators is extensively used to identify failing conditions of automated driving assistance systems (ADAS). However, existing studies have shown that repeated test execution, both in the same simulator and across distinct simulators, can yield different outcomes, which can be attributed to flakiness or to differences in the implementation of the physics. In this paper, we present MultiSim, a novel search-based approach to multi-simulation ADAS testing that leverages an ensemble of simulators to identify failure-inducing, simulator-agnostic test scenarios. During the search, each scenario is evaluated jointly on multiple simulators. Scenarios that produce consistent results across simulators are prioritized for further exploration, while those that fail on only a subset of simulators are given lower priority, as they may reflect simulator-specific issues rather than generalizable failures. Our empirical study, which involves testing three lane-keeping ADAS on different pairs of three widely used simulators, demonstrates that MultiSim outperforms single-simulator testing by achieving, on average, a 66% higher rate of simulator-agnostic failures. Compared to a state-of-the-art multi-simulator approach that combines the outcomes of independent test generation campaigns run in different simulators, MultiSim identifies, on average, up to 3.4× more simulator-agnostic failing tests and achieves higher failure rates. To avoid the costly execution of test inputs on which simulators disagree, we propose to predict simulator disagreements and bypass the corresponding test executions. Our results show that using a surrogate model during the search retains the average number of valid failures while also improving efficiency. Our findings indicate that combining an ensemble of simulators is a promising approach for automated cross-replication in ADAS testing.
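The joint evaluation described in the abstract can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's algorithm: the simulators are hypothetical stub functions returning a single fitness value (for example, a maximum lateral deviation), the failure threshold of 2.0 is an assumed value, and the ranking rule in `rank_for_exploration` is only one plausible way to give disagreeing scenarios lower priority.

```python
from typing import Callable, Dict, List, Sequence

Scenario = Sequence[float]
# Assumption: a simulator maps a scenario to one fitness value, higher = closer to failure.
Simulator = Callable[[Scenario], float]


def evaluate_jointly(scenario: Scenario, simulators: Dict[str, Simulator]) -> List[float]:
    """Run one scenario on every simulator and return the fitness vector (sorted by name)."""
    return [simulators[name](scenario) for name in sorted(simulators)]


def classify(fitness: Sequence[float], failure_threshold: float) -> str:
    """Label a fitness vector by whether the simulators agree on the outcome."""
    failed = [value >= failure_threshold for value in fitness]
    if all(failed):
        return "simulator-agnostic failure"
    if not any(failed):
        return "agreed pass"
    return "disagreement"


def rank_for_exploration(
    scenarios: Sequence[Scenario],
    simulators: Dict[str, Simulator],
    failure_threshold: float,
) -> List[Scenario]:
    """Order scenarios so consistent ones come first (assumed prioritization rule)."""

    def key(scenario: Scenario):
        fitness = evaluate_jointly(scenario, simulators)
        disagrees = classify(fitness, failure_threshold) == "disagreement"
        # Among equally consistent scenarios, prefer those whose weakest
        # simulator result is still closest to a failure.
        return (disagrees, -min(fitness))

    return sorted(scenarios, key=key)


# Hypothetical stand-ins for real simulators; they only scale the sharpest curvature.
def stub_sim_a(scenario: Scenario) -> float:
    return 10.0 * max(abs(c) for c in scenario)


def stub_sim_b(scenario: Scenario) -> float:
    return 8.0 * max(abs(c) for c in scenario)


SIMULATORS: Dict[str, Simulator] = {"sim_a": stub_sim_a, "sim_b": stub_sim_b}

if __name__ == "__main__":
    candidates = [[0.22, 0.1], [0.3, -0.1], [0.1, 0.05]]
    for scenario in rank_for_exploration(candidates, SIMULATORS, failure_threshold=2.0):
        label = classify(evaluate_jointly(scenario, SIMULATORS), 2.0)
        print(scenario, label)
```

In a real campaign each call to a simulator is expensive, so the fitness vectors would be cached rather than recomputed during ranking, and a disagreement predictor could skip the calls entirely for scenarios expected to land in the "disagreement" class.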
