Simulator Ensembles for Trustworthy Autonomous Driving Testing

Lev Sorokin, Matteo Biagiola, Andrea Stocco

TL;DR

This work tackles the reliability challenges of simulation-based ADAS testing by introducing MultiSim, a cross-simulator, search-based testing framework that treats simulator disagreement as a first-class signal. By evaluating test scenarios across multiple simulators in a unified optimization, MultiSim identifies simulator-agnostic (valid) failures and reduces simulator-specific flakiness. The approach uses a model-based road representation with Catmull–Rom interpolation, a multi-objective fitness vector across simulators, and a disagreement-predictor surrogate to prune low-value evaluations, achieving substantial gains in simulator-agnostic failures (about 70% validity on average) and up to 38% improvements in efficiency when predicting disagreements. The results across three lane-keeping ADAS and three simulators show that optimal simulator pairings depend on the SUT, with BD (BeamNG–Donkey) and BU (BeamNG–Udacity) often delivering the strongest performance, and the method generalizes beyond lane-keeping toward broader ADS testing challenges.
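To make the road representation more concrete, here is a minimal sketch of uniform Catmull–Rom interpolation over 2D control points. The function names, the `samples_per_segment` parameter, and the choice of the uniform variant are assumptions made for illustration; the summary above does not state which parameterization MultiSim uses.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def catmull_rom_point(p0: Point, p1: Point, p2: Point, p3: Point, t: float) -> Point:
    """Uniform Catmull-Rom position between p1 and p2 for t in [0, 1]."""
    t2, t3 = t * t, t * t * t

    def axis(a: float, b: float, c: float, d: float) -> float:
        # Standard uniform Catmull-Rom basis applied to one coordinate.
        return 0.5 * (
            2 * b
            + (-a + c) * t
            + (2 * a - 5 * b + 4 * c - d) * t2
            + (-a + 3 * b - 3 * c + d) * t3
        )

    return (axis(p0[0], p1[0], p2[0], p3[0]), axis(p0[1], p1[1], p2[1], p3[1]))


def interpolate_road(control_points: List[Point], samples_per_segment: int = 10) -> List[Point]:
    """Sample a smooth centerline through the interior control points."""
    if len(control_points) < 4:
        raise ValueError("need at least four control points")
    road: List[Point] = []
    for i in range(len(control_points) - 3):
        p0, p1, p2, p3 = control_points[i:i + 4]
        for s in range(samples_per_segment):
            road.append(catmull_rom_point(p0, p1, p2, p3, s / samples_per_segment))
    # Close the curve at the last interpolated control point.
    road.append(control_points[-2])
    return road
```

The curve passes through every control point except the first and last, which only shape the tangents at the ends. That is why a road described by a handful of points still yields a smooth centerline.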

Abstract

Scenario-based testing with driving simulators is extensively used to identify failing conditions of automated driving assistance systems (ADAS). However, existing studies have shown that repeated test execution in the same simulator, as well as in distinct simulators, can yield different outcomes, which can be attributed to sources of flakiness or to different implementations of the physics. In this paper, we present MultiSim, a novel search-based approach to multi-simulation ADAS testing that leverages an ensemble of simulators to identify failure-inducing, simulator-agnostic test scenarios. During the search, each scenario is evaluated jointly on multiple simulators. Scenarios that produce consistent results across simulators are prioritized for further exploration, while those that fail on only a subset of simulators are given lower priority, as they may reflect simulator-specific issues rather than generalizable failures. Our empirical study, which involves testing three lane-keeping ADAS on different pairs of three widely used simulators, demonstrates that MultiSim outperforms single-simulator testing by achieving, on average, a 66% higher rate of simulator-agnostic failures. Compared to a state-of-the-art multi-simulator approach that combines the outcomes of independent test generation campaigns run in different simulators, MultiSim identifies, on average, up to 3.4X more simulator-agnostic failing tests and achieves higher failure rates. To avoid the costly execution of test inputs on which simulators disagree, we propose to predict simulator disagreements and bypass the corresponding test executions. Our results show that using a surrogate model during the search retains the average number of valid failures while improving efficiency. Our findings indicate that combining an ensemble of simulators is a promising approach for automated cross-replication in ADAS testing.
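The prioritization idea in the abstract can be illustrated with a deliberately simplified sketch. It ranks scenarios by boolean pass/fail verdicts per simulator, whereas the TL;DR describes a multi-objective fitness vector, so this is not the paper's fitness function. The function name `rank_scenarios`, the simulator labels, and the tie-break that places consistent failures ahead of consistent passes are all assumptions.

```python
from typing import Dict, List


def rank_scenarios(outcomes: Dict[str, Dict[str, bool]]) -> List[str]:
    """Order scenario ids so that simulator-consistent ones come first.

    outcomes maps scenario id -> {simulator name: True if the test failed}.
    """

    def key(scenario: str):
        verdicts = list(outcomes[scenario].values())
        consistent = len(set(verdicts)) == 1  # all simulators agree
        fails_everywhere = all(verdicts)      # assumed tie-break among agreeing scenarios
        return (consistent, fails_everywhere)

    return sorted(outcomes, key=key, reverse=True)


if __name__ == "__main__":
    example = {
        "a": {"BeamNG": True, "Udacity": False},   # simulators disagree
        "b": {"BeamNG": True, "Udacity": True},    # fails on both
        "c": {"BeamNG": False, "Udacity": False},  # passes on both
    }
    print(rank_scenarios(example))
```

Scenario `a`, which fails on only one simulator, ends up last, mirroring the lower priority the abstract assigns to possibly simulator-specific failures.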

Paper Structure

This paper contains 50 sections, 3 equations, 8 figures, 6 tables, 2 algorithms.

Figures (8)

  • Figure 1: Difference in executing a lane-keeping ADAS on the same road in three different simulators, along with their rendering. In Udacity, the trajectory of the vehicle (visualized in green, starting with a triangle) stays within the lane's bounds, while in Donkey and BeamNG the vehicle departs from the lane.
  • Figure 2: Illustration of crossover (a) and mutation (b) of roads in MultiSim, including the control points placed within the road (visualized as stars). For the sake of simplicity, only angles are modified. In (a), the tails of the roads after the third segment are exchanged. In (b), the angle of the last segment is increased by 10 degrees.
  • Figure 3: Example of a feature map with two features, namely number of turns (i.e., turn_count) and curvature. The color of a cell is determined by the worst XTE value of the test inputs stored in that cell (colorbar on the right-hand side of the map). A cell is green when the XTE value is 0 and red when it reaches -3, the largest XTE in absolute value. A cell is white if no test input covers it. The fitness is negative because the fitness function is minimized. In total, 412 tests are stored in the feature map, which has 13 failing cells and 35 non-failing cells.
  • Figure 4: Validation of failure-inducing test inputs, i.e., $\mathbf{x}_1$, $\mathbf{x}_2$, $\mathbf{x}_3$. Each test is re-executed in multiple simulators (two in this case, $S_1$ and $S_2$), and the failure rate of each test is computed. The failure rates on $S_1$ and $S_2$ are then compared, and the test inputs are filtered according to a failure rate threshold (in this case 100%). In this example, the only simulator-agnostic/valid failure-inducing test input is $\mathbf{x}_3$.
  • Figure 5: Validity rate (valid_rate) and number of valid failures (n_valid) identified for MultiSim, DSS, and SingleSim averaged for DAVE-2, ViT, and TCP. The average validity rate is shown in each bar plot.
  • ...and 3 more figures
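
The validation step described in the Figure 4 caption can be sketched as follows: re-execute each failing test several times per simulator, compute a failure rate, and keep only the tests whose rate meets the threshold on every simulator. The `run_test` callback, the function names, and the default of five repetitions are hypothetical; only the 100% threshold comes from the caption.

```python
from typing import Callable, List

# Hypothetical callback: returns True if the test fails in the given simulator.
RunTest = Callable[[str, str], bool]


def failure_rate(run_test: RunTest, test_id: str, simulator: str, repetitions: int) -> float:
    """Fraction of repeated executions in which the test fails."""
    failures = sum(1 for _ in range(repetitions) if run_test(test_id, simulator))
    return failures / repetitions


def valid_failures(
    test_ids: List[str],
    simulators: List[str],
    run_test: RunTest,
    repetitions: int = 5,    # assumed value, not taken from the paper
    threshold: float = 1.0,  # 100%, as in the Figure 4 example
) -> List[str]:
    """Keep the tests whose failure rate reaches the threshold on every simulator."""
    valid = []
    for test_id in test_ids:
        rates = [failure_rate(run_test, test_id, sim, repetitions) for sim in simulators]
        if all(rate >= threshold for rate in rates):
            valid.append(test_id)
    return valid


if __name__ == "__main__":
    # Deterministic stand-in for real simulator runs.
    verdicts = {
        ("x1", "S1"): True, ("x1", "S2"): False,
        ("x2", "S1"): False, ("x2", "S2"): True,
        ("x3", "S1"): True, ("x3", "S2"): True,
    }
    print(valid_failures(["x1", "x2", "x3"], ["S1", "S2"], lambda t, s: verdicts[(t, s)]))
```

With the stand-in verdicts above, only `x3` fails on both simulators, matching the outcome described in the caption. Lowering `threshold` would tolerate flaky executions at the cost of admitting less reliable failures.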
