Automated Test Validators for Flaky Cyber-Physical System Simulators: Approach and Evaluation
Baharin A. Jodat, Khouloud Gaaloul, Mehrdad Sabetzadeh, Shiva Nejati
TL;DR
The paper tackles the cost and unreliability of simulation-based CPS testing by introducing assertion-based test validators that filter out, before execution, test inputs unlikely to meaningfully exercise the SUT. It presents GenTV, a data-driven pipeline that generates validators either via genetic programming with SBFL-based fitness functions (notably Ochiai) or via interpretable ML models (decision trees and decision rules), followed by a pruning step that ensures verdict consistency. The approach is extended to signal-based CPS through a formal logic, L, that can express common CPS properties and is shown to align with a large portion of industrial requirements. Empirical results across case studies in aerospace, networking, and autonomous driving show high accuracy and robustness to simulator flakiness, with strong alignment to preconditions and ODD limits, suggesting substantial execution-time savings in practice.
Abstract
Simulation-based testing of cyber-physical systems (CPS) is costly due to the time-consuming execution of CPS simulators. In addition, CPS simulators may be flaky, leading to inconsistent test outcomes and requiring tests to be re-executed repeatedly to obtain reliable verdicts. Many test inputs within the input space of a CPS may not effectively exercise the behaviour of the system under test (SUT), for instance, those that violate system preconditions, exceed operational design domain (ODD) limits, or represent inherently safe scenarios. In this article, we propose to use test validators to filter out such test inputs before execution. We describe two methods for generating test validators: one using genetic programming (GP) that employs well-known spectrum-based fault localization (SBFL) ranking formulas, namely Ochiai, Tarantula, and Naish, as fitness functions; and the other using decision trees (DT) and decision rules (DR). We evaluate our test validators through case studies in the domains of aerospace, networking, and autonomous driving. We show that test validators generated using GP with Ochiai are significantly more accurate than those generated using GP with Tarantula and Naish or using DT or DR. Moreover, this accuracy advantage remains even when accounting for the flakiness of the simulator. We further show that our test validators generated by GP with Ochiai are robust against flakiness, with only 4% average variation in their accuracy results across four different network and autonomous-driving systems with flaky behaviours. Finally, we show that, on average, 88.7% of the assertions inferred by our approach align or overlap with violations of requirement preconditions, ODD-limit violations, and nominal safe conditions extracted from technical standards and empirical results in the literature.
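To make the role of the SBFL ranking formulas more concrete, the following minimal Python sketch scores a single candidate assertion against a set of labelled test outcomes. It is an illustration under stated assumptions, not the GenTV implementation: the mapping from assertions to spectrum counts (an assertion "covers" a test when it holds on that test's input), the use of the Naish variant often called Op2, and all names (`spectrum`, `ochiai`, `tarantula`, `naish2`, the `speed` input) are hypothetical choices made for this example.

```python
import math

# Spectrum counts for one candidate assertion (assumed mapping, see lead-in):
#   ef  = failing tests on which the assertion holds
#   ep  = passing tests on which the assertion holds
#   nf  = failing tests on which the assertion does not hold
#   np_ = passing tests on which the assertion does not hold

def spectrum(assertion, tests):
    """Count (ef, ep, nf, np_) for an assertion over (input, failed) pairs."""
    ef = ep = nf = np_ = 0
    for test_input, failed in tests:
        holds = assertion(test_input)
        if holds and failed:
            ef += 1
        elif holds and not failed:
            ep += 1
        elif not holds and failed:
            nf += 1
        else:
            np_ += 1
    return ef, ep, nf, np_

def ochiai(ef, ep, nf, np_):
    """Ochiai: ef / sqrt((ef + nf) * (ef + ep)); 0 if the denominator is 0."""
    denom = math.sqrt((ef + nf) * (ef + ep))
    return ef / denom if denom else 0.0

def tarantula(ef, ep, nf, np_):
    """Tarantula: failing ratio over (failing ratio + passing ratio)."""
    fail_ratio = ef / (ef + nf) if (ef + nf) else 0.0
    pass_ratio = ep / (ep + np_) if (ep + np_) else 0.0
    total = fail_ratio + pass_ratio
    return fail_ratio / total if total else 0.0

def naish2(ef, ep, nf, np_):
    """Naish's Op2 variant (assumed here): ef - ep / (ep + np_ + 1)."""
    return ef - ep / (ep + np_ + 1)

# Hypothetical labelled data: each entry is (test input, did the test fail?).
tests = [
    ({"speed": 40}, True),
    ({"speed": 50}, True),
    ({"speed": 35}, False),
    ({"speed": 10}, False),
    ({"speed": 20}, False),
    ({"speed": 25}, True),
]

# Hypothetical candidate assertion over the test input.
candidate = lambda x: x["speed"] > 30

counts = spectrum(candidate, tests)
scores = {
    "ochiai": ochiai(*counts),
    "tarantula": tarantula(*counts),
    "naish2": naish2(*counts),
}
print(counts, scores)
```

In a GP loop, a score of this kind would serve as the fitness of each candidate assertion, so that assertions which hold mostly on failing tests are favoured. The three formulas can rank the same candidates differently, which is one plausible reason the choice of formula affects validator accuracy.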
