How does Simulation-based Testing for Self-driving Cars match Human Perception?
Christian Birchler, Tanzil Kombarabettu Mohammed, Pooja Rani, Teodora Nechita, Timo Kehrer, Sebastiano Panichella
TL;DR
The paper addresses a gap in self-driving car (SDC) testing: whether the widely used out-of-bound (OOB) safety metric aligns with human perceptions of safety and realism in simulated scenarios. It introduces SDC-Alabaster, a VR-enabled human-in-the-loop framework, and reports a study with 50 participants across two simulators (BeamNG.tech and CARLA) that examines how test complexity, interaction, and immersion influence judgments of safety and realism. The findings show that human safety perception diverges from the OOB verdict in more complex or interactive contexts, and that perceived realism is strongly affected by factors such as viewpoint and environment. Both results point to a need for richer, human-aligned safety metrics that also account for realism. These insights underscore the reality-gap problem in simulation-based SDC testing and offer a taxonomy of realism factors to guide future evaluation frameworks and practical testing pipelines.
Abstract
Software metrics such as coverage and mutation scores have been extensively explored for the automated quality assessment of test suites. While traditional tools rely on such quantifiable software metrics, the field of self-driving cars (SDCs) has primarily focused on simulation-based test case generation using quality metrics such as the out-of-bound (OOB) parameter to determine whether a test case fails or passes. However, it remains unclear to what extent this quality metric aligns with the human perception of the safety and realism of SDCs, which are critical aspects in assessing SDC behavior. To address this gap, we conducted an empirical study involving 50 participants to investigate the factors that determine how humans perceive SDC test cases as safe, unsafe, realistic, or unrealistic. To this end, we developed a framework leveraging virtual reality (VR) technologies, called SDC-Alabaster, to immerse the study participants in the virtual environment of SDC simulators. Our findings indicate that the human assessment of the safety and realism of failing and passing test cases can vary based on different factors, such as the test's complexity and the possibility of interacting with the SDC. For the assessment of realism in particular, the participants' age acts as a confounding factor that leads to different perceptions. This study highlights the need for more research on SDC simulation testing quality metrics and the importance of human perception in evaluating SDC behavior.
