Testing for Fault Diversity in Reinforcement Learning
Quentin Mazouni, Helge Spieker, Arnaud Gotlieb, Mathieu Acher
TL;DR
The paper addresses the problem of validating reinforcement learning policies by detecting and characterising faults, arguing that fault diversity provides more insight and trust than sheer fault counts. It reframes policy testing as a Quality Diversity (QD) optimisation task and compares MAP-Elites and Novelty Search to a state-of-the-art policy-testing framework (MDPFuzz) and Random Testing across Lunar Lander, Bipedal Walker, and Taxi. The authors show that QD-based testing can reveal a broader and more informative set of faults without increasing test budgets, though the effectiveness of Novelty Search can be unstable and highly dependent on the chosen behaviour space. They also demonstrate that the choice of behaviour space meaningfully affects fault discovery, with some spaces yielding more robust and diverse fault coverage than others. Overall, the work opens a new application area for QD in fault-detection testing for RL and provides guidance on when and how to use QD methods for diverse fault discovery.
Abstract
Reinforcement Learning is the premier technique to approach sequential decision problems, including complex tasks such as driving cars and landing spacecraft. Among software validation and verification practices, testing for functional fault detection is a convenient way to build trust in the learned decision model. While recent works seek to maximise the number of detected faults, none considers fault characterisation during the search in order to promote diversity. We argue that policy testing should not aim to find as many failures as possible (e.g., inputs that trigger similar car crashes) but rather to reveal faults in the model that are as informative and diverse as possible. In this paper, we explore the use of quality diversity optimisation to solve the problem of fault diversity in policy testing. Quality diversity (QD) optimisation is a type of evolutionary algorithm for solving hard combinatorial optimisation problems where high-quality, diverse solutions are sought. We define and address the underlying challenges of adapting QD optimisation to the testing of action policies. Furthermore, we compare classical QD optimisers to state-of-the-art frameworks dedicated to policy testing, in terms of both search efficiency and fault diversity. We show that QD optimisation, while conceptually simple and generally applicable, effectively finds more diverse faults in the decision model, and conclude that QD-based policy testing is a promising approach.
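To make the idea of QD-based policy testing more concrete, the following is a minimal, illustrative sketch of a MAP-Elites-style test loop. It is not the authors' implementation. The policy execution (`run_episode`), the two-dimensional behaviour descriptor, the fault oracle, and the use of negated reward as quality are all hypothetical stand-ins chosen for illustration, as are the parameter names and default values.

```python
import random


def run_episode(x):
    """Hypothetical stand-in for running the policy under test from input x.

    Returns (reward, behaviour descriptor, failed). All three are toy
    definitions, not taken from the paper.
    """
    a, b = x
    reward = 1.0 - abs(a - b)          # toy reward in [0, 1]
    behaviour = (a, (a + b) / 2.0)     # toy 2-D descriptor in [0, 1]^2
    failed = reward < 0.3              # toy fault oracle
    return reward, behaviour, failed


def cell_of(behaviour, bins):
    """Discretise a behaviour descriptor into a grid cell index."""
    return tuple(min(int(v * bins), bins - 1) for v in behaviour)


def mutate(x, rng, sigma=0.1):
    """Gaussian perturbation of an input, clipped to [0, 1]."""
    return tuple(min(1.0, max(0.0, v + rng.gauss(0.0, sigma))) for v in x)


def map_elites_test(budget=500, n_init=50, bins=10, seed=0):
    rng = random.Random(seed)
    archive = {}  # cell -> (quality, input): best input found per behaviour cell
    faults = {}   # cell -> first failing input observed in that cell
    for i in range(budget):
        if i < n_init:
            # Initial phase: sample test inputs uniformly at random.
            x = (rng.random(), rng.random())
        else:
            # Search phase: mutate a randomly selected elite.
            _, parent = rng.choice(list(archive.values()))
            x = mutate(parent, rng)
        reward, behaviour, failed = run_episode(x)
        cell = cell_of(behaviour, bins)
        quality = -reward  # assumption: lower reward means closer to a fault
        if cell not in archive or quality > archive[cell][0]:
            archive[cell] = (quality, x)
        if failed:
            faults.setdefault(cell, x)
    return archive, faults


if __name__ == "__main__":
    archive, faults = map_elites_test()
    print("behaviour cells covered:", len(archive))
    print("distinct faulty behaviour cells:", len(faults))
```

Under these assumptions, the number of distinct cells in `faults` serves as a simple proxy for fault diversity: two failing inputs that land in the same behaviour cell count once, whereas a fault-count metric would count them twice.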
