Testing for Fault Diversity in Reinforcement Learning

Quentin Mazouni; Helge Spieker; Arnaud Gotlieb; Mathieu Acher

TL;DR

The paper addresses the problem of validating reinforcement learning policies by detecting and characterising faults, arguing that fault diversity provides more insight and trust than sheer fault counts. It reframes policy testing as a Quality Diversity (QD) optimisation task and compares MAP-Elites and Novelty Search to a state-of-the-art policy-testing framework (MDPFuzz) and Random Testing across Lunar Lander, Bipedal Walker, and Taxi. The authors show that QD-based testing can reveal a broader and more informative set of faults without increasing test budgets, though the effectiveness of Novelty Search can be unstable and highly dependent on the chosen behaviour space. They also demonstrate that the choice of behaviour space meaningfully affects fault discovery, with some spaces yielding more robust and diverse fault coverage than others. Overall, the work opens a new application area for QD in fault-detection testing for RL and provides guidance on when and how to use QD methods for diverse fault discovery.

Abstract

Reinforcement Learning is the premier technique for sequential decision problems, including complex tasks such as driving cars and landing spacecraft. Among software validation and verification practices, testing for functional fault detection is a convenient way to build trust in the learned decision model. While recent works seek to maximise the number of detected faults, none characterise the faults during the search in order to find more diverse ones. We argue that policy testing should not aim to find as many failures as possible (e.g., inputs that trigger similar car crashes) but rather to reveal faults in the model that are as informative and diverse as possible. In this paper, we explore the use of quality diversity optimisation to solve the problem of fault diversity in policy testing. Quality diversity (QD) optimisation is a type of evolutionary algorithm for hard combinatorial optimisation problems where high-quality, diverse solutions are sought. We define and address the underlying challenges of adapting QD optimisation to the testing of action policies. Furthermore, we compare classical QD optimisers to state-of-the-art frameworks dedicated to policy testing, both in terms of search efficiency and fault diversity. We show that QD optimisation, while being conceptually simple and generally applicable, effectively finds more diverse faults in the decision model, and conclude that QD-based policy testing is a promising approach.
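To make the idea of QD-based policy testing concrete, here is a minimal MAP-Elites-style sketch. It is not the authors' implementation: the function names (`map_elites`, `toy_evaluate`, `toy_sample`, `toy_mutate`), the toy "policy", the fault condition, and the convention that quality grows as the test gets closer to a failure are all assumptions made for illustration. The archive keeps at most one test input per behaviour bin, so faults that land in different bins count as behaviourally different.

```python
import random


def map_elites(evaluate, sample_input, mutate, n_bins, budget, n_init, rng):
    """Illustrative MAP-Elites loop for test generation.

    evaluate(x) must return (quality, behaviour, is_fault), where behaviour
    is a tuple of floats in [0, 1]. The archive maps a bin index to the
    best (quality, input, is_fault) triple seen for that bin.
    """
    archive = {}
    for i in range(budget):
        if i < n_init or not archive:
            x = sample_input(rng)  # random initialisation phase
        else:
            parent = rng.choice(list(archive.values()))[1]
            x = mutate(parent, rng)  # vary an existing elite
        quality, behaviour, is_fault = evaluate(x)
        # Discretise the behaviour descriptor into a grid cell.
        key = tuple(min(int(b * n_bins), n_bins - 1) for b in behaviour)
        if key not in archive or quality > archive[key][0]:
            archive[key] = (quality, x, is_fault)
    return archive


# Hypothetical stand-ins for "run the policy from initial state x".
def toy_evaluate(x):
    margin = 1.5 - (x[0] + x[1])  # toy safety margin; fault when negative
    return -margin, (x[0], x[1]), margin < 0


def toy_sample(rng):
    return (rng.random(), rng.random())


def toy_mutate(x, rng):
    # Gaussian perturbation, clamped to the valid input range [0, 1].
    return tuple(min(1.0, max(0.0, v + rng.gauss(0.0, 0.1))) for v in x)


if __name__ == "__main__":
    rng = random.Random(0)
    archive = map_elites(toy_evaluate, toy_sample, toy_mutate,
                         n_bins=5, budget=300, n_init=50, rng=rng)
    faulty_bins = sum(1 for _, _, is_fault in archive.values() if is_fault)
    print(f"bins filled: {len(archive)}, faulty bins: {faulty_bins}")
```

In a real setting, `evaluate` would execute the policy under test in the environment and derive the behaviour descriptor from the resulting trajectory; the choice of that descriptor is exactly the behaviour-space parameter whose impact the paper studies.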

Paper Structure (41 sections, 1 equation, 4 figures, 1 algorithm)

Introduction
Related Work
Background
Reinforcement Learning for sequential decision-making
Quality Diversity
QD Optimisation for Policy Testing
Solution Behaviour
Solution Quality
Assumptions
QD-based Policy Testing
Experimental Evaluation
Research Questions
Experiments
Environments
Lunar Lander
...and 26 more sections

Figures (4)

Figure 1: Evolution of the number of fault-triggering solutions found by each framework evaluated. The lines show the median results over 10 executions, and the shaded areas span the first to third quartiles.
Figure 2: Evolution of the behaviour space coverage over time, measured as the number of behaviour niches (bins) illuminated during testing. In the second column, only bins filled by fault-triggering solutions are counted, i.e., faulty behaviours. The lines show the median results over 10 executions, and the shaded areas span the first to third quartiles.
Figure 3: Final state diversity, measured as the average distance to the 3 nearest neighbours. Since the scale depends on the observation space of each use-case, we report each method's performance relative to Random Testing. The lines show the median results over 10 executions.
Figure 4: Impact of the behaviour space parameter for the Bipedal Walker experiments. The four spaces are different pairs of hand-designed descriptors studied in prior work (DOI: 10.1145/3377929.3389921). Each column displays the results for one behaviour space. From top to bottom: number of faults, number of behaviours and faulty behaviours, final and failure state diversity (FS and FFS) relative to Random Testing. The results are the medians of 10 executions.
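The two diversity measures named in the captions of Figures 2 and 3 can be sketched as follows. This is an illustrative reading of the captions, not the authors' code: the function names, the normalisation of behaviours to [0, 1], the use of Euclidean distance, and the ratio used for "relative to Random Testing" are assumptions.

```python
import math


def coverage(behaviours, n_bins):
    """Number of distinct grid bins hit by behaviours (tuples in [0, 1])."""
    return len({
        tuple(min(int(b * n_bins), n_bins - 1) for b in behaviour)
        for behaviour in behaviours
    })


def knn_diversity(states, k=3):
    """Mean, over all states, of the average Euclidean distance
    to the k nearest other states."""
    if len(states) <= k:
        raise ValueError("need more than k states")
    total = 0.0
    for i, s in enumerate(states):
        dists = sorted(math.dist(s, t) for j, t in enumerate(states) if j != i)
        total += sum(dists[:k]) / k
    return total / len(states)


def relative_diversity(method_states, random_states, k=3):
    """Diversity of a method divided by that of Random Testing (assumed ratio)."""
    return knn_diversity(method_states, k) / knn_diversity(random_states, k)
```

Counting only the bins filled by fault-triggering solutions, as in the second column of Figure 2, amounts to calling `coverage` on the behaviours of the faulty tests alone.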
