Reward Augmentation in Reinforcement Learning for Testing Distributed Systems

Andrea Borgarelli; Constantin Enea; Rupak Majumdar; Srinidhi Nagendra

TL;DR

A randomized testing approach for distributed protocol implementations, based on reinforcement learning. Its reward structure ensures that new episodes reliably reach deep, interesting states even without execution caching, and it can significantly outperform baseline approaches in coverage and bug finding.

Abstract

Bugs in widely used distributed protocol implementations have been the source of many downtimes in popular internet services. We describe a randomized testing approach for distributed protocol implementations based on reinforcement learning. Since the natural reward structure is very sparse, the key to successful exploration in reinforcement learning is reward augmentation. We show two different techniques that build on one another. First, we provide a decaying exploration bonus based on the discovery of new states: the reward decays as the same state is visited multiple times. The exploration bonus captures the intuition from coverage-guided fuzzing of prioritizing new coverage points; in contrast to other schemes, we show that taking the maximum of the bonus and the Q-value leads to more effective exploration. Second, we provide waypoints to the algorithm as a sequence of predicates that capture interesting semantic scenarios. Waypoints exploit designer insight about the protocol and guide the exploration to "interesting" parts of the state space. Our reward structure ensures that new episodes can reliably get to deep interesting states even without execution caching. We have implemented our algorithm in Go. Our evaluation on three large benchmarks (RedisRaft, Etcd, and RSL) shows that our algorithm can significantly outperform baseline approaches in terms of coverage and bug finding.

Paper Structure (24 sections, 2 equations, 5 figures, 7 tables, 4 algorithms)

Introduction
Reinforcement Learning with Coverage Bonus and Waypoints
Background: Reinforcement Learning and Q-Learning
BonusMaxRL
WaypointRL
Intuition: Exploring a Cube world
Environments from distributed systems
Defining the MDP
General modelling guidelines
Environment parameters
Predicate sequences
Deriving target predicates
Specifying intermediate predicates
Evaluation
Test setup
...and 9 more sections

Figures (5)

Figure 1: Exploration of a $6\times10\times10\times6$ cube world, with a given episode budget, using different agents. We plot the heatmap of the top of each cube. The intensity is the sum of the visited cells along the depth of the cube, with the darkest color meaning all the cells have been visited. The plots illustrate three points. First, BonusMaxRL (b) achieves better exploration than Random (a), covering more cells. Second, unbiased exploration struggles to reach cubes away from the starting point (b). Third, choosing an appropriate state space abstraction can lead to better coverage, but it can reduce the ability to systematically explore a target subspace (c), while WaypointRL is able to effectively bias the exploration towards the target cube and almost fully cover it (d).
Figure 2: Detailed Exploration of cube 3 in the $6\times10\times10\times6$ cube world. Each grid represents a depth level of the cube. The colored cells have been explored by the agent. (a) BonusMaxRL using the depth abstraction (b) WaypointRL with reaching cube 1, 2, and 3 as waypoints. WaypointRL is able to explore almost all the cells of the cube.
Figure 3: Evolution of the transition system of a distributed system. (a) Fine state space. State is a map of local states and action is the set of messages to deliver and drop. (b) Coarse state space without symmetry reduction. (c) Coarse state space with symmetry reduction.
Figure 4: The pure coverage comparison between NegRLVisits, BonusMaxRL and Random for the different benchmarks. Each plot contains the average coverage vs. time steps.
Figure 5: Different versions of predicate sequences for the same target coverage. For the predicate sequences, the legend specifies the target predicate and, between square brackets, the number of predicates in the sequence excluding the first one (true predicate).
