Combining AI Control Systems and Human Decision Support via Robustness and Criticality

Walt Woods, Alexander Grushin, Simon Khan, Alvaro Velasquez

TL;DR

A methodology for Adversarial Explanations (AE) is extended to state-of-the-art reinforcement learning frameworks, including MuZero, and it is shown that the learned AI control system demonstrates robustness against adversarial tampering.

Abstract

AI-enabled capabilities are reaching the requisite level of maturity to be deployed in the real world, yet do not always make correct or safe decisions. One way of addressing these concerns is to leverage AI control systems alongside and in support of human decisions, relying on the AI control system in safe situations while calling on a human co-decider for critical situations. We extend a methodology for adversarial explanations (AE) to state-of-the-art reinforcement learning frameworks, including MuZero. Multiple improvements to the base agent architecture are proposed. We demonstrate how this technology has two applications: for intelligent decision tools and to enhance training / learning frameworks. In a decision support context, adversarial explanations help a user make the correct decision by highlighting those contextual factors that would need to change for a different AI-recommended decision. As another benefit of adversarial explanations, we show that the learned AI control system demonstrates robustness against adversarial tampering. Additionally, we supplement AE by introducing strategically similar autoencoders (SSAs) to help users identify and understand all salient factors being considered by the AI system. In a training / learning framework, this technology can improve both the AI's decisions and explanations through human interaction. Finally, to identify when AI decisions would most benefit from human oversight, we tie this combined system to our prior art on statistically verified analyses of the criticality of decisions at any point in time.

Paper Structure

This paper contains 30 sections, 1 equation, and 12 figures.

Figures (12)

  • Figure 1: Overview of our intelligent decision tool (IDT) effort. By combining base reinforcement learning (RL) improvements with the developments of adversarial explanations (AE), strategically similar autoencoders (SSAs), and safety margins, users are shown detailed visual descriptions of key decision components and are provided with tools to understand the long-term consequences of potential decisions. Since this information is guided by possible decisions as considered by the AI system, users have the opportunity to explore phenomena that would otherwise be unknown to them.
  • Figure 2: Adversarial explanations (AE) provide a methodology for hardening systems against subtle manipulations and then using the hardened systems to find accurate, nearby decision boundaries that are semantically meaningful to human observers. The leftmost frame represents the original, unperturbed input; the agent chooses to move to the right, in order to avoid the obstacle to its left. The middle frame shows a perturbation to the input, made such that the agent chooses to move to the left instead; here, the obstacle has shifted to the right, as illustrated via red circles. The rightmost frame shows a perturbation made in order for the agent to keep moving forward (rather than left or right); here, the obstacle disappears altogether. These perturbations suggest that the agent's decision is influenced directly by the obstacle.
  • Figure 3: Strategically similar autoencoders (SSAs) differ from standard autoencoders: rather than reconstructing the input, they reconstruct a family of inputs which the AI interprets as the same (that is, inputs that map to the same latent code). Through this process, data patterns that define the current strategy can be identified. These patterns are complementary to the patterns identified through AE. As in Figure 2, the leftmost frame represents the original input, while the middle and right frames represent two different reconstructions of the input. Note that the obstacle and two enemy ships persist in both reconstructions, suggesting that they are important to the agent.
  • Figure 4: Safety margins (shown in the bottom right corner of the figure) capture the relationship between the number of consecutive mistakes (more formally, random actions) that an agent might hypothetically make and the resulting expected loss of reward (note that higher reward losses appear lower along the vertical axis). To derive this relationship at some time $t$, we first run a simulation beginning at $t$, and measure the total discounted reward collected. We then repeat the simulation multiple times, but for some number of time steps $n$ beginning with time step $t$, replace the actions output by the agent's policy with randomly chosen actions, and measure the mean loss (reduction) in reward as a result of these random actions; this is the true criticality at time $t$. We also use a much faster, but typically much less accurate, approach called a proxy criticality metric to compute the proxy (approximate) criticality at time $t$ without running any simulations [lin2017]. This process is repeated offline for many episodes, at different time steps, in order to collect a dataset containing proxy and true criticality values. From this dataset, kernel density estimation is performed to capture the statistical relationship between proxy and true criticality for each value of $n$. Finally, the kernel density estimates are used to compute a table of safety margins. When the agent is deployed, only proxy criticality needs to be computed at any given time $t$; it is then used to efficiently look up the safety margins for $t$ within the table, without the need to run additional simulations.
  • Figure 5: Safety margins (color) given a specified acceptable loss in mission outcome (y-axis) and proxy criticality (x-axis).
  • ...and 7 more figures
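
The true-criticality estimate described in the Figure 4 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the toy environment (`toy_step`), the policy (`always_right`), the function names, and all parameter defaults are hypothetical and exist only to make the procedure concrete. The sketch covers only the simulation step (baseline rollout versus rollouts whose first $n$ actions are random); the proxy metric, kernel density estimation, and safety-margin table are not shown.

```python
import random
from typing import Callable, Sequence, Tuple

State = int
Action = int
StepFn = Callable[[State, Action], Tuple[State, float]]
Policy = Callable[[State], Action]


def rollout(step: StepFn, policy: Policy, state: State, horizon: int,
            gamma: float, n_random: int, actions: Sequence[Action],
            rng: random.Random) -> float:
    """Total discounted reward from `state`; the first `n_random`
    actions are drawn uniformly at random instead of from the policy."""
    total, discount = 0.0, 1.0
    for k in range(horizon):
        action = rng.choice(actions) if k < n_random else policy(state)
        state, reward = step(state, action)
        total += discount * reward
        discount *= gamma
    return total


def true_criticality(step: StepFn, policy: Policy, state: State, n: int,
                     horizon: int = 20, gamma: float = 0.99,
                     trials: int = 200, actions: Sequence[Action] = (0, 1),
                     seed: int = 0) -> float:
    """Mean loss in discounted reward caused by n random actions at `state`."""
    rng = random.Random(seed)
    # Baseline: follow the policy throughout (the toy setup is deterministic,
    # so a single run suffices here).
    baseline = rollout(step, policy, state, horizon, gamma, 0, actions, rng)
    # Perturbed runs: replace the first n actions with random ones.
    perturbed = [rollout(step, policy, state, horizon, gamma, n, actions, rng)
                 for _ in range(trials)]
    return baseline - sum(perturbed) / trials


# Hypothetical toy environment: moving right (action 1) earns reward 1,
# moving left (action 0) earns nothing.
def toy_step(state: State, action: Action) -> Tuple[State, float]:
    return state + (1 if action == 1 else -1), (1.0 if action == 1 else 0.0)


def always_right(state: State) -> Action:
    return 1


if __name__ == "__main__":
    for n in (0, 1, 3, 5):
        print(n, round(true_criticality(toy_step, always_right, 0, n), 3))
```

In this toy setting the estimated loss is zero for $n = 0$ and grows as more actions are randomized, which is the relationship the safety margins summarize. In a stochastic environment, the baseline would also need to be averaged over several runs.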