Approximating Shapley Explanations in Reinforcement Learning

Daniel Beechey, Özgür Şimşek

TL;DR

FastSVERL provides a scalable, model-based framework to approximate Shapley explanations in reinforcement learning by learning a Shapley predictor that estimates per-feature contributions across states and actions. It replaces exact, combinatorial Shapley computations with amortised, differentiable least-squares losses over sampled subsets and states, and enforces the efficiency constraint to recover true Shapley values in the limit. The approach handles temporal dependencies, off-policy data (through importance sampling), and continual learning (by updating explanations in tandem with policy updates), and it demonstrates convergence and scalability in domains such as Mastermind and Gridworld. A key extension replaces costly characteristic models with single-sample approximations, further boosting efficiency while retaining unbiased explanations. Overall, FastSVERL delivers principled, real-time interpretability for RL with practical applicability to broader learning settings.
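To make the cost that amortisation avoids concrete, the sketch below computes exact Shapley values by enumerating every feature subset, then applies an additive correction that forces a set of approximate values to satisfy the efficiency constraint. It is an illustration only, not the paper's implementation: `toy_value`, `exact_shapley`, and `enforce_efficiency` are hypothetical names, the three-feature game is invented for the example, and the additive correction is one common way to impose efficiency that the paper may or may not use.

```python
from itertools import combinations
from math import factorial


def exact_shapley(value, n):
    """Exact Shapley values for a set function `value` over features 0..n-1.

    Enumerates all subsets for each feature, so the cost grows
    exponentially in n.
    """
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def enforce_efficiency(phi, total):
    """Shift approximate values equally so that they sum to `total`."""
    gap = (total - sum(phi)) / len(phi)
    return [p + gap for p in phi]


def toy_value(coalition):
    """Hypothetical game: additive payoffs plus an interaction of 0 and 1."""
    base = {0: 1.0, 1: 2.0, 2: 0.5}
    v = sum(base[j] for j in coalition)
    if 0 in coalition and 1 in coalition:
        v += 1.0
    return v


n = 3
phi = exact_shapley(toy_value, n)
# Efficiency: the values sum to v(all features) - v(no features).
total = toy_value(frozenset(range(n))) - toy_value(frozenset())

# A deliberately imperfect estimate, corrected to satisfy efficiency.
corrected = enforce_efficiency([1.4, 2.4, 0.4], total)
```

In this toy game the interaction bonus is split equally between features 0 and 1, and the exact values sum to the difference between the full and empty coalitions. A learned predictor replaces the nested enumeration with a single forward pass per state, which is where the scalability comes from.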

Abstract

Reinforcement learning has achieved remarkable success in complex decision-making environments, yet its lack of transparency limits its deployment in practice, especially in safety-critical settings. Shapley values from cooperative game theory provide a principled framework for explaining reinforcement learning; however, the computational cost of Shapley explanations is an obstacle to their use. We introduce FastSVERL, a scalable method for explaining reinforcement learning by approximating Shapley values. FastSVERL is designed to handle the unique challenges of reinforcement learning, including temporal dependencies across multi-step trajectories, learning from off-policy data, and adapting to evolving agent behaviours in real time. FastSVERL introduces a practical, scalable approach for principled and rigorous interpretability in reinforcement learning.

Paper Structure

This paper contains 43 sections, 52 equations, 11 figures, 4 tables.

Figures (11)

  • Figure 1: How approximation accuracy improves with training updates in Mastermind-222. Shaded regions, which are negligible, indicate standard error over 20 runs. As we progress from plot (a) to (b) to (c), downstream models use exact or approximate upstream models from earlier plots.
  • Figure 2: Training updates (mean $\pm$ standard error over 20 runs) needed to reach a fixed target loss (0.01) when approximating behaviour explanations in Hypercube. Each subplot fixes the number of features $n$ (i.e. dimensions). Bar colour indicates cube side length $l$.
  • Figure 3: Approximation accuracy over training updates in Mastermind-222. Each line shows the mean squared error (MSE) between predicted and exact values, averaged over all states and features. Shaded regions indicate standard error over 20 runs, corrected for variance in agent training [Masson2003].
  • Figure 4: How approximation accuracy improves over training updates for FastSVERL's explanation models across Mastermind-222 (left), Mastermind-333 (middle), and Gridworld (right). Each line shows the mean squared error (MSE) between predicted and exact values, averaged over all states and features. Shaded regions, which are negligible, indicate standard error over 20 runs. Moving down the rows, downstream models (e.g. the behaviour Shapley model) use exact or approximate upstream models from earlier plots (e.g. the behaviour characteristic). Empty slots correspond to the outcome explanations, which cannot feasibly be computed exactly in Mastermind-333.
  • Figure 5: Training batches required to reach a fixed target loss ($0.01$) when approximating characteristic and Shapley values for behaviour and prediction in Hypercube; standard error over 20 runs. Each subplot fixes the number of features $n$ (i.e. dimensions). Bar colour indicates cube side length $l$.
  • ...and 6 more figures