From Explainability to Interpretability: Interpretable Policies in Reinforcement Learning Via Model Explanation

Peilang Li; Umer Siddique; Yongcan Cao

TL;DR

The paper tackles the challenge of interpreting deep reinforcement learning policies by bridging explainability and interpretability through a model-agnostic, Shapley-value-based framework. It introduces Shapley vectors to capture feature contributions, clusters states by action, identifies decision boundaries, and reconstructs an interpretable policy through boundary regression. The method applies to both off-policy and on-policy agents and is validated on CartPole and MountainCar with DQN, PPO, and A2C, where the interpretable policies perform comparably to the original models and behave more stably. The work offers a practical path toward trustworthy RL in high-stakes settings and outlines extensions to continuous actions and improved scalability.
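To make the notion of a Shapley vector concrete, the following is a minimal sketch of exact Shapley values for the features of a single state. It is not the paper's implementation: the function `shapley_vector`, the linear scoring function `toy_action_score` (a hypothetical stand-in for a trained policy's preference for one action), and the all-zeros baseline used to "remove" a feature are all assumptions chosen for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_vector(value_fn, state, baseline):
    """Exact Shapley value of each state feature for a scalar value_fn.

    Assumption: a feature that is absent from a coalition is replaced by
    its baseline value. Other removal schemes (e.g. averaging over a
    background dataset) are equally valid choices.
    """
    n = len(state)
    features = range(n)

    def coalition_value(coalition):
        # Keep the real value for features in the coalition, baseline otherwise.
        mixed = [state[i] if i in coalition else baseline[i] for i in features]
        return value_fn(mixed)

    phi = [0.0] * n
    for i in features:
        others = [j for j in features if j != i]
        for size in range(n):
            # Standard Shapley weight |S|! (n - |S| - 1)! / n!
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                without_i = coalition_value(set(subset))
                with_i = coalition_value(set(subset) | {i})
                phi[i] += weight * (with_i - without_i)
    return phi


def toy_action_score(s):
    # Hypothetical linear preference for one action over
    # CartPole-like features (x, x_dot, theta, theta_dot).
    x, x_dot, theta, theta_dot = s
    return 0.1 * x + 0.5 * x_dot + 2.0 * theta + 1.0 * theta_dot


state = [0.2, -0.4, 0.05, 0.3]
baseline = [0.0, 0.0, 0.0, 0.0]
phi = shapley_vector(toy_action_score, state, baseline)

# Efficiency property: contributions sum to f(state) - f(baseline).
total_gap = toy_action_score(state) - toy_action_score(baseline)
print(phi, sum(phi), total_gap)
```

Exact enumeration costs on the order of 2^n evaluations per feature, which is feasible for the four-dimensional CartPole state but would need sampling-based approximation for larger state spaces. For a linear score such as this one, each contribution reduces to the feature weight times the feature's offset from the baseline, which makes the output easy to check by hand.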

Abstract

Deep reinforcement learning (RL) has shown remarkable success in complex domains; however, the inherent black-box nature of deep neural network policies raises significant challenges in understanding and trusting their decision-making processes. While existing explainable RL methods provide local insights, they fail to deliver a global understanding of the model, which matters particularly in high-stakes applications. To overcome this limitation, we propose a novel model-agnostic approach that bridges the gap between explainability and interpretability by leveraging Shapley values to transform complex deep RL policies into transparent representations. The proposed approach offers two key contributions: a novel method that applies Shapley values to policy interpretation beyond local explanations, and a general framework applicable to both off-policy and on-policy algorithms. We evaluate our approach with three existing deep RL algorithms and validate its performance in two classic control environments. The results demonstrate that our approach not only preserves the original models' performance but also generates more stable interpretable policies.

Paper Structure (18 sections, 13 equations, 4 figures, 2 tables, 1 algorithm)

Introduction
Contributions.
Related Work
Background
Reinforcement Learning
Shapley Values in Reinforcement Learning
Method
Shapley Vectors Analysis
Action K-Means Clustering.
Boundary Point Identification.
Interpretable Policy Formulation
Inverse Shapley Values.
Decision Boundary Regression.
Experiments
CartPole
...and 3 more sections

Figures (4)

Figure 1: Visualization of Shapley values and interpretable policy formulation in the CartPole environment. The first row depicts the Shapley value vectors for DQN, PPO, and A2C, with clusters shown in different colors and boundary points highlighted in red. The second row illustrates the corresponding interpretable policy in the original state space, showing decision boundaries that separate the state space into distinct action regions. Because only three dimensions can be plotted, only the first three features $x, \dot{x}, \theta$ are visualized.
Figure 2: Performance of the interpretable policies and the original algorithms (DQN, PPO, A2C) in the CartPole environment.
Figure 3: Visualization of Shapley values and interpretable policy formulation in the MountainCar environment. The first row depicts the Shapley value vectors for DQN, PPO, and A2C, with clusters shown in different colors and boundary points highlighted in red. The second row illustrates the corresponding interpretable policy in the original state space, showing decision boundaries that separate the state space into distinct action regions.
Figure 4: Performance of the interpretable policies and the original algorithms (DQN, PPO, A2C) in the MountainCar environment.

Theorems & Definitions (1)

proof
