Co-Activation Graph Analysis of Safety-Verified and Explainable Deep Reinforcement Learning Policies
Dennis Gross, Helge Spieker
TL;DR
This paper tackles safety and interpretability in deep reinforcement learning by fusing policy model checking with co-activation graph analysis. It generates safety-labeled state datasets via probabilistic model checking and applies neuron-activation graphs to extract global-to-local explanations of safe decision-making, yielding semi-global insights. Key contributions include a practical methodology for labeling RL states using $PCTL$-based verification, and demonstrating how PageRank and Louvain community detection reveal influential neurons and modular structure tied to safety properties. The approach provides a scalable path to understand and trust RL policies in safety-critical contexts, with empirical validation in taxi and cleaning-robot domains. Future work points to extending this framework to multi-agent settings and safe neural pruning.
Abstract
Deep reinforcement learning (RL) policies can demonstrate unsafe behaviors and are challenging to interpret. To address these challenges, we combine RL policy model checking--a technique for determining whether RL policies exhibit unsafe behaviors--with co-activation graph analysis--a method that maps neural network inner workings by analyzing neuron activation patterns--to gain insight into the safe RL policy's sequential decision-making. This combination lets us interpret the RL policy's inner workings for safe decision-making. We demonstrate its applicability in various experiments.
