On Generating Explanations for Reinforcement Learning Policies: An Empirical Study
Mikihisa Yuasa, Huy T. Tran, Ramavarapu S. Sreenivas
TL;DR
This work addresses the explainability gap of reinforcement learning policies by introducing a class of linear temporal logic (LTL) explanations and a greedy local search that identifies the best explanation for a given target policy. Explanations take the form $φ = F(φ_F) ∧ G(φ_G)$, where $F$ ("eventually") captures the objective the policy ultimately achieves and $G$ ("globally") captures the conditions it maintains throughout execution. Each candidate is evaluated by translating it into an MDP augmented with a finite state predicate automaton (FSPA), optimizing a policy under the candidate's reward structure, and comparing that policy to the target policy via a weighted KL divergence $U^{φ}$. The approach is validated in three simulated domains, capture-the-flag (CtF), car parking, and robot navigation, using PPO and SAC+HER. The experiments show that the search can recover target explanations and propose plausible alternatives, while ablation tests reveal the importance of expansion/extension and weighting in avoiding local optima. The work discusses limitations such as computational complexity and dependence on predefined predicates, and outlines future directions including predicate automation, natural-language rendering, and scaling through neural LTL representations to improve practical applicability in safety-critical systems.
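The search loop summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the predicate names, the toy action distributions, the state weights, and the single-slot neighborhood are all hypothetical, and a lookup table (`optimized_policy`) stands in for the expensive step of training a policy on the automaton-augmented MDP. The sketch only shows the shape of the idea: score each candidate pair $(φ_F, φ_G)$ by a weighted KL divergence against the target policy, then greedily move to the best-scoring neighbor until no neighbor improves.

```python
import math

# Hypothetical predicate pools for the F ("eventually") and G ("globally") slots.
F_PREDICATES = ["a", "b", "c"]
G_PREDICATES = ["x", "y"]

# Assumption: invented action distributions standing in for the policy that
# RL would produce for each predicate. State "s0" depends only on the F slot,
# state "s1" only on the G slot.
F_DIST = {"a": [0.8, 0.2], "b": [0.5, 0.5], "c": [0.1, 0.9]}
G_DIST = {"x": [0.9, 0.1], "y": [0.3, 0.7]}


def optimized_policy(candidate):
    """Stand-in for optimizing a policy under the candidate's reward."""
    f_pred, g_pred = candidate
    return {"s0": F_DIST[f_pred], "s1": G_DIST[g_pred]}


def weighted_kl(target, other, weights, eps=1e-12):
    """Sum over states of weight * KL(target(.|s) || other(.|s))."""
    total = 0.0
    for state, w in weights.items():
        kl = sum(
            p * math.log((p + eps) / (q + eps))
            for p, q in zip(target[state], other[state])
            if p > 0
        )
        total += w * kl
    return total


def neighbors(candidate):
    """Candidates that differ from `candidate` in exactly one slot."""
    f_pred, g_pred = candidate
    for f in F_PREDICATES:
        if f != f_pred:
            yield (f, g_pred)
    for g in G_PREDICATES:
        if g != g_pred:
            yield (f_pred, g)


def greedy_local_search(start, score, max_iters=50):
    """Move to the lowest-scoring neighbor until no neighbor improves."""
    current, current_score = start, score(start)
    for _ in range(max_iters):
        best, best_score = current, current_score
        for cand in neighbors(current):
            cand_score = score(cand)
            if cand_score < best_score:
                best, best_score = cand, cand_score
        if best == current:  # local optimum reached
            break
        current, current_score = best, best_score
    return current, current_score


# Toy target policy: the one induced by the (hypothetical) explanation ("b", "y").
TARGET = optimized_policy(("b", "y"))
WEIGHTS = {"s0": 0.5, "s1": 0.5}  # assumed uniform state weights


def score(candidate):
    return weighted_kl(TARGET, optimized_policy(candidate), WEIGHTS)


explanation, divergence = greedy_local_search(("a", "x"), score)
print(explanation, divergence)
```

In this toy setting the score is separable across the two slots, so greedy descent cannot get stuck. In the real problem each call to `score` involves a full policy optimization and the landscape can contain local optima, which is the motivation the summary gives for the expansion/extension steps and the weighting.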
Abstract
Understanding a \textit{reinforcement learning} policy, which maps states to actions to maximize rewards, requires an accompanying explanation that humans can comprehend. In this paper, we introduce a set of \textit{linear temporal logic} formulae designed to explain policies, and an algorithm that searches through those formulae for the one that best explains a given policy. We focus on explanations that describe both the ultimate objectives the policy accomplishes and the prerequisite conditions it upholds throughout its execution. The effectiveness of our proposed approach is illustrated through a simulated game of capture-the-flag and a car-parking environment,
