A Survey on Explainable Reinforcement Learning: Concepts, Algorithms, Challenges

Yunpeng Qing, Shunyu Liu, Jie Song, Yang Zhou, Kaixuan Chen, Huiqiong Wang, Mingli Song

TL;DR

This survey addresses the need for explainability in reinforcement learning by introducing a taxonomy that categorizes XRL methods by the central RL component they explain: agent model, reward, state, and task. It defines XRL, outlines subjective and objective evaluation frameworks, and surveys a broad range of techniques, spanning self-explainable architectures, post-hoc explanations, reward decomposition, and hierarchical task explanations. A notable focus is the integration of human knowledge (e.g., fuzzy rules, language-guided rewards, and annotated subtasks) as a promising but underexplored avenue. The paper also discusses challenges such as standardizing definitions and metrics, explaining multiple parts of the RL process together, and balancing explainability with learning efficiency, and it provides guidance for future research and practical applications in diverse domains.

Abstract

Reinforcement Learning (RL) is a popular machine learning paradigm where intelligent agents interact with the environment to fulfill a long-term goal. Driven by the resurgence of deep learning, Deep RL (DRL) has achieved great success over a wide spectrum of complex control tasks. Despite these encouraging results, the deep neural network-based backbone is widely regarded as a black box, which prevents practitioners from trusting and deploying trained agents in realistic scenarios where high security and reliability are essential. To alleviate this issue, a large body of literature has sought to shed light on the inner workings of intelligent agents by constructing intrinsic interpretability or post-hoc explainability. In this survey, we provide a comprehensive review of existing works on eXplainable RL (XRL) and introduce a new taxonomy where prior works are clearly categorized into model-explaining, reward-explaining, state-explaining, and task-explaining methods. We also review and highlight RL methods that conversely leverage human knowledge to improve the learning efficiency and performance of agents, a kind of method that is often ignored in the XRL field. Some challenges and opportunities in XRL are discussed. This survey intends to provide a high-level summary of XRL and to motivate future research on more effective XRL solutions. The corresponding open-source code is collected and categorized at https://github.com/Plankson/awesome-explainable-reinforcement-learning.

Paper Structure (41 sections, 6 equations, 4 figures, 6 tables)

This paper contains 41 sections, 6 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: An overview of the survey. We categorize existing explainable reinforcement learning (XRL) approaches into four branches based on which part of the RL process they explain: agent model, reward, state, and task. The finer-grained categorization is discussed in detail in later sections. Each category is illustrated in the figure with a selection of representative works, shown in different colors.
  • Figure 2: Diagrams of different types of XRL frameworks. These diagrams illustrate how different types of XRL make different parts of the RL model produce explanations. (a) builds the agent on an explainable model to expose its inner mechanism. (b) reconstructs the reward function $r$ into an explainable one $r'$ by quantifying the impact of various key factors $\{w_i\}$. (c) adds a state analyzer submodule to quantify the influence of state features for each state input $s$. (d) achieves architecture-level explainability in complex tasks through task division and a subtask signal $g$.
  • Figure 3: Examples of self-explainable policy architectures: (a) programmatic reinforcement learning frameworks [inala2020neurosymbolic, trivedi2021learning, PIRL, verma2019imitation]; (b) decision tree policy construction by transforming [VIPER, milani2022maviper] or shaping [LMUT, Conservative_Q_Improvement].
  • Figure 4: Examples of state importance extraction techniques via (a) intrinsic architectures [leurent2019social, tang2021sensory, annasamy2019towards, neuroevolution, peng2022inherently] and (b) extrinsic perturbations [RS-rainbow, petsiuk2018rise, perturbation-based-saliency, object-saliency-map, bertoin2022look].
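
To make the state-explaining idea in Figures 2(c) and 4(b) concrete, here is a minimal sketch of perturbation-based state importance. It is an illustration under stated assumptions, not the method of any work cited above: the toy linear Q-function, the names `linear_q` and `perturbation_saliency`, and the choice of a constant `baseline` value for perturbed features are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

QFunction = Callable[[Sequence[float]], List[float]]


def linear_q(weights: Sequence[Sequence[float]]) -> QFunction:
    """Build a toy linear Q-function (hypothetical stand-in for a trained agent).

    Each row of `weights` holds the feature weights for one action.
    """
    def q(state: Sequence[float]) -> List[float]:
        return [sum(w * s for w, s in zip(row, state)) for row in weights]
    return q


def perturbation_saliency(
    q_fn: QFunction, state: Sequence[float], baseline: float = 0.0
) -> Tuple[int, List[float]]:
    """Score each state feature by how much perturbing it changes the
    Q-value of the action the agent originally selected."""
    q = q_fn(state)
    # Greedy action under the unperturbed state.
    action = max(range(len(q)), key=q.__getitem__)
    scores = []
    for i in range(len(state)):
        perturbed = list(state)
        perturbed[i] = baseline  # assumption: replace the feature with a constant
        scores.append(abs(q[action] - q_fn(perturbed)[action]))
    return action, scores


# Usage: two actions, three state features.
q_fn = linear_q([[1.0, 0.0, 2.0],
                 [0.0, 1.0, 0.0]])
action, scores = perturbation_saliency(q_fn, [1.0, 3.0, 2.0])
print(action, scores)
```

In this toy setting the third feature receives the highest score because the selected action weights it most heavily, while the second feature scores zero because it does not affect that action's value. With a real agent, the perturbation would typically be applied to image regions or learned features rather than single scalars.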