A Survey on Explainable Deep Reinforcement Learning

Zelei Cheng, Jiahao Yu, Xinyu Xing

TL;DR

This survey consolidates Explainable Deep Reinforcement Learning (XRL) by organizing explanation methods for DRL into four levels (feature, state, dataset, and model) and examining their role in policy transparency, debugging, and safety. It covers how XRL interacts with RL in the context of Large Language Models and RLHF, surveys qualitative and quantitative evaluation frameworks, and reviews applications in adversarial robustness and policy refinement. The authors highlight practical implications, security risks, and open challenges, calling for user- and developer-oriented explanations and tighter integration with human-in-the-loop systems. Overall, the work offers a broad blueprint for developing interpretable, trustworthy DRL systems and outlines directions for future research.

Abstract

Deep Reinforcement Learning (DRL) has achieved remarkable success in sequential decision-making tasks across diverse domains, yet its reliance on black-box neural architectures hinders interpretability, trust, and deployment in high-stakes applications. Explainable Deep Reinforcement Learning (XRL) addresses these challenges by enhancing transparency through feature-level, state-level, dataset-level, and model-level explanation techniques. This survey provides a comprehensive review of XRL methods, evaluates their qualitative and quantitative assessment frameworks, and explores their role in policy refinement, adversarial robustness, and security. Additionally, we examine the integration of reinforcement learning with Large Language Models (LLMs), particularly through Reinforcement Learning from Human Feedback (RLHF), which optimizes AI alignment with human preferences. We conclude by highlighting open research challenges and future directions to advance the development of interpretable, reliable, and accountable DRL systems.

Paper Structure

This paper contains 23 sections, 4 equations, 1 figure, and 1 table.

Figures (1)

  • Figure 1: Taxonomy of DRL Explanation Methods