xSRL: Safety-Aware Explainable Reinforcement Learning -- Safety as a Product of Explainability

Risal Shahriar Shefin; Md Asifur Rahman; Thai Le; Sarra Alqahtani

TL;DR

This work addresses the safety and explainability gap in reinforcement learning for real-world deployment by proposing xSRL, a framework that combines local, risk-aware explanations with global policy summaries. It introduces two post-hoc critics, $Q_{\text{task}}(s,a)$ and $Q_{\text{risk}}(s,a)$, to quantify task rewards and safety costs, and integrates these into an enhanced CAPS graph to produce explanations that reveal how safety constraints influence decisions. xSRL also provides adversarial explanations to identify vulnerabilities and a patching mechanism using a shielding policy to improve safety without retraining the task policy. Through quantitative fidelity measures and large-scale user studies in MuJoCo CMDP tasks, xSRL demonstrates improved trust and utility, enabling debugging, vulnerability analysis, and practical safety enhancements for real-world RL systems; code is available at the project repository.
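The patching idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the threshold-based trigger, the function name `make_patched_policy`, and the toy corridor environment are all assumptions made for illustration, since the summary only states that a shielding policy improves safety without retraining the task policy.

```python
from typing import Callable, List, Tuple

State = List[float]
Action = List[float]


def make_patched_policy(
    task_policy: Callable[[State], Action],
    safe_policy: Callable[[State], Action],
    q_risk: Callable[[State, Action], float],
    risk_threshold: float,
) -> Callable[[State], Tuple[Action, bool]]:
    """Wrap a frozen task policy with a risk-gated shield (hypothetical rule)."""

    def patched(state: State) -> Tuple[Action, bool]:
        proposed = task_policy(state)
        # Assumption: the shield takes over when the risk critic's estimate
        # for the proposed action exceeds a fixed threshold.
        if q_risk(state, proposed) > risk_threshold:
            return safe_policy(state), True
        # Otherwise the unmodified task policy acts; it is never retrained.
        return proposed, False

    return patched


# Toy 1-D corridor used only for illustration: the hazard region is x >= 3.0.
def toy_task_policy(state: State) -> Action:
    return [1.0]  # always move right toward the goal


def toy_safe_policy(state: State) -> Action:
    return [-1.0]  # always back away from the hazard


def toy_q_risk(state: State, action: Action) -> float:
    # Stand-in for a learned risk critic: 1.0 if the next position is hazardous.
    return 1.0 if state[0] + action[0] >= 3.0 else 0.0


patched_policy = make_patched_policy(
    toy_task_policy, toy_safe_policy, toy_q_risk, risk_threshold=0.5
)
```

Calling `patched_policy(state)` returns the chosen action together with a flag indicating whether the shield intervened, which is the kind of signal an explanation graph could surface at high-risk states.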

Abstract

Reinforcement learning (RL) has shown great promise in simulated environments, such as games, where failures have minimal consequences. However, the deployment of RL agents in real-world systems such as autonomous vehicles, robotics, UAVs, and medical devices demands a higher level of safety and transparency, particularly when facing adversarial threats. Safe RL algorithms have been developed to address these concerns by optimizing both task performance and safety constraints. Errors are nonetheless inevitable, and when they occur, it is essential that RL agents can also explain their actions to human operators. This makes trust in the safety mechanisms of RL systems crucial for effective deployment. Explainability plays a key role in building this trust by providing clear, actionable insights into the agent's decision-making process, ensuring that safety-critical decisions are well understood. While machine learning (ML) has seen significant advances in interpretability and visualization, explainability methods for RL remain limited. Current tools fail to address the dynamic, sequential nature of RL and its need to balance task performance with safety constraints over time. Repurposing traditional ML methods, such as saliency maps, is inadequate for safety-critical RL applications where mistakes can result in severe consequences. To bridge this gap, we propose xSRL, a framework that integrates both local and global explanations to provide a comprehensive understanding of RL agents' behavior. xSRL also enables developers to identify policy vulnerabilities through adversarial attacks, offering tools to debug and patch agents without retraining. Our experiments and user studies demonstrate xSRL's effectiveness in increasing safety in RL systems, making them more reliable and trustworthy for real-world deployment. Code is available at https://github.com/risal-shefin/xSRL.

Paper Structure (15 sections, 7 equations, 3 figures, 4 tables)

Introduction
Related Work
Background
Approach: Safety-Aware Explainable RL Method
Safety Interpretation via Integrating Local and Global Explanations
Safety Debugging via Adversarial Explanation
Patching Explanation-Based Discovered Vulnerabilities
Evaluation
Trustworthiness of xSRL's Explanations
Fidelity
User Studies
Utility of xSRL's Explanations
Impact of Explanation-guided Attack and Patching Techniques
User Studies
Conclusion

Figures (3)

Figure 1: Examples of generated explanations for the Navigation 2 task using our local explanation, the global explanation from CAPS (McCalmon et al.), and our xSRL, which integrates local and global explanations.
Figure 2: Safety(%) and success-safety(%) performance of the patched agents under the influence of various rates of the explanation-guided attack in Navigation 2.
Figure 3: An example of using (a) the xSRL explanation graph to launch an attack on the SAC agent at high-risk states, (b) the xSRL explanation of the agent's behavior under attack, and (c) the explanation of the same agent's behavior under the same attack after being patched with the safe policy from AdvExRL.
