Redefining Counterfactual Explanations for Reinforcement Learning: Overview, Challenges and Opportunities

Jasmina Gajcin, Ivana Dusparic

TL;DR

The paper addresses the transparency gap in reinforcement learning (RL) by examining counterfactual explanations (CFs) and arguing that porting supervised-learning CFs directly to RL is inadequate because RL is sequential and goal-driven. It surveys existing explainable RL (XRL) methods (global surrogates, summaries, saliency maps, contrastive explanations) and contrasts RL with supervised learning to pinpoint where CFs must differ. A core contribution is a redefinition of CFs for RL: the notions of variables and scores are expanded, and the CF properties (validity, proximity, actionability, sparsity, data-manifold closeness, causality, recourse) are re-specified to account for temporality, stochasticity, and environmental constraints. The paper also outlines key challenges (search spaces, categorical variables, temporality, stochasticity, and evaluation) and proposes directions for RL-specific CF generation methods and evaluation benchmarks, with the aim of giving non-expert users actionable, trustworthy explanations. The intended payoff is greater user trust and human-AI collaboration in RL applications such as healthcare, autonomous systems, and robotics.

Abstract

While AI algorithms have shown remarkable success in various fields, their lack of transparency hinders their application to real-life tasks. Although explanations targeted at non-experts are necessary for user trust and human-AI collaboration, the majority of explanation methods for AI are focused on developers and expert users. Counterfactual explanations are local explanations that offer users advice on what can be changed in the input for the output of the black-box model to change. Counterfactuals are user-friendly and provide actionable advice for achieving the desired output from the AI system. While counterfactuals have been extensively researched in supervised learning, few methods apply them to reinforcement learning (RL). In this work, we explore the reasons for the underrepresentation of this powerful explanation method in RL. We start by reviewing the current work on counterfactual explanations in supervised learning. Additionally, we explore the differences between counterfactual explanations in supervised learning and RL and identify the main challenges that prevent the adoption of supervised-learning methods in reinforcement learning. Finally, we redefine counterfactuals for RL and propose research directions for implementing counterfactuals in RL.

Paper Structure (31 sections, 5 equations, 5 figures, 5 tables)


Figures (5)

  • Figure 1: A summary of goals of XAI depending on the target audience: while the focus of developers is to better understand the abilities of the system and enable successful deployment, experts using the system require explanations to better collaborate with the system. Explanations are necessary for non-expert users to develop trust, ensure system decisions are fair, and give users actionable feedback on how to elicit a different decision from the system.
  • Figure 2: Saliency maps explaining the decisions of Atari agents in the Breakout, Pong and Space Invaders environments, obtained with the algorithm provided in [greydanus2018visualizing]. When making the decision in the pictured states, the agent focuses on the highlighted areas in the image.
  • Figure 3: Example of different types of explanations in a simple RL gridworld environment. The agent's task is to pick up a key of any color, navigate to the lock of the same color, and open it. Blue squares are ice, and stepping on them incurs a large penalty with some small probability. Top: The user asks the agent why it chose the action "right" in a specific state. Bottom left: The agent's explanation refers to the previously visited state where it picked up the blue key. Bottom middle: The agent explains its decision with a temporally distant goal. Bottom right: The agent explains its choice by expressing its preference between two conflicting objectives.
  • Figure 4: Counterfactual generation using growing spheres [laugel2017inverse] for the MNIST dataset. Left: original instance classified as 8. Middle: the closest counterfactual instance classified as 9. Right: pixel difference between the original and counterfactual instances.
  • Figure 5: Left: original game state $x$. Right: counterfactual game state $x'$ obtained through the FACE algorithm [poyiadzi2020face], answering the question "In what position would c4c8 be the best move?". While the provided counterfactual is similar to the original state in terms of its state features, it is not reachable from the original state under the rules of the game, and as such does not offer actionable advice to the user.
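
The point made in the Figure 5 caption, that a counterfactual can be close in feature space yet unreachable under the environment's rules, can be illustrated with a small sketch. Everything in it is a hypothetical toy built for this illustration and is not taken from the paper: the corridor environment, the hand-written `policy` standing in for a black-box agent, and the helper names `feature_distance`, `steps_to_reach`, and `counterfactuals`. It is not the FACE algorithm; it only compares two ways of ranking candidate counterfactual states.

```python
from collections import deque
from itertools import product

# Hypothetical toy environment (an assumption for illustration only):
# a corridor of SIZE cells; state = (position, has_key).
# The key lies at KEY_POS and, once picked up, can never be dropped.
SIZE, KEY_POS = 5, 0
ACTIONS = ("left", "right", "pickup")

def step(state, action):
    """Deterministic transition function of the toy environment."""
    pos, has_key = state
    if action == "left":
        return (max(pos - 1, 0), has_key)
    if action == "right":
        return (min(pos + 1, SIZE - 1), has_key)
    return (pos, has_key or pos == KEY_POS)  # "pickup"

def policy(state):
    """Hand-written stand-in for a black-box agent."""
    pos, has_key = state
    if has_key:
        return "right"
    return "pickup" if pos == KEY_POS else "left"

def feature_distance(a, b):
    """Supervised-learning-style proximity: L1 distance over state features."""
    return abs(a[0] - b[0]) + int(a[1] != b[1])

def steps_to_reach(start, goal):
    """Breadth-first search; fewest environment steps, or None if unreachable."""
    frontier, seen = deque([(start, 0)]), {start}
    while frontier:
        state, depth = frontier.popleft()
        if state == goal:
            return depth
        for action in ACTIONS:
            nxt = step(state, action)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return None

def counterfactuals(x, desired_action):
    """All states other than x in which the policy picks desired_action."""
    states = product(range(SIZE), (False, True))
    return [s for s in states if s != x and policy(s) == desired_action]

# Case 1: both rankings find a valid counterfactual, but they disagree.
x = (3, False)  # here the policy chooses "left"
cands = counterfactuals(x, "right")
nearest_by_features = min(cands, key=lambda s: feature_distance(x, s))
reachable = [(steps_to_reach(x, s), s) for s in cands]
nearest_by_steps = min((d, s) for d, s in reachable if d is not None)[1]

# Case 2: the feature-nearest counterfactual cannot be reached at all.
x2 = (3, True)  # here the policy chooses "right"
cands2 = counterfactuals(x2, "left")
nearest2 = min(cands2, key=lambda s: feature_distance(x2, s))
steps2 = steps_to_reach(x2, nearest2)  # None: the key cannot be dropped
```

In the first case the feature-nearest counterfactual differs from the one that takes the fewest environment steps to reach. In the second case the feature-nearest counterfactual differs from the original state in a single feature but requires dropping the key, which this toy environment does not allow, so it offers no actionable advice.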