Counterfactual Influence in Markov Decision Processes

Milad Kazemi; Jessica Lally; Ekaterina Tishchenko; Hana Chockler; Nicola Paoletti

TL;DR

Counterfactual reasoning in Markov decision processes can produce interventional rather than personalized outcomes as the counterfactual path diverges from the observed trajectory. The authors formalize a notion of counterfactual influence, develop a polynomial-time pruning algorithm to construct influence-constrained counterfactual MDPs, and define $(k,m)$-CF policies that balance optimality with staying informed by the observed path. Through Grid World, epidemic, and sepsis case studies, they show that near-optimal counterfactual policies can be obtained while preserving influence, yielding more informative, observation-tailored explanations. This work advances causal reinforcement learning by providing a principled mechanism to trade off influence against reward and by enabling more credible counterfactual analysis in sequential decision-making.

Abstract

Our work addresses a fundamental problem in the context of counterfactual inference for Markov Decision Processes (MDPs). Given an MDP path $\tau$, this kind of inference allows us to derive counterfactual paths $\tau'$ describing what-if versions of $\tau$ obtained under different action sequences than those observed in $\tau$. However, as the counterfactual states and actions deviate from the observed ones over time, the observation $\tau$ may no longer influence the counterfactual world, meaning that the analysis is no longer tailored to the individual observation, resulting in interventional outcomes rather than counterfactual ones. Even though this issue specifically affects the popular Gumbel-max structural causal model used for MDP counterfactuals, it has remained overlooked until now. In this work, we introduce a formal characterisation of influence based on comparing counterfactual and interventional distributions. We devise an algorithm to construct counterfactual models that automatically satisfy influence constraints. Leveraging such models, we derive counterfactual policies that are not just optimal for a given reward structure but also remain tailored to the observed path. Even though there is an unavoidable trade-off between policy optimality and strength of influence constraints, our experiments demonstrate that it is possible to derive (near-)optimal policies while remaining under the influence of the observation.
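The comparison between counterfactual and interventional distributions can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes a single transition under a Gumbel-max model with one Gumbel noise term per next state, shared across state-action pairs, and it estimates the counterfactual next-state distribution by simple rejection sampling. The function names `counterfactual_dist` and `total_variation`, the toy probability vectors, and the use of total variation distance as the comparison are all illustrative assumptions.

```python
import numpy as np

def counterfactual_dist(p_obs, observed_next, p_cf, n_samples=200_000, seed=0):
    """Monte Carlo estimate of the counterfactual next-state distribution.

    p_obs: next-state probabilities of the observed state-action pair.
    observed_next: index of the next state actually observed.
    p_cf: next-state probabilities of the counterfactual state-action pair
          (this is also the interventional distribution).
    """
    rng = np.random.default_rng(seed)
    p_obs = np.asarray(p_obs, dtype=float)
    p_cf = np.asarray(p_cf, dtype=float)
    with np.errstate(divide="ignore"):  # log(0) = -inf is intended here
        log_obs, log_cf = np.log(p_obs), np.log(p_cf)
    # Prior noise: one standard Gumbel per next state and per sample.
    g = rng.gumbel(size=(n_samples, p_obs.size))
    # Posterior noise: keep only samples that reproduce the observation.
    keep = np.argmax(log_obs + g, axis=1) == observed_next
    # Reuse the retained noise under the counterfactual state-action pair.
    cf_next = np.argmax(log_cf + g[keep], axis=1)
    return np.bincount(cf_next, minlength=p_cf.size) / cf_next.size

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * np.abs(np.asarray(p, dtype=float) - np.asarray(q, dtype=float)).sum()

p_obs = [0.7, 0.3, 0.0, 0.0]        # observed transition distribution
p_overlap = [0.5, 0.5, 0.0, 0.0]    # shares support with p_obs
p_disjoint = [0.0, 0.0, 0.5, 0.5]   # support disjoint from p_obs

cf_overlap = counterfactual_dist(p_obs, 0, p_overlap)
cf_disjoint = counterfactual_dist(p_obs, 0, p_disjoint)
print("overlapping support, TV:", total_variation(cf_overlap, p_overlap))
print("disjoint support, TV:   ", total_variation(cf_disjoint, p_disjoint))
```

In this toy setting, the overlapping case yields a counterfactual distribution that is shifted towards the observed next state, so the distance from the interventional distribution is clearly positive. In the disjoint case the retained noise for the reachable states is unconstrained by the observation, so the estimate matches the interventional distribution up to sampling error, which is the situation the abstract describes as the observation no longer influencing the counterfactual world.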

Paper Structure (32 sections, 2 theorems, 5 equations, 11 figures, 6 tables, 1 algorithm)

Introduction
Motivating Example
Preliminaries
SCM-based Encoding of MDPs
Counterfactual Inference
Counterfactual MDP and Optimal Policies
Theoretical Framework
Methodology
Experiments
Setup
Grid World
Epidemic Model
Sepsis Model
Reduction in MDP Size
Conclusion
...and 17 more sections

Key Result

Proposition 2

Let $\tau$ be a path of an MDP $\mathcal{M}$ of length $T$, and let $\mathcal{M}^\tau$ be the corresponding counterfactual MDP. Given a time $t<T$ and counterfactual state $s'_t$ and action $a'_t$ in $\mathcal{M}^\tau$, then if $P_{\mathcal{M}}(\cdot \mid s_t, a_t)$ and $P_{\mathcal{M}}(\cdot \mid s

Figures (11)

Figure 1: Subset of the state space of the Sepsis MDP. The spectrum from blue to red represents how frequently the state appears in simulated paths of diabetic patients (in red) vs. the whole population (in blue). The intensity of the colour represents how frequently the states are visited in the simulated paths. The black line is an observed trajectory for a diabetic patient, the blue line is the unconstrained counterfactual generated for that path, and the red line is the influence-constrained counterfactual path. The unconstrained counterfactual path diverges further from the observation than the influenced counterfactual. The observation and influenced counterfactual reach the same state at multiple timesteps ($t=1$ and $t=6$), and the states along both paths are shaded with a similar hue and intensity of red, indicating that these paths have comparable (high) likelihoods of occurring in diabetic patients (vs. the general population). The unconstrained counterfactual, by contrast, visits a completely disjoint set of states.
Figure 2: MDP causal graph
Figure 3: Example counterfactual MDP given $k$-step influence. State-action pairs may or may not be influenced by the observed path, and states may or may not be reachable from other influenced state-action pairs.
Figure 4: Grid World: value of initial state given $k$-step influence and maximum $m$ actions changed
Figure 5: Epidemic MDP analysis results
...and 6 more figures

Theorems & Definitions (7)

Definition 1: $m$-CF policy (Tsirtsis et al., 2021)
Proposition 2
Definition 3: 1-step influence
Definition 4: $k$-step influence
Remark 5
Definition 6: $(k, m)$-CF policy
Theorem 7: Optimal $(k,m)$-CF Policy Guarantee
