Robust Counterfactual Inference in Markov Decision Processes

Jessica Lally, Milad Kazemi, Nicola Paoletti

TL;DR

This paper tackles the identifiability problem in counterfactual inference for MDPs by introducing a non-parametric framework that computes tight analytical bounds across all compatible causal models. It converts partial counterfactual inference into exact analytical bounds via canonical SCMs, enabling the construction of interval CFMDPs that encode uncertainty and support robust, worst-case optimization through pessimistic value iteration. The approach achieves substantial speedups over traditional Gumbel-max-based methods (4–251x) and demonstrates enhanced robustness across GridWorld, Sepsis, Frozen Lake, and Aircraft scenarios, making offline policy evaluation and robust counterfactual reasoning more reliable for safety-critical applications. The contribution provides a scalable, theory-grounded path to sound counterfactual explanations and robust policy improvements under causal model uncertainty, with practical implications for offline RL and IRL contexts.

Abstract

This paper addresses a key limitation in existing counterfactual inference methods for Markov Decision Processes (MDPs). Current approaches assume a specific causal model to make counterfactuals identifiable. However, there are usually many causal models that align with the observational and interventional distributions of an MDP, each yielding different counterfactual distributions, so fixing a particular causal model limits the validity (and usefulness) of counterfactual inference. We propose a novel non-parametric approach that computes tight bounds on counterfactual transition probabilities across all compatible causal models. Unlike previous methods that require solving prohibitively large optimisation problems (with variables that grow exponentially in the size of the MDP), our approach provides closed-form expressions for these bounds, making computation highly efficient and scalable for non-trivial MDPs. Once such an interval counterfactual MDP is constructed, our method identifies robust counterfactual policies that optimise the worst-case reward w.r.t. the uncertain interval MDP probabilities. We evaluate our method on various case studies, demonstrating improved robustness over existing methods.
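
The worst-case optimisation step described above can be illustrated with a minimal sketch of pessimistic value iteration on an interval MDP. This is not the authors' implementation: the function names, the array layout (`lower[s, a, s']` and `upper[s, a, s']` holding the probability bounds), the state-action reward matrix `rewards[s, a]`, and the discounted infinite-horizon setting are all assumptions made for illustration. The inner step uses the standard greedy argument for interval MDPs: start from the lower bounds, then hand the remaining probability mass to the lowest-value successor states first.

```python
import numpy as np

def worst_case_distribution(lower, upper, values):
    """Pick p with lower <= p <= upper and sum(p) = 1 that minimises p @ values.

    Assumes the intervals are consistent: sum(lower) <= 1 <= sum(upper).
    """
    p = lower.astype(float).copy()
    budget = 1.0 - p.sum()            # mass still to be assigned
    for s in np.argsort(values):      # lowest-value successors first
        extra = min(upper[s] - p[s], budget)
        p[s] += extra
        budget -= extra
    return p

def pessimistic_value_iteration(lower, upper, rewards,
                                gamma=0.9, tol=1e-8, max_iter=10_000):
    """Maximise the worst-case discounted value over the interval bounds.

    lower, upper: arrays of shape (S, A, S) (hypothetical layout).
    rewards:      array of shape (S, A) (hypothetical layout).
    Returns the robust value function and a greedy robust policy.
    """
    n_states, n_actions, _ = lower.shape
    values = np.zeros(n_states)
    q = np.zeros((n_states, n_actions))
    for _ in range(max_iter):
        for s in range(n_states):
            for a in range(n_actions):
                p = worst_case_distribution(lower[s, a], upper[s, a], values)
                q[s, a] = rewards[s, a] + gamma * (p @ values)
        new_values = q.max(axis=1)
        converged = np.max(np.abs(new_values - values)) < tol
        values = new_values
        if converged:
            break
    return values, q.argmax(axis=1)
```

When every interval collapses to a point (`lower == upper`), the sketch reduces to ordinary value iteration, which is a quick sanity check. Wider intervals can only lower the resulting values, since the adversarial choice of transition probabilities has more room to move.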

Paper Structure

This paper contains 53 sections, 17 theorems, 43 equations, 13 figures, 7 tables.

Key Result

Theorem 4.1

For the observed state-action pair $(s_t, a_t)$, the linear program will produce the following bounds:

Figures (13)

  • Figure 1: MDP causal graph. White nodes represent endogenous/observable variables; grey nodes represent exogenous/unobserved variables.
  • Figure 2: Example MDP where Gumbel-max produces unintuitive CF probabilities. The observed path is $s_0 \rightarrow s_1$.
  • Figure 3: CF inference approaches for off-policy evaluation on GridWorld ($p=0.4$)
  • Figure 4: Average instant reward of CF paths induced by policies on GridWorld $p=0.9$. Error bars denote the standard deviation in reward at each time step.
  • Figure 5: Average instant reward of CF paths induced by policies on GridWorld $p=0.4$.
  • ...and 8 more figures

Theorems & Definitions (23)

  • Definition 2.1: Counterfactual MDP (CFMDP)
  • Definition 3.1: c-component [tian2002general]
  • Definition 3.2: Canonical SCM [pmlr-v162-zhang22ab]
  • Definition 3.3: Counterfactual stability
  • Definition 3.4: Counterfactual monotonicity
  • Theorem 4.1
  • Theorem 4.2
  • Theorem 4.3
  • Definition 5.1: Interval Counterfactual MDP (ICFMDP)
  • Theorem C.1
  • ...and 13 more