Exploiting Causal Graph Priors with Posterior Sampling for Reinforcement Learning

Mirco Mutti; Riccardo De Santi; Marcello Restelli; Alexander Marx; Giorgia Ramponi

TL;DR

The paper introduces C-PSRL, a hierarchical Bayesian posterior-sampling method that exploits partial causal-graph priors to learn factored MDPs efficiently. The algorithm first samples a factorization consistent with the prior graph, then samples transition parameters conditioned on that factorization. It achieves sublinear Bayesian regret, with a bound that depends explicitly on the degree of prior knowledge $\eta$ and the degree of sparseness $Z$. As a byproduct, C-PSRL performs a weak form of causal discovery: the learned factorization converges toward a $Z$-sparse super-graph of the true causal graph. Empirical results in illustrative domains show substantial gains over uninformative priors and performance close to that of oracle priors, supporting the practical value of incorporating causal structure into posterior-sampling RL.

Abstract

Posterior sampling allows exploitation of prior knowledge on the environment's transition dynamics to improve the sample efficiency of reinforcement learning. The prior is typically specified as a class of parametric distributions, the design of which can be cumbersome in practice, often resulting in the choice of uninformative priors. In this work, we propose a novel posterior sampling approach in which the prior is given as a (partial) causal graph over the environment's variables. The latter is often more natural to design, such as listing known causal dependencies between biometric features in a medical treatment study. Specifically, we propose a hierarchical Bayesian procedure, called C-PSRL, simultaneously learning the full causal graph at the higher level and the parameters of the resulting factored dynamics at the lower level. We provide an analysis of the Bayesian regret of C-PSRL that explicitly connects the regret rate with the degree of prior knowledge. Our numerical evaluation conducted in illustrative domains confirms that C-PSRL strongly improves the efficiency of posterior sampling with an uninformative prior while performing close to posterior sampling with the full causal graph.
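
The two-level sampling step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `sample_model`, the tabular count layout, the symmetric Dirichlet pseudo-count `alpha`, and the toy inputs are all assumptions made for illustration. The update of the hyper-posterior weights and the planning step on the sampled model are omitted.

```python
import numpy as np

def sample_model(rng, candidates, log_weights, counts, alpha=1.0):
    """Two-level posterior sample: parent sets first, then transition tables.

    candidates[j]  : list of candidate parent tuples for next-state feature j
    log_weights[j] : unnormalized log hyper-posterior weight of each candidate
    counts[j][i]   : transition counts under candidate i,
                     shape (num_parent_configs, num_values)
    alpha          : symmetric Dirichlet prior pseudo-count (an assumption)
    """
    model = []
    for j, cands in enumerate(candidates):
        # Higher level: draw a local factorization (parent set) for feature j.
        w = np.exp(log_weights[j] - np.max(log_weights[j]))
        i = rng.choice(len(cands), p=w / w.sum())
        # Lower level: draw one categorical distribution per parent configuration.
        table = np.array([rng.dirichlet(alpha + row) for row in counts[j][i]])
        model.append((cands[i], table))
    return model

# Toy setup (hypothetical): two binary next-state features,
# each with two candidate parent sets and no data observed yet.
rng = np.random.default_rng(0)
candidates = [[(0,), (0, 1)], [(), (2,)]]
log_weights = [np.zeros(2), np.zeros(2)]
counts = [[np.zeros((2 ** len(p), 2)) for p in c] for c in candidates]
model = sample_model(rng, candidates, log_weights, counts)
```

Each entry of `model` pairs a sampled parent set with a table whose rows are probability vectors over the next value of that feature. In a full posterior-sampling loop, one would plan on this sampled factored model, run the resulting policy for an episode, and update both the counts and the weights.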

Paper Structure (34 sections, 9 theorems, 58 equations, 3 figures, 1 algorithm)

Introduction
Problem formulation
Causal graphs
Markov decision processes
Causal structure induces factorization
Reinforcement learning with partial causal graph priors
Causal PSRL
Regret analysis of C-PSRL
Discussion of the Bayesian regret
C-PSRL embeds a notion of causal discovery
Experiments
Related work
Conclusion
List of symbols
Parametric priors and posterior updates
...and 19 more sections

Key Result

theorem 4.0

Let $\mathcal{G}_0$ be a causal graph prior with degree of sparseness $Z$ and degree of prior knowledge $\eta$. The $K$-episodes Bayesian regret incurred by C-PSRL is

Figures (3)

Figure 1: (Left) Illustrative causal graph prior $\mathcal{G}_0$ with $d_X = 4, d_Y=2$ features, degree of sparseness $Z = 3$. The hidden true graph $\mathcal{G}_{\mathcal{F}_*}$ includes all the edges in $\mathcal{G}_0$ plus the red-dashed edge $(3,1)$. (Right) Visualization of $\mathcal{Z}$, the set of factorizations consistent with $\mathcal{G}_0$, which is the support of the hyper-prior $P_0$. The factorization $z_*$ of the true FMDP $\mathcal{F}_*$ is highlighted in red.
Figure 2: (a,b) Regret and model error as a function of the episodes in the Random FMDP domain with $d_X = 9, d_Y = 6, Z = 5, N = 2, H = 100$. (c,d) Regret as a function of the episodes in Taxi $3 \times 3$ with $d_X = 5, d_Y = 4, Z = 5, N = [3, 3, 2, 1, 6], H = 10$, and in Taxi $5 \times 5$ with $d_X = 5, d_Y = 4, Z = 5, N = [5, 5, 2, 1, 6], H = 15$. The plots report the mean and 95% confidence interval over 20 runs.
Figure 3: Regret and model error as a function of the episodes in the Random FMDP domain with $d_X = 9, d_Y = 6, Z = 5, N = 2, H = 100$. The plots report the mean and 95% c.i. over 20 runs.
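
As a rough illustration of the set $\mathcal{Z}$ in Figure 1, the following Python sketch enumerates the factorizations consistent with a prior graph, using the caption's sizes $d_X = 4$, $d_Y = 2$, $Z = 3$. The caption does not list the edges of $\mathcal{G}_0$, so the edge sets in `g0` are invented for the example; only the extra edge $(3,1)$ in the true graph comes from the caption. The helper name `local_factorizations` is also hypothetical.

```python
from itertools import combinations, product

def local_factorizations(known, d_x, z):
    """Parent sets of size <= z that contain every known parent."""
    known = frozenset(known)
    free = [i for i in range(1, d_x + 1) if i not in known]
    return [known | frozenset(extra)
            for k in range(z - len(known) + 1)
            for extra in combinations(free, k)]

d_x, z = 4, 3
# Hypothetical prior graph: known parents of each next-state feature (1-based).
g0 = {1: {1, 2}, 2: {2, 4}}

# A global factorization picks one local parent set per next-state feature.
local = {j: local_factorizations(parents, d_x, z) for j, parents in g0.items()}
support = list(product(*local.values()))

# True factorization under the assumed g0: prior edges plus the edge (3, 1).
true_z = (frozenset({1, 2, 3}), frozenset({2, 4}))
print(len(support), true_z in support)
```

Under these assumed edges, each feature admits three local parent sets, so the support contains nine global factorizations, one of which is the true one. The more edges the prior graph fixes, the fewer candidates remain, which is the intuition behind a regret bound that tightens with the degree of prior knowledge $\eta$.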

Theorems & Definitions (17)

definition 1: Bayesian Regret
theorem 4.0
definition 2: $\epsilon$-Value Minimality
corollary 5.0: Weak Causal Discovery
definition 3: Causal Minimality
definition 3: $\epsilon$-Value Minimality
corollary D.0: Weak Causal Discovery
proof
definition 4: $d$-Separation
theorem E.0
...and 7 more
