Learning Causal States Under Partial Observability and Perturbation

Na Li, Hangguan Shan, Wei Ni, Wenjie Zhang, Xinyu Li, Yamin Wang

TL;DR

CaDiff advances RL under perturbed POMDPs by coupling an Asynchronous Diffusion Model for denoising with a Wasserstein-based bisimulation objective to recover causal states. The framework is model-agnostic: it improves decision-making by preserving causal structure while suppressing noise, and it comes with a formal bound on the value-function approximation error. On Roboschool tasks, CaDiff improves returns by at least 14.18% over baselines and learns faster in the early stage of training; ablations indicate that state denoising contributes most when observations are high-dimensional. Overall, CaDiff provides both theoretical guarantees and practical tools for causal-state representation under partial observability and perturbation, supporting more robust RL in noisy real-world environments.

Abstract

A critical challenge for reinforcement learning (RL) is making decisions from incomplete and noisy observations, especially in perturbed and partially observable Markov decision processes (P$^2$OMDPs). Existing methods fail to mitigate perturbations and address partial observability at the same time. We propose Causal State Representation under Asynchronous Diffusion Model (CaDiff), a framework that enhances any RL algorithm by uncovering the underlying causal structure of P$^2$OMDPs. This is achieved by incorporating a novel asynchronous diffusion model (ADM) and a new bisimulation metric. ADM allows the forward and reverse processes to use different numbers of steps, so the perturbation of a P$^2$OMDP can be treated as part of the noise that the diffusion process suppresses. The bisimulation metric quantifies the similarity between partially observable environments and their causal counterparts. Moreover, we establish a theoretical guarantee for CaDiff by deriving an upper bound on the value function approximation error between perturbed observations and denoised causal states, reflecting a principled trade-off between the approximation errors of the reward and transition models. Experiments on Roboschool tasks show that CaDiff improves returns by at least 14.18% compared to baselines. CaDiff is the first framework that approximates causal states using diffusion models with both theoretical rigor and practicality.

Paper Structure

This paper contains 29 sections, 13 theorems, 103 equations, 3 figures, 4 tables, 1 algorithm.

Key Result

Lemma 1

For any probability distributions $\mu$ and $\lambda$ and any orders $q \geq p \geq 1$, $W_q(\mu, \lambda) \geq W_p(\mu, \lambda)$.
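
The inequality in Lemma 1 can be checked numerically in a simple special case. The sketch below is illustrative only and is not taken from the paper: it assumes two equal-size, equally weighted one-dimensional samples, for which the optimal coupling pairs sorted values, so that $W_p^p$ is the mean of $|x_{(i)} - y_{(i)}|^p$. The helper name `wasserstein_1d`, the Gaussian sample parameters, and the chosen orders are all hypothetical.

```python
import numpy as np


def wasserstein_1d(x, y, p):
    """Empirical p-Wasserstein distance between two equal-size 1-D samples.

    Assumption for this sketch: both samples carry uniform weights, so the
    optimal transport plan on the real line matches sorted values.
    """
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    if x.shape != y.shape:
        raise ValueError("samples must have the same size")
    # W_p = (mean |x_(i) - y_(i)|^p)^(1/p)
    return float(np.mean(np.abs(x - y) ** p) ** (1.0 / p))


# Two arbitrary sample sets standing in for mu and lambda (illustrative values).
rng = np.random.default_rng(0)
mu_samples = rng.normal(0.0, 1.0, size=1000)
lam_samples = rng.normal(0.5, 2.0, size=1000)

orders = [1, 2, 3, 4]
distances = [wasserstein_1d(mu_samples, lam_samples, p) for p in orders]
for p, d in zip(orders, distances):
    print(f"W_{p} = {d:.4f}")
```

In this setting the printed distances should be non-decreasing in the order, which is the power-mean inequality applied to the matched differences. This only illustrates the lemma for empirical one-dimensional distributions; the general statement holds for any pair of distributions.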

Figures (3)

  • Figure 1: System model: Solid-line and dashed-line circles denote observed and unobserved variables, respectively; solid and dashed lines represent causality and decision relationships, respectively.
  • Figure 2: Comparison of CaDiff and baselines on six environments.
  • Figure 3: Ablation studies of CaDiff on six environments.

Theorems & Definitions (31)

  • Definition 1: Wasserstein metric (villani2008optimal)
  • Definition 2: Dual formulation of Wasserstein metric (villani2008optimal)
  • Lemma 1: $p$-Wasserstein Inequality (villani2008optimal)
  • Lemma 2: Bounds on Wasserstein distance (santambrogio2015optimal)
  • Definition 3: Hölder norm and Hölder ball
  • Definition 4: CSR under bisimulation
  • Definition 5
  • Definition 6
  • Remark 1
  • proof
  • ...and 21 more