Rethinking State Disentanglement in Causal Reinforcement Learning

Haiyao Cao, Zhen Zhang, Panpan Cai, Yuhang Liu, Jinan Zou, Ehsan Abbasnejad, Biwei Huang, Mingming Gong, Anton van den Hengel, Javen Qinfeng Shi

TL;DR

We revisit this research line and find that incorporating RL-specific context can reduce unnecessary assumptions in previous identifiability analyses for latent states, which in turn lets algorithm design go beyond the boundaries those assumptions imposed.

Abstract

A significant challenge in reinforcement learning (RL) under noisy observations is estimating the latent states. Causality provides rigorous theoretical support, through identifiability, for ensuring that the underlying states can be uniquely recovered. Consequently, some existing work establishes identifiability from a causal perspective to guide algorithm design. However, these results are often derived from a purely causal viewpoint, which may overlook the specific RL context. We revisit this research line and find that incorporating RL-specific context can reduce unnecessary assumptions in previous identifiability analyses for latent states. More importantly, removing these assumptions allows algorithm design to go beyond the boundaries they previously imposed. Leveraging these insights, we propose a novel approach for general partially observable Markov decision processes (POMDPs) that replaces the complicated structural constraints of previous methods with two simple constraints: transition preservation and reward preservation. With these two constraints, the proposed algorithm is guaranteed to disentangle state from noise in a way that is faithful to the underlying dynamics. Empirical evidence from extensive benchmark control tasks demonstrates the superiority of our approach over existing counterparts in disentangling state belief from noise.

Paper Structure

This paper contains 31 sections, 3 theorems, 13 equations, 7 figures, and 6 tables.

Key Result

Proposition 1

Given a POMDP $(\mathcal{S},\mathcal{A},\mathcal{O},\mathcal{T},\mathcal{M},\mathcal{R},\gamma)$, suppose the observation function $\mathcal{M}$ is invertible. Let $g:\mathcal{O}\mapsto\mathcal{S}$ … then the estimated MDP $(\hat{\mathcal{S}}=\{\hat{s}\mid\exists o,\ \text{s.t.}\ \ldots\},\ \ldots)$ …
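To make the two constraints named in the abstract more concrete, the following is a minimal toy sketch, not the paper's algorithm. It assumes a deterministic tabular problem with an invertible observation function, and it reads "transition preservation" and "reward preservation" in a simplified way: the encoded next state and the reward must each be a single-valued function of the encoded state and the action. The environment, both encoders, and all names (`true_step`, `observe`, `check_preservation`, and so on) are hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical toy problem: 3 latent states on a ring, 2 noise values, 2 actions.
STATES = [0, 1, 2]
NOISES = [0, 1]
ACTIONS = [0, 1]


def true_step(s, a):
    """Assumed true transition: action 0 stays, action 1 moves one step on the ring."""
    return (s + a) % len(STATES)


def true_reward(s, a):
    """Assumed true reward: 1.0 in state 2, otherwise 0.0 (the action is ignored)."""
    return 1.0 if s == 2 else 0.0


def observe(s, z):
    """Invertible observation: each (state, noise) pair maps to a unique integer."""
    return s * len(NOISES) + z


def good_encoder(o):
    """Candidate g: drops the noise and keeps the state."""
    return o // len(NOISES)


def bad_encoder(o):
    """Candidate g: keeps the noise and drops the state."""
    return o % len(NOISES)


def check_preservation(encoder):
    """Return (transition_ok, reward_ok) for an encoder under the simplified reading.

    The noise is allowed to jump to any value at the next step, so every
    combination of (state, noise, action, next noise) is enumerated.
    """
    transition_table = {}
    reward_table = {}
    transition_ok = True
    reward_ok = True
    for s, z, a, z_next in product(STATES, NOISES, ACTIONS, NOISES):
        o = observe(s, z)
        o_next = observe(true_step(s, a), z_next)
        key = (encoder(o), a)
        s_hat_next = encoder(o_next)
        r = true_reward(s, a)
        # The encoded next state must be determined by (encoded state, action).
        if transition_table.setdefault(key, s_hat_next) != s_hat_next:
            transition_ok = False
        # The reward must be determined by (encoded state, action).
        if reward_table.setdefault(key, r) != r:
            reward_ok = False
    return transition_ok, reward_ok


if __name__ == "__main__":
    print("good encoder:", check_preservation(good_encoder))
    print("bad encoder:", check_preservation(bad_encoder))
```

In this toy setting, the encoder that keeps the state satisfies both checks, while the encoder that keeps only the noise fails both, because neither the next code nor the reward can be predicted from the noise and the action alone. The paper's actual constraints are enforced inside a learned world model and may be defined differently.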

Figures (7)

  • Figure 1: Overall illustration of our world model, which consists of a state belief RSSM (in green) and a noise belief RSSM (in purple). They are differentiated by the action, the reward model, and the emission models, which are the key parts for maintaining reward preservation and transition preservation. Furthermore, the RSSM for the noise belief does more than model the transitions of noise: owing to its Bayesian nature, it also encapsulates the inherent uncertainty within the system. Visualizations of the recovered state and noise clearly show that our world model significantly outperforms TIA (Fu et al., 2021), DenoisedMDP (Wang et al., 2022), and IFactor (Liu et al., 2023) in disentangling state and noise.
  • Figure 2: A typical generative model for RL on an MDP with noisy observations, where $s$ denotes the latent state and $z$ the latent noise, both of which are unobservable. Previous work (Huang et al., 2022; Liu et al., 2023) is built on different generative models, e.g. dividing $s$ into independent parts under additional assumptions, which could hinder application in the real world. In contrast, our work is not limited to a specific generative model and guarantees that the estimated MDP is equivalent to the underlying true MDP when transition preservation and reward preservation are maintained.
  • Figure 3: Our method is specifically designed to disentangle latent state from latent noise, which means it may not perform exceptionally in noiseless environments. However, evaluating performance in noiseless settings can offer valuable insights by allowing a comparative analysis of each method's effectiveness in both noisy and noiseless environments.
  • Figure 4: In DMC tasks with a uniform background, where the environment setting is less complex than in the diverse scenarios, our method either achieves the best performance or performs comparably to SOTA results.
  • Figure 5: In the diverse video background scenario, our method outperforms strong baselines and achieves the best performance in 5 out of 6 tasks, except Reacher Easy. DreamerV3, benefiting from its larger architecture, achieves the best performance on Reacher Easy. Our method could also achieve comparable performance with an expanded belief space and posterior space.
  • ...and 2 more figures

Theorems & Definitions (6)

  • Proposition 1
  • proof
  • Proposition 2
  • Proposition 1: For Invertible $\mathcal{M}$
  • proof
  • proof