Tackling Non-Stationarity in Reinforcement Learning via Causal-Origin Representation

Wanpeng Zhang, Yilin Li, Boyu Yang, Zongqing Lu

TL;DR

This paper addresses non-stationarity in reinforcement learning by recasting the dynamics in terms of causal relationships and introducing COREP, a method that learns a stable causal-origin representation via a dual Graph Attention Network (GAT). COREP builds an environment-shared union graph across sub-environments: a TD-error-guided update stabilizes the core graph structure, while a general graph compensates for information loss. The resulting representation is fused with a Variational Autoencoder (VAE) to guide policy learning. The approach is supported by a causal interpretation and theoretical arguments for recovering the union MAG (maximal ancestral graph), and experiments show improved resilience to complex non-stationarity compared with FN-VAE, VariBAD, and PPO. As a future direction, richer latent-variable models are proposed for scaling to high-dimensional state spaces.

Abstract

In real-world scenarios, the application of reinforcement learning is significantly challenged by complex non-stationarity. Most existing methods attempt to model changes in the environment explicitly, often requiring impractical prior knowledge of environments. In this paper, we propose a new perspective, positing that non-stationarity can propagate and accumulate through complex causal relationships during state transitions, thereby compounding its sophistication and affecting policy learning. We believe that this challenge can be more effectively addressed by implicitly tracing the causal origin of non-stationarity. To this end, we introduce the Causal-Origin REPresentation (COREP) algorithm. COREP primarily employs a guided updating mechanism to learn a stable graph representation for the state, termed the causal-origin representation. By leveraging this representation, the learned policy exhibits impressive resilience to non-stationarity. We supplement our approach with a theoretical analysis grounded in the causal interpretation for non-stationary reinforcement learning, advocating for the validity of the causal-origin representation. Experimental results further demonstrate the superior performance of COREP over existing methods in tackling non-stationarity problems.

Paper Structure

This paper contains 27 sections, 1 theorem, 23 equations, 13 figures, 6 tables, 2 algorithms.

Key Result

Proposition 3.2

Suppose that the dynamics follow Equations (eq:transition) through (eq:reward-function). Then there exists a partial order $\pi$ on $V$ such that (a) $u$ is an ancestor of $v$ $\Rightarrow u <_\pi v$ in $\mathcal{M}_{(k)}$; and (b) $u \leftrightarrow v \Rightarrow u \not\lessgtr_\pi v$ in $\mathcal{M}_{(k)}$ ...
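The two conditions in the proposition can be illustrated on a toy example. The sketch below is not the paper's construction: it assumes each sub-environment graph is given as a set of directed edges and a set of bidirected edges over a shared node set, takes the union as a plain edge-set union (the exact Definition 3.1 is not reproduced on this page), and checks whether the ancestral relation of the union is a partial order satisfying (a) and (b). The function names `union_graph`, `ancestors`, and `admits_partial_order` are hypothetical.

```python
def union_graph(graphs):
    """Edge-set union of several graphs (assumed form of the union graph)."""
    directed, bidirected = set(), set()
    for g in graphs:
        directed |= set(g["directed"])
        # Bidirected edges are unordered, so store them as frozensets.
        bidirected |= {frozenset(e) for e in g["bidirected"]}
    return {"directed": directed, "bidirected": bidirected}


def ancestors(nodes, directed):
    """Return anc[v] = set of strict ancestors of v (transitive closure)."""
    anc = {v: set() for v in nodes}
    for u, v in directed:
        anc[v].add(u)
    changed = True
    while changed:
        changed = False
        for v in nodes:
            # Ancestors of v's ancestors are also ancestors of v.
            inherited = set().union(*(anc[u] for u in anc[v])) - anc[v]
            if inherited:
                anc[v] |= inherited
                changed = True
    return anc


def admits_partial_order(nodes, graph):
    """Check conditions (a) and (b) using the ancestral relation as the order."""
    anc = ancestors(nodes, graph["directed"])
    # (a) needs a strict order, so no node may be its own ancestor.
    acyclic = all(v not in anc[v] for v in nodes)
    # (b) endpoints of a bidirected edge must be incomparable.
    incomparable = True
    for edge in graph["bidirected"]:
        u, v = tuple(edge)
        if u in anc[v] or v in anc[u]:
            incomparable = False
    return acyclic and incomparable


nodes = ["a", "b", "c", "d"]
g1 = {"directed": [("a", "b"), ("b", "c")], "bidirected": [("c", "d")]}
g2 = {"directed": [("a", "c")], "bidirected": [("b", "d")]}
g3 = {"directed": [("d", "c")], "bidirected": []}

ok = admits_partial_order(nodes, union_graph([g1, g2]))
bad = admits_partial_order(nodes, union_graph([g1, g3]))
```

Here `g1` and `g2` are compatible, while adding `g3` makes `d` an ancestor of `c` even though `c` and `d` share a bidirected edge in `g1`, so condition (b) fails for the ancestral order.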

Figures (13)

  • Figure 1: MAG representations for two sub-environments and their union graph. In this example, the union graph is capable of representing all possible kinds of causal relationships within the changing dynamics. More explanations can be found in Appendix \ref{sec:full-theory}.
  • Figure 2: Overview of the COREP framework. (1) The left part illustrates that the COREP framework can be seamlessly incorporated into any RL algorithm. It takes the state as input and outputs the causal-origin representation for policy optimization. (2) The middle part shows the VAE structure employed by the COREP framework, which is utilized to enhance the learning efficiency. (3) The right part highlights the key components of COREP. The dual GAT structure is designed in line with the concept of causal-origin representation to retain the essential parts of the graph. The TD error detection can guide the core-GAT to learn the environment-shared union graph based on our theory. The general-GAT is continuously updated to compensate for the potential loss of information.
  • Figure 3: Learning curves of our COREP algorithm and other baselines in different tasks. Solid curves indicate the mean of all trials with 5 different seeds. Shaded regions correspond to the standard deviation among trials. Dashed lines represent the asymptotic performance of PPO and Oracle.
  • Figure 4: Mean returns of 3 different trials with: (a) different components and non-stationarity settings. Returns are normalized to the full version of COREP in each environment; (b) different degrees of non-stationarity. Returns are normalized to the COREP algorithm with standard degree $1.0$.
  • Figure 5: DAG representations for two sub-environments.
  • ...and 8 more figures
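The guided update described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the TD error is the standard one-step form $r + \gamma V(s') - V(s)$, the detection rule (flag a TD error that falls outside a running mean ± k·std band) is an assumption, and so is the choice to update the core parameters only when a shift is flagged while the general parameters are updated every step. `TDErrorDetector`, `dual_update`, `k`, and `warmup` are hypothetical names, and plain parameter vectors stand in for the two GATs.

```python
import numpy as np


def td_error(reward, gamma, v_next, v_curr):
    """Standard one-step TD error."""
    return reward + gamma * v_next - v_curr


class TDErrorDetector:
    """Flags TD errors outside a running mean +/- k*std band (assumed rule)."""

    def __init__(self, k=2.0, warmup=10):
        self.k = k
        self.warmup = warmup
        self.history = []

    def is_shift(self, delta):
        flagged = False
        # Only test once enough TD errors have been observed.
        if len(self.history) >= self.warmup:
            mean = np.mean(self.history)
            std = np.std(self.history) + 1e-8
            flagged = abs(delta - mean) > self.k * std
        self.history.append(delta)
        return bool(flagged)


def dual_update(core, general, grad_core, grad_general, shift, lr=0.1):
    """One gradient step: general always moves, core only on a flagged shift."""
    general = general - lr * grad_general
    if shift:  # assumption: core parameters change only when a shift is flagged
        core = core - lr * grad_core
    return core, general


detector = TDErrorDetector(k=2.0, warmup=10)
stream = [0.1, -0.1] * 5 + [0.05, 5.0]  # small TD errors, then one large spike
flags = [detector.is_shift(d) for d in stream]

core, general = np.zeros(3), np.zeros(3)
grad = np.ones(3)
core_a, general_a = dual_update(core, general, grad, grad, shift=False)
core_b, general_b = dual_update(core, general, grad, grad, shift=True)
```

In this toy stream only the final spike is flagged. Gating the core update this way keeps the core parameters from drifting with every small fluctuation, while the always-updated general parameters absorb whatever the core misses.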

Theorems & Definitions (6)

  • Definition 3.1: Environment-shared union graph
  • Proposition 3.2
  • Definition 1.1: Global Markov Condition (Pearl, 2000)
  • Definition 1.2: Faithfulness (Pearl, 2000)
  • Definition 1.3: Mixture of stationary distributions
  • Proof of Proposition \ref{thm:idn}
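For orientation, the first two definitions listed above are usually stated as follows for a graph $\mathcal{G}$ and a distribution $P$ over its nodes. This is the textbook form, not a copy of the paper's wording, and the notation ($A$, $B$, $C$ for disjoint node sets, $\perp_{\mathcal{G}}$ for graphical separation) is an assumption. The third line is only the generic shape of a finite mixture with assumed weights $w_k$ and components $p_k$; the paper's exact Definition 1.3 is not shown on this page.

```latex
% Global Markov condition: separation in the graph implies independence in P
A \perp_{\mathcal{G}} B \mid C \;\Longrightarrow\; A \perp\!\!\!\perp_{P} B \mid C

% Faithfulness: the converse implication
A \perp\!\!\!\perp_{P} B \mid C \;\Longrightarrow\; A \perp_{\mathcal{G}} B \mid C

% Generic finite mixture (assumed form)
p(x) = \sum_{k=1}^{K} w_k \, p_k(x), \qquad w_k \ge 0, \quad \sum_{k=1}^{K} w_k = 1
```

Taken together, the Markov condition and faithfulness make graphical separation and conditional independence equivalent, which is what allows graph structure to be read off from independence statements.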