Debiased Offline Representation Learning for Fast Online Adaptation in Non-stationary Dynamics

Xinyu Zhang, Wenjie Qiu, Yi-Chen Li, Lei Yuan, Chengxing Jia, Zongzhang Zhang, Yang Yu

TL;DR

Offline RL under non-stationary dynamics suffers from confounding between environment changes and behavior-policy shifts. DORA addresses this by applying an information bottleneck to learn debiased, dynamics-relevant representations from recent state-action histories, using a distortion-based contrastive bound for $I(z;M)$ and a KL-based debias loss for $I(z;a)$. The encoder, paired with a contextual policy trained with an offline RL algorithm such as CQL, enables fast online adaptation without pre-collected context, as demonstrated on six MuJoCo tasks with changing dynamics. Results show sharper dynamics encoding, improved performance over baselines in IID, OOD, and non-stationary settings, and latent-space visualizations in which the debiased representations align with the dynamics, which suggests practical value for safe and efficient offline-to-online adaptation.
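
To make the idea of a distance-based contrastive term for $I(z;M)$ more concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's loss: the exact bound is not given in this summary, and the function name `distance_contrastive_loss`, the squared Euclidean distance, and the softmax form are all hypothetical choices in the style of InfoNCE.

```python
import numpy as np

def distance_contrastive_loss(z, task_means, task_ids):
    """InfoNCE-style loss (assumed form, not taken from the paper).

    z:          (B, D) encodings produced by a context encoder
    task_means: (N, D) one reference encoding per training task
    task_ids:   (B,)   index of the task each encoding came from
    """
    # Squared Euclidean distance from each encoding to each task mean: (B, N).
    diff = z[:, None, :] - task_means[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    # Closer means a higher score, so logits are negative distances.
    logits = -sq_dist
    # Numerically stable log-softmax over the task axis.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Negative log-likelihood of the correct task, averaged over the batch.
    return -log_probs[np.arange(len(task_ids)), task_ids].mean()

# Toy data: three well-separated task means in a 2D latent space.
rng = np.random.default_rng(0)
task_means = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])
task_ids = np.array([0, 1, 2, 1])

# Encodings near their own task mean versus near a wrong task mean.
aligned = task_means[task_ids] + 0.1 * rng.normal(size=(4, 2))
shuffled = task_means[(task_ids + 1) % 3] + 0.1 * rng.normal(size=(4, 2))

loss_aligned = distance_contrastive_loss(aligned, task_means, task_ids)
loss_shuffled = distance_contrastive_loss(shuffled, task_means, task_ids)
```

In this toy setup the loss is small when encodings cluster by task and large when they sit near the wrong task, which is the behavior a dynamics-identifying term should reward.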

Abstract

Developing policies that can adjust to non-stationary environments is essential for real-world reinforcement learning applications. However, learning such adaptable policies in offline settings, with only a limited set of pre-collected trajectories, presents significant challenges. A key difficulty arises because the limited offline data makes it hard for the context encoder to differentiate between changes in the environment dynamics and shifts in the behavior policy, often leading to context misassociations. To address this issue, we introduce a novel approach called Debiased Offline Representation for fast online Adaptation (DORA). DORA incorporates an information bottleneck principle that maximizes mutual information between the dynamics encoding and the environmental data, while minimizing mutual information between the dynamics encoding and the actions of the behavior policy. We present a practical implementation of DORA, leveraging tractable bounds of the information bottleneck principle. Our experimental evaluation across six benchmark MuJoCo tasks with variable parameters demonstrates that DORA not only achieves a more precise dynamics encoding but also significantly outperforms existing baselines in terms of performance.
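
The trade-off described in the abstract can be written compactly as follows. This is a sketch of the objective's shape only: the weight $\beta$ is an assumed symbol for the balance between the two terms, and the tractable bounds the paper substitutes for each term are not reproduced here.

$$
\max_{\phi}\; I(z; M) \;-\; \beta\, I(z; a), \qquad z \sim p_\phi(\cdot \mid \tau)
$$

The first term rewards encodings that identify the dynamics $M$; the second penalizes information about the behavior policy's actions $a$ that leaks into $z$.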

Paper Structure

This paper contains 40 sections, 4 theorems, 14 equations, 9 figures, 8 tables, 2 algorithms.

Key Result

Theorem 3.1

Denote a set of $N$ tasks as $\mathcal{M}$, in which each task $M_i$ is sampled from the same training task distribution $P_{\rm train}$. Let $M \in \mathcal{M}$ be a random variable, $\tau$ a trajectory collected in $M$, $z \sim p_\phi(\cdot|\tau)$, and $p(z)$ the prior distribution of $z$. Then we have … where $\tau^i$ is a trajectory collected in task $M_i$, $z_i \sim p_\phi(\cdot|\tau^i)$, and $i \in$ …

Figures (9)

  • Figure 1: The DORA framework. The encoder utilizes recent state-action pairs to maintain a set of representations $\{z^{1}, \cdots, z^H \}$ and $z^H$ updates the moving average task encodings $\{\bar{z}_i\}_{i=1}^N$. All these representations are then used to optimize the encoder. The contextual policy is trained through offline RL on the datasets, where each transition is labeled with its representation by the learned encoder.
  • Figure 2: Representation visualization in Cheetah-gravity tasks with IID dynamics. The points are the (projected) representations in a 2D latent space, with the color indicating the real parameters of dynamics.
  • Figure 3: Representation tracking in a single trajectory in non-stationary dynamics. Real represents the normalized real parameters of unseen dynamics, DORA_x and DORA_y are the coordinates of DORA's representations in the 2D latent space, and the same applies to offline ESCP. Left: In Cheetah-dof. Right: In Cheetah-gravity.
  • Figure 4: Visualization of task representations generated with 2 different context-collection policies in 5 unseen dynamics. Points of different shapes represent different policies. Left: In Cheetah-gravity. Right: In Cheetah-dof.
  • Figure 5: Ablation studies: Average normalized return of DORA on 3 environments over 5 random seeds. The error bar stands for the standard deviation. Left: DORA with and without the debias loss. Right: RNN history lengths of 4, 8, and 16.
  • ...and 4 more figures
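
The Figure 1 caption mentions that the latest representation $z^H$ updates moving-average task encodings $\{\bar{z}_i\}_{i=1}^N$. A minimal sketch of one such update is shown below, assuming an exponential moving average. The function name `update_task_average` and the `momentum` value are hypothetical; the summary does not state which averaging rule or coefficient the authors use.

```python
import numpy as np

def update_task_average(task_means, task_id, z_latest, momentum=0.9):
    """Exponential moving average update of one task's stored encoding.

    task_means: (N, D) current moving-average encodings, one row per task
    task_id:    index i of the task the trajectory was collected in
    z_latest:   (D,)   newest representation z^H for that task
    momentum:   assumed smoothing coefficient (not given in the source)
    """
    # Copy so the caller's array is not modified in place.
    updated = task_means.copy()
    # Blend the old average with the new representation for this task only.
    updated[task_id] = momentum * task_means[task_id] + (1.0 - momentum) * z_latest
    return updated

# Toy example: three tasks, 2D latent space, all averages start at zero.
task_means = np.zeros((3, 2))
z_latest = np.array([1.0, -2.0])
task_means = update_task_average(task_means, task_id=1, z_latest=z_latest)
```

Only the row for the sampled task moves, so each $\bar{z}_i$ changes slowly and can act as a stable reference point for that task's encodings.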

Theorems & Definitions (6)

  • Theorem 3.1
  • Theorem 3.2
  • Theorem 1.1
  • proof
  • Theorem 1.1
  • proof