Offline Reinforcement Learning from Datasets with Structured Non-Stationarity

Johannes Ackermann, Takayuki Osa, Masashi Sugiyama

TL;DR

This work tackles offline reinforcement learning under structured non-stationarity, modeling data gathered from multiple deployments as a Dynamic-Parameter MDP (DP-MDP) with a hidden parameter (HiP) $z$ that remains fixed within an episode but evolves between episodes. The authors introduce COSPA, a CPC-based approach that learns a latent representation of the HiP from past trajectories, predicts the next HiP during evaluation, and trains a context-conditioned policy on augmented data $\hat{s}=(s,\tilde{z})$ using a TD3+BC objective. Empirically, COSPA yields informative HiP representations, accurate latent prediction, and near-oracle policy performance across a range of low- and high-dimensional tasks, outperforming baselines such as BOReL and ContraBAR in many settings. The method offers a practical path to robust offline RL in real-world systems where non-stationarity arises from wear, hardware changes, or deployment variability, and the authors release code and datasets to facilitate further research.
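
As a rough illustration of the policy-training step described above, the sketch below appends an inferred latent $\tilde{z}$ to every state of an episode and evaluates the TD3+BC actor loss for a batch of given Q-values. The function names, the NumPy-only formulation, and `alpha=2.5` (the default reported for TD3+BC) are assumptions made for illustration, not the authors' released code; an actual implementation would compute this loss in an autodiff framework.

```python
import numpy as np

def augment_states(states, z_tilde):
    """Build s_hat = (s, z_tilde) by appending one episode-level latent to each state."""
    states = np.asarray(states, dtype=np.float64)    # shape (T, state_dim)
    z_tilde = np.asarray(z_tilde, dtype=np.float64)  # shape (latent_dim,)
    tiled = np.broadcast_to(z_tilde, (states.shape[0], z_tilde.shape[0]))
    return np.concatenate([states, tiled], axis=1)   # shape (T, state_dim + latent_dim)

def td3_bc_actor_loss(q_values, policy_actions, dataset_actions, alpha=2.5):
    """TD3+BC actor loss: -lambda * mean(Q) + mean squared behavior-cloning error."""
    q_values = np.asarray(q_values, dtype=np.float64)
    # lambda normalizes the Q term by the average Q magnitude in the batch.
    lam = alpha / (np.mean(np.abs(q_values)) + 1e-8)
    diff = np.asarray(policy_actions, dtype=np.float64) - np.asarray(dataset_actions, dtype=np.float64)
    return -lam * np.mean(q_values) + np.mean(diff ** 2)

# Hypothetical episode with 3 two-dimensional states and a 2-dimensional latent.
s_hat = augment_states(np.zeros((3, 2)), np.array([0.5, -0.5]))
loss = td3_bc_actor_loss(np.ones(4), np.zeros((4, 1)), np.zeros((4, 1)))
```

Here the Q-values and policy actions would be computed on the augmented states `s_hat`, so the policy can condition on the inferred or predicted latent.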

Abstract

Current Reinforcement Learning (RL) is often limited by the large amount of data needed to learn a successful policy. Offline RL aims to solve this issue by using transitions collected by a different behavior policy. We address a novel Offline RL problem setting in which, while collecting the dataset, the transition and reward functions gradually change between episodes but stay constant within each episode. We propose a method based on Contrastive Predictive Coding that identifies this non-stationarity in the offline dataset, accounts for it when training a policy, and predicts it during evaluation. We analyze our proposed method and show that it performs well in simple continuous control tasks and challenging, high-dimensional locomotion tasks. We show that our method often achieves the oracle performance and performs better than baselines.
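
To make the problem setting concrete, the following toy sketch generates one deployment in which a hidden parameter stays constant within each episode and changes only between episodes. The one-dimensional goal-reaching task, the linear drift of the hidden parameter, and the uniform random behavior policy are all assumptions chosen for brevity; they are not the environments or data-collection procedure used in the paper.

```python
import numpy as np

def collect_deployment(num_episodes=5, horizon=10, drift=0.1, seed=0):
    """Collect episodes whose hidden goal position z changes only between episodes."""
    rng = np.random.default_rng(seed)
    z = rng.uniform(-1.0, 1.0)  # hidden parameter: goal position (assumed 1-D)
    episodes = []
    for _ in range(num_episodes):
        s = 0.0
        transitions = []
        for _ in range(horizon):
            a = rng.uniform(-0.2, 0.2)  # random behavior policy (assumption)
            s_next = s + a              # dynamics do not depend on z in this toy task
            r = -abs(s_next - z)        # reward depends on the hidden goal position
            transitions.append((s, a, r, s_next))
            s = s_next
        # z is stored for inspection only; the learner would not observe it.
        episodes.append({"z": z, "transitions": transitions})
        z = z + drift                   # the hidden parameter evolves between episodes
    return episodes

deployment = collect_deployment()
```

An offline dataset in this setting would consist of many such deployments, with only the transitions (and not `z`) available to the learner.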

Paper Structure (42 sections, 2 theorems, 6 equations, 8 figures, 4 tables)

Key Result

Lemma B.1

Suppose that $f, g_{\mathrm{enc}}, g_{\mathrm{ar}}$ jointly minimize the InfoNCE loss (eq:cpcreprloss). Then, for any trajectory $\tau$, with $c_i=g_{\mathrm{ar}}(g_{\mathrm{enc}}(\tau_{i-1},\tau_{i-2},\dots, \tau_{1}))$, we have

Figures (8)

  • Figure 1: We address an Offline RL setting in which the dataset is generated from multiple deployments with evolving non-stationarity. We make the structural assumption that the reward and transition functions depend on a hidden parameter $z$ that is constant during each episode but evolves between episodes. Following this assumption, we develop a method based on Contrastive Predictive Coding that infers the hidden parameter from the deployments in our dataset. We then train a predictor and a policy to use during evaluation with access to context trajectories.
  • Figure 2: Left: Graphical model of the DP-MDP. Right: Illustration of a deployment sampled from the dataset and our approach to infer the hidden variable. We use Contrastive Predictive Coding to learn a model that can discriminate future trajectories $\tau_{i+k}$ based on past trajectories $(\tau_{i},\tau_{i-1},\dots,\tau_1)$ by learning a representation of the past trajectories $(\tilde{z}_i,\tilde{z}_{i-1},\dots,\tilde{z}_1)$.
  • Figure 3: Illustrations of our evaluation environments. From left to right: 1D-Goal, 2D-Goal, 2D-Wind, Ant-Leg, Ant-Weight, Barkour-Weight. In 1D-Goal and 2D-Goal, the goal location, and thus the reward function, depends on the HiP $z$. In the remaining tasks, the transition function changes.
  • Figure 4: Comparison of the learned representations. The left side shows t-SNE visualizations; the right side shows the mean test accuracy, with 95% CIs across 20 trials, of linear probes trained to predict the ground-truth HiPs. Each dot is the embedding of a trajectory, colored by its HiP. For BOReL and VRNN, the mean $\mu$ of the posterior is visualized.
  • Figure 5: t-SNE visualization of the inferred latents $\tilde{z}$ as crosses ($\times$), and predicted latents $\bar{z}_i=f_\mathrm{pred}(\tilde{z}_{i-N_c},\dots,\tilde{z}_{i-1})$ as circles ($\circ$).
  • ...and 3 more figures
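
The contrastive objective referred to in the Figure 2 caption, which discriminates the true future trajectory from other candidates given a summary of past trajectories, can be sketched as a generic InfoNCE loss. The bilinear score $c_i^\top W e_j$, the use of the other batch elements as negatives, and all names below are assumptions following the standard CPC formulation; the paper's exact encoder, score function, and negative sampling may differ.

```python
import numpy as np

def info_nce_loss(contexts, futures, W):
    """Generic InfoNCE loss with a bilinear score and in-batch negatives.

    contexts: (B, d_c) summaries of past trajectories (one per deployment sample)
    futures:  (B, d_e) embeddings of the matching future trajectories
    W:        (d_c, d_e) bilinear score matrix
    Row i treats futures[i] as the positive and all other rows as negatives.
    """
    contexts = np.asarray(contexts, dtype=np.float64)
    futures = np.asarray(futures, dtype=np.float64)
    logits = contexts @ W @ futures.T                     # logits[i, j] = c_i^T W e_j
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                   # cross-entropy on the positives

# Uninformative contexts give the chance-level loss log(B).
chance = info_nce_loss(np.zeros((4, 4)), np.eye(4), np.eye(4))
# Contexts aligned with their own future embedding give a loss near zero.
aligned = info_nce_loss(10.0 * np.eye(4), np.eye(4), np.eye(4))
```

Minimizing this loss pushes each context to score its own future trajectory above the others, which is what makes the learned representation informative about the hidden parameter.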

Theorems & Definitions (5)

  • Definition 2.1
  • Lemma B.1
  • Proof of Lemma B.1
  • Lemma B.2
  • Proof of Lemma B.2