Offline Reinforcement Learning from Datasets with Structured Non-Stationarity
Johannes Ackermann, Takayuki Osa, Masashi Sugiyama
TL;DR
This work tackles offline reinforcement learning under structured non-stationarity, modeling data gathered from multiple deployments as a Dynamic-Parameter MDP with a hidden parameter (HiP) $z$ that remains fixed within an episode but evolves between episodes. The authors introduce COSPA, a method based on Contrastive Predictive Coding (CPC) that learns a latent representation of the HiP from past trajectories, predicts the next HiP during evaluation, and trains a context-conditioned policy on augmented data $\hat{s}=(s,\tilde{z})$ using a TD3+BC objective. Empirically, COSPA yields informative HiP representations, accurate latent prediction, and near-oracle policy performance across a range of low- and high-dimensional tasks, outperforming baselines such as BOReL and ContraBAR in many settings. The method offers a practical path to robust offline RL in real-world systems where non-stationarity arises from wear, hardware changes, or deployment variability, and the authors release code and datasets to facilitate further research.
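To make the last two ingredients concrete, the following is a minimal NumPy sketch of state augmentation $\hat{s}=(s,\tilde{z})$ and of an actor loss in the style of TD3+BC. It is an illustration under stated assumptions, not the authors' implementation: the function names `augment_states` and `td3_bc_actor_loss` are hypothetical, the latent $\tilde{z}$ is taken as given (one vector per episode), and the weighting $\lambda = \alpha / \mathrm{mean}|Q|$ with $\alpha = 2.5$ follows the commonly cited TD3+BC formulation rather than anything stated in this summary.

```python
import numpy as np


def augment_states(states, z_tilde):
    """Append one episode-level latent to every state of that episode.

    states:  array of shape (T, state_dim)
    z_tilde: array of shape (latent_dim,), assumed constant within the episode
    returns: array of shape (T, state_dim + latent_dim)
    """
    states = np.asarray(states, dtype=np.float64)
    z = np.asarray(z_tilde, dtype=np.float64).reshape(-1)
    z_repeated = np.tile(z, (states.shape[0], 1))
    return np.concatenate([states, z_repeated], axis=1)


def td3_bc_actor_loss(q_values, policy_actions, dataset_actions, alpha=2.5):
    """TD3+BC-style actor loss: -lambda * mean(Q) + mean squared BC error.

    q_values:        Q(s_hat, pi(s_hat)) for a batch, shape (B,)
    policy_actions:  pi(s_hat), shape (B, action_dim)
    dataset_actions: actions stored in the offline dataset, shape (B, action_dim)
    alpha:           trade-off hyperparameter (assumed default, not from the source)
    """
    q_values = np.asarray(q_values, dtype=np.float64)
    policy_actions = np.asarray(policy_actions, dtype=np.float64)
    dataset_actions = np.asarray(dataset_actions, dtype=np.float64)
    # Normalize the Q term so its scale is comparable to the BC term.
    lam = alpha / (np.mean(np.abs(q_values)) + 1e-8)
    bc_term = np.mean((policy_actions - dataset_actions) ** 2)
    return -lam * np.mean(q_values) + bc_term
```

In an actual training loop the Q-values would come from a learned critic evaluated on the augmented states, and the loss would be minimized with an automatic-differentiation framework; plain NumPy is used here only to show the arithmetic.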
Abstract
Current Reinforcement Learning (RL) is often limited by the large amount of data needed to learn a successful policy. Offline RL aims to solve this issue by using transitions collected by a different behavior policy. We address a novel Offline RL problem setting in which, while collecting the dataset, the transition and reward functions gradually change between episodes but stay constant within each episode. We propose a method based on Contrastive Predictive Coding that identifies this non-stationarity in the offline dataset, accounts for it when training a policy, and predicts it during evaluation. We analyze our proposed method and show that it performs well in simple continuous control tasks and challenging, high-dimensional locomotion tasks. We show that our method often achieves the oracle performance and performs better than baselines.
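As a rough illustration of the contrastive objective that Contrastive Predictive Coding is built on, the sketch below computes an InfoNCE loss over a batch. It rests on assumptions that are not taken from the paper: the function name `info_nce_loss` is hypothetical, scores are plain dot products, and the negatives for each prediction are simply the other targets in the batch. How the method forms predictions from past episodes and selects negatives is not specified in this summary.

```python
import numpy as np


def info_nce_loss(predictions, targets):
    """InfoNCE loss with in-batch negatives.

    predictions: array of shape (B, D), e.g. predicted latents
    targets:     array of shape (B, D), row i is the positive for prediction i;
                 all other rows serve as negatives (an assumption of this sketch)
    """
    predictions = np.asarray(predictions, dtype=np.float64)
    targets = np.asarray(targets, dtype=np.float64)
    scores = predictions @ targets.T  # (B, B) similarity matrix
    # Subtract the row-wise max for a numerically stable log-softmax.
    scores = scores - scores.max(axis=1, keepdims=True)
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    # The positive pair for row i sits on the diagonal.
    return -np.mean(np.diag(log_probs))
```

When the predictions carry no information about their targets (all scores equal), the loss equals $\log B$ for batch size $B$; it approaches zero as each prediction scores its own target far above the negatives.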
