Towards an Information Theoretic Framework of Context-Based Offline Meta-Reinforcement Learning
Lanqing Li, Hai Zhang, Xinyu Zhang, Shatong Zhu, Yang Yu, Junqiao Zhao, Pheng-Ann Heng
TL;DR
The paper unifies context-based offline meta-reinforcement learning (COMRL) methods under a single information-theoretic objective, the mutual information $I(Z; M)$ between the task variable $M$ and its latent representation $Z$. It shows that FOCAL, CORRO, and CSRO optimize, respectively, an upper bound of this quantity, a lower bound of it, and a convex interpolation between the two, with the bounds obtained via a causal decomposition into $I(Z; X_t|X_b)$ and $I(Z; X_b)$. It introduces UNICORN, in a supervised and a self-supervised variant, to optimize $I(Z; M)$, and reports strong in-distribution and out-of-distribution generalization on MuJoCo and MetaWorld benchmarks under varying data quality and model architectures. The framework is presented as model-agnostic, extending to transformer-based backbones and to model-based RL through world models, which the authors position as a principled path toward offline pre-training of foundation models for decision making.
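The decomposition mentioned above can be read as the chain rule of mutual information. As a sketch, assume the context splits as $X = (X_b, X_t)$, with $X_b$ the behavior-related part (e.g. state-action pairs) and $X_t$ the task-related part (e.g. next state and reward). This notation is an assumption; the summary does not define it.

$$
I(Z; X) = I(Z; X_b, X_t) = I(Z; X_b) + I(Z; X_t \mid X_b)
$$

Since $I(Z; X_b) \ge 0$, the conditional term can never exceed $I(Z; X)$. This is consistent with the lower-bound and upper-bound roles described above, i.e. a sandwich of the form

$$
I(Z; X_t \mid X_b) \le I(Z; M) \le I(Z; X),
$$

though the conditions under which both inequalities hold are established in the paper and are not derived here. An interpolation between the two bounds could then be written, with a hypothetical weight $\lambda \in [0, 1]$ introduced only for illustration, as $(1 - \lambda)\, I(Z; X) + \lambda\, I(Z; X_t \mid X_b)$.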
Abstract
As a marriage between offline RL and meta-RL, the advent of offline meta-reinforcement learning (OMRL) has shown great promise in enabling RL agents to multi-task and quickly adapt while acquiring knowledge safely. Within OMRL, context-based OMRL (COMRL) is a popular paradigm that aims to learn a universal policy conditioned on effective task representations. In this work, by examining several key milestones in the field of COMRL, we propose to integrate these seemingly independent methodologies into a unified framework. Most importantly, we show that the pre-existing COMRL algorithms are essentially optimizing the same mutual information objective between the task variable $M$ and its latent representation $Z$ by implementing various approximate bounds. Such theoretical insight offers ample design freedom for novel algorithms. As demonstrations, we propose a supervised and a self-supervised implementation of $I(Z; M)$, and empirically show that the corresponding optimization algorithms exhibit remarkable generalization across a broad spectrum of RL benchmarks, context shift scenarios, data qualities and deep learning architectures. This work lays the information-theoretic foundation for COMRL methods, leading to a better understanding of task representation learning in the context of reinforcement learning. Given its generality, we envision our framework as a promising offline pre-training paradigm of foundation models for decision making.
