Towards an Information Theoretic Framework of Context-Based Offline Meta-Reinforcement Learning

Lanqing Li; Hai Zhang; Xinyu Zhang; Shatong Zhu; Yang Yu; Junqiao Zhao; Pheng-Ann Heng

TL;DR

The paper unifies context-based offline meta-reinforcement learning (COMRL) methods under a single information-theoretic objective, $I(Z; M)$, and shows that FOCAL, CORRO, and CSRO optimize an upper bound, a lower bound, and a convex interpolation of the two, respectively, via a causal decomposition into $I(Z; X_t|X_b)$ and $I(Z; X_b)$. It introduces UNICORN, with a supervised and a self-supervised variant, to optimize $I(Z; M)$, and reports strong in-distribution and out-of-distribution generalization on MuJoCo and MetaWorld benchmarks, including under varying data quality and model architectures. The framework is shown to be model-agnostic and extendable to transformer-based backbones and to model-based RL through world models, offering a principled path toward offline foundation-model pretraining for decision making. Overall, UNICORN provides a theoretical foundation and practical algorithms for robust task representation learning in COMRL.
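To make the decomposition concrete, the identity below is the standard chain rule for mutual information. It assumes that a context sample $X$ splits into two components, $X = (X_b, X_t)$; this reading of the notation is an assumption, since the summary does not define $X_b$ and $X_t$.

```latex
% Chain rule of mutual information, assuming X = (X_b, X_t)
I(Z; X) \;=\; I(Z; X_b, X_t) \;=\; I(Z; X_b) \;+\; I(Z; X_t \mid X_b)
```

Under this reading, the two terms named in the summary add up to the total information that $Z$ carries about a context sample, which is why bounds on $I(Z; M)$ can be phrased in terms of either term or a weighted combination of them.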

Abstract

As a marriage between offline RL and meta-RL, offline meta-reinforcement learning (OMRL) has shown great promise in enabling RL agents to multi-task and quickly adapt while acquiring knowledge safely. Within OMRL, context-based OMRL (COMRL) is a popular paradigm that aims to learn a universal policy conditioned on effective task representations. In this work, by examining several key milestones in the field of COMRL, we propose to integrate these seemingly independent methodologies into a unified framework. Most importantly, we show that the pre-existing COMRL algorithms are essentially optimizing the same mutual information objective between the task variable $M$ and its latent representation $Z$ by implementing various approximate bounds. Such theoretical insight offers ample design freedom for novel algorithms. As demonstrations, we propose a supervised and a self-supervised implementation of $I(Z; M)$, and empirically show that the corresponding optimization algorithms exhibit remarkable generalization across a broad spectrum of RL benchmarks, context shift scenarios, data qualities and deep learning architectures. This work lays the information theoretic foundation for COMRL methods, leading to a better understanding of task representation learning in the context of reinforcement learning. Given its generality, we envision our framework as a promising offline pre-training paradigm of foundation models for decision making.
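As a rough illustration of how a supervised objective can target $I(Z; M)$, the sketch below uses the standard variational bound $I(Z; M) \ge H(M) + \mathbb{E}[\log p_\theta(M|z)]$, so that minimizing a cross-entropy loss raises a lower bound on the mutual information. This is a minimal sketch and not the authors' implementation: the function names, the NumPy setup, and the uniform task prior (which gives $H(M) = \log N$) are all assumptions made for the example.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the task dimension (axis 1).
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def cross_entropy(logits, task_ids):
    # Mean negative log-likelihood of the true task label under a
    # hypothetical classifier p_theta(M | z) that produced `logits`.
    logp = log_softmax(logits)
    return -logp[np.arange(len(task_ids)), task_ids].mean()

def mi_lower_bound(logits, task_ids, num_tasks):
    # Variational lower bound on I(Z; M) in nats.
    # Assumption: tasks are sampled uniformly, so H(M) = log(num_tasks).
    return np.log(num_tasks) - cross_entropy(logits, task_ids)

# Usage: four tasks, one latent per task.
task_ids = np.arange(4)
uninformative = np.zeros((4, 4))   # classifier cannot tell tasks apart
informative = 50.0 * np.eye(4)     # classifier identifies each task
bound_uninformative = mi_lower_bound(uninformative, task_ids, 4)
bound_informative = mi_lower_bound(informative, task_ids, 4)
```

With uninformative logits the bound is zero, and with a near-perfect classifier it approaches $\log N$, the largest value $I(Z; M)$ can take under a uniform prior over $N$ tasks.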

Paper Structure (24 sections, 3 theorems, 23 equations, 8 figures, 7 tables, 2 algorithms)

Introduction
Method
Preliminaries, Problem Statement and Related Work
A Unified Information Theoretic Framework
Instantiations of UNICORN
Experiments
Experimental Setup
Few-Shot Generalization to In-Distribution Data
Few-Shot Generalization to Out-of-Distribution Behavior Policies
Influence of Data Quality
Discussion
Is UNICORN Model-Agnostic?
Can UNICORN be Exploited for Model-Based Paradigms?
Conclusion & Limitation
Pseudo-Code
...and 9 more sections

Key Result

Theorem 2.3

Let $\equiv$ denote equality up to a constant. Then [...] holds up to a constant, where [...]

Figures (8)

Figure 1: Context shift of COMRL in Ant-Dir. Left: Given a task $M^i$ specified by a goal direction (dashed line), the RL agent is trained on data generated by a variety of behavior policies trained on the same task $M^i$ (red). At test time, however, the context might be collected by behavior policies trained on different tasks $\{M^j\}$ (blue), causing a context shift of OOD behavior policies (\ref{sec:ood_experiments}). Middle: Against OOD context, UNICORN (red) is more robust than baselines such as FOCAL (green) in terms of navigating the Ant robot towards the right direction. Right: Besides behavior policy, the task distribution (e.g., goal positions in Ant) can induce significant context shift (\ref{sec:task_ood}), which is also a challenging scenario for COMRL models to generalize.
Figure 2: Graphical Models of COMRL.
Figure 3: Meta-learning procedure of UNICORN-SS. The supervised variant UNICORN-SUP simply replaces the decoder with a classifier $p_{\bm{\theta}}(M|\bm{z})$ and optimizes a cross-entropy loss instead of $\mathcal{L}_{\textup{recon}}$ and $\mathcal{L}_{\textup{FOCAL}}$.
Figure 4: Testing returns of UNICORN against baselines on six benchmarks. Solid curves refer to the mean performance of trials over 6 random seeds, and the shaded areas characterize the standard deviation of these trials.
Figure 5: Testing returns for OOD tasks. The learning curves are averaged over 6 random seeds.
...and 3 more figures

Theorems & Definitions (10)

Definition 2.1: Task Representation Learning
Definition 2.2: Causal Decomposition
Theorem 2.3: Central Theorem
proof
Theorem 2.4: Concentration bound for supervised UNICORN
proof
Lemma B.1
proof
proof
proof
