Yes, Q-learning Helps Offline In-Context RL

Denis Tarasov, Alexander Nikulin, Ilya Zisman, Albina Klepach, Andrei Polubarov, Nikita Lyubaykin, Alexander Derevyagin, Igor Kiselev, Vladislav Kurenkov

TL;DR

This work shows that optimizing RL objectives in offline in-context reinforcement learning (ICRL) substantially improves performance over supervised baselines such as Algorithm Distillation (AD), across a wide range of GridWorld and MuJoCo datasets and in data-sparse XLand-MiniGrid settings. The approach augments the Transformer-based ICRL architecture with value and policy heads and trains them with offline RL losses, which aligns learning with the RL reward-maximization goal. This yields gains of roughly 30% on average and robust performance under limited dataset coverage and imperfect learning histories. The findings establish offline RL as a promising direction for advancing ICRL, while also revealing practical limitations and avenues for future work, including handling OOD dynamics and scaling to more complex environments. Overall, explicit RL-objective optimization strengthens the adaptability and effectiveness of offline meta-reinforcement learning systems in scenarios where online interaction is restricted.

Abstract

Existing offline in-context reinforcement learning (ICRL) methods have predominantly relied on supervised training objectives, which are known to have limitations in offline RL settings. In this study, we explore the integration of RL objectives within an offline ICRL framework. Through experiments on more than 150 datasets derived from GridWorld and MuJoCo environments, we demonstrate that directly optimizing RL objectives improves performance by approximately 30% on average compared to the widely adopted Algorithm Distillation (AD), across various dataset coverages, structures, expertise levels, and environmental complexities. Furthermore, in the challenging XLand-MiniGrid environment, RL objectives doubled the performance of AD. Our results also reveal that adding conservatism during value learning brings further improvements in almost all settings tested. Our findings emphasize the importance of aligning ICRL learning objectives with the RL reward-maximization goal, and demonstrate that offline RL is a promising direction for advancing ICRL.
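
To make the idea of "an RL objective with conservatism during value learning" concrete, the following is a minimal sketch for a single transition with discrete actions. It is not the authors' implementation: the function names, the CQL-style log-sum-exp penalty, and the `gamma` and `alpha` defaults are all assumptions chosen for illustration. In the paper's setting the Q-values would come from a head on top of the Transformer context embedding. Here they are plain lists of floats.

```python
import math


def td_target(reward, done, next_q_values, gamma=0.99):
    """One-step Q-learning target: r + gamma * max_a' Q(s', a').

    The bootstrap term is dropped when the episode has ended.
    """
    bootstrap = 0.0 if done else max(next_q_values)
    return reward + gamma * bootstrap


def conservative_penalty(q_values, dataset_action):
    """CQL-style regularizer (an assumed choice of conservatism):
    logsumexp_a Q(s, a) - Q(s, a_data).

    It is non-negative and shrinks as the dataset action dominates.
    """
    m = max(q_values)  # subtract the max for numerical stability
    lse = m + math.log(sum(math.exp(q - m) for q in q_values))
    return lse - q_values[dataset_action]


def q_loss(q_values, action, reward, done, next_q_values,
           gamma=0.99, alpha=1.0):
    """Squared TD error plus an alpha-weighted conservative penalty."""
    target = td_target(reward, done, next_q_values, gamma)
    td_error = q_values[action] - target
    return 0.5 * td_error ** 2 + alpha * conservative_penalty(q_values, action)
```

Setting `alpha=0.0` recovers a plain Q-learning loss, which gives a simple way to compare the objective with and without the conservative term in this sketch.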

Paper Structure

This paper contains 42 sections, 2 equations, 27 figures, 50 tables.

Figures (27)

  • Figure 1: Mean test NAUC scores across environments averaged over all constructed datasets. NAUC is a normalized AUC of the test-time performance curve; see the evaluation subsection for details.
  • Figure 2: Overview of the proposed approach. As the input, our model takes a sequence of trajectories (without hard requirements on their structure) where each transition is represented with a tuple consisting of previous action, previous reward, previous episode's done flag, current episode timestep and other sequence elements marked by different timestep subscripts ($t$ and $T$) to indicate their potential origin from distinct trajectories. Then the resulting context embedding $c_t$ is used to predict both value functions and the policy output $\pi$. The V-head is employed only in IQL, while the $\pi$ head is used exclusively for continuous actions. Dashed arrows denote the absence of gradient flow.
  • Figure 3: Top: algorithm performance across tracked metrics, averaged over all discrete datasets. Bottom: rliable performance profiles of NAUC. Left: train targets. Right: test targets.
  • Figure 4: NAUC score comparison between the considered approaches for varying dataset coverage, in terms of the number of train targets and histories per target. Averaged over 4 test random seeds. Confidence intervals depict std across seeds.
  • Figure 5: rliable performance profiles of NAUC for discrete datasets of varying expertise. Top, from left to right: early, mid, late datasets. Bottom: complete learning histories.
  • ...and 22 more figures
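
The NAUC metric named in the Figure 1 caption is described only as a normalized AUC of the test-time performance curve. The sketch below shows one plausible way to compute such a score. It rests on assumptions not given here: the curve is a list of per-episode returns, normalization is min-max against hypothetical `min_return` and `max_return` bounds, and the area is taken with the trapezoidal rule. The paper's evaluation subsection has the actual definition.

```python
def nauc(returns, min_return, max_return):
    """Normalized area under a performance curve (illustrative definition).

    returns: per-episode test-time returns, in evaluation order.
    min_return, max_return: assumed bounds used for min-max normalization.
    The result is 1.0 for a curve that sits at max_return throughout.
    """
    if len(returns) < 2:
        raise ValueError("need at least two points to compute an area")
    if max_return <= min_return:
        raise ValueError("max_return must exceed min_return")

    span = max_return - min_return
    norm = [(r - min_return) / span for r in returns]

    # Trapezoidal rule with unit spacing between consecutive episodes.
    area = sum((a + b) / 2.0 for a, b in zip(norm, norm[1:]))

    # Divide by the curve's width so the score does not grow with length.
    return area / (len(norm) - 1)
```

Under this definition, an agent that improves linearly from the lower to the upper bound scores 0.5, while one that reaches the upper bound quickly and stays there scores close to 1.0, so the metric rewards fast in-context adaptation as well as final performance.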