Yes, Q-learning Helps Offline In-Context RL
Denis Tarasov, Alexander Nikulin, Ilya Zisman, Albina Klepach, Andrei Polubarov, Nikita Lyubaykin, Alexander Derevyagin, Igor Kiselev, Vladislav Kurenkov
TL;DR
This work demonstrates that optimizing RL objectives in offline in-context reinforcement learning (ICRL) substantially improves performance over supervised baselines such as Algorithm Distillation, across a wide range of GridWorld and MuJoCo datasets and even in data-sparse XLand-MiniGrid settings. By augmenting the Transformer-based ICRL architecture with value and policy heads and training them with offline RL losses, the approach aligns learning with the RL reward-maximization goal, yielding average gains of roughly 30% and robust performance under limited dataset coverage and imperfect learning histories. The findings establish offline RL as a promising direction for advancing offline ICRL, while also revealing practical limitations and avenues for future work, including handling out-of-distribution (OOD) dynamics and scaling to more complex environments. Overall, explicitly optimizing RL objectives in ICRL strengthens the adaptability and effectiveness of offline meta-reinforcement learning systems in scenarios where online interaction is restricted.
Abstract
Existing offline in-context reinforcement learning (ICRL) methods have predominantly relied on supervised training objectives, which are known to have limitations in offline RL settings. In this study, we explore the integration of RL objectives within an offline ICRL framework. Through experiments on more than 150 datasets derived from GridWorld and MuJoCo environments, we demonstrate that optimizing RL objectives directly improves performance by approximately 30% on average compared to the widely adopted Algorithm Distillation (AD), across various dataset coverages, structures, expertise levels, and environmental complexities. Furthermore, in the challenging XLand-MiniGrid environment, RL objectives double the performance of AD. Our results also reveal that adding conservatism during value learning brings further improvements in almost all settings tested. Our findings emphasize the importance of aligning ICRL learning objectives with the RL reward-maximization goal, and demonstrate that offline RL is a promising direction for advancing ICRL.
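To make the idea of conservatism during value learning concrete, the following is a minimal sketch of one common form of it: a one-step TD loss plus a CQL-style penalty for discrete actions. It is an illustration under stated assumptions, not the authors' implementation. The text above does not say which conservative objective is used, and the function name `conservative_q_loss`, its arguments, and the defaults `gamma` and `alpha` are all hypothetical. The Q-values are assumed to come from a value head evaluated at each step of the in-context history.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def conservative_q_loss(q, q_next_target, actions, rewards, dones,
                        gamma=0.99, alpha=1.0):
    """TD loss plus a CQL-style penalty (illustrative; all names are assumptions).

    q             -- per-step lists of Q-values, one entry per discrete action
    q_next_target -- per-step lists of target-network Q-values at the next state
    actions       -- action indices taken in the dataset
    rewards       -- observed rewards
    dones         -- 1.0 if the episode ended at that step, else 0.0
    """
    td_terms, penalty_terms = [], []
    for q_t, q_next, a, r, d in zip(q, q_next_target, actions, rewards, dones):
        # One-step bootstrapped target; no bootstrapping past terminal steps.
        target = r + gamma * (1.0 - d) * max(q_next)
        td_terms.append((q_t[a] - target) ** 2)
        # Penalty: push down Q-values over all actions (via logsumexp) while
        # pushing up the Q-value of the action actually present in the data.
        penalty_terms.append(logsumexp(q_t) - q_t[a])
    td_loss = sum(td_terms) / len(td_terms)
    penalty = sum(penalty_terms) / len(penalty_terms)
    return td_loss + alpha * penalty, td_loss, penalty
```

Setting `alpha=0.0` reduces this sketch to a plain Q-learning loss, so under these assumptions the penalty weight is the single knob that controls how pessimistic the value estimates are about actions absent from the dataset.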
