Random Policy Enables In-Context Reinforcement Learning within Trust Horizons
Weiqin Chen, Santiago Paternain
TL;DR
This paper tackles the practical limitations of in-context reinforcement learning (ICRL) by showing that effective ICRL does not require access to optimal or well-trained policies during pretraining. It introduces State-Action Distillation (SAD), a method that builds the pretraining dataset from interactions under random policies within a trust horizon and pairs it with an autoregressive supervised pretraining objective. The authors provide formal trustworthiness and performance guarantees for SAD and report significant offline and online improvements over state-of-the-art baselines across multiple benchmark environments, including bandits and sparse-reward grid-worlds. The work suggests that SAD, by relying only on random policies, greatly enhances the real-world applicability of ICRL. The authors note that the method is currently limited to discrete action spaces and point to continuous-action settings and more complex environments as future work.
Abstract
Pretrained foundation models (FMs) have exhibited extraordinary in-context learning performance, allowing zero-shot generalization to new tasks not encountered during pretraining. In the case of reinforcement learning (RL), in-context RL (ICRL) emerges when FMs are pretrained on decision-making problems in an autoregressive-supervised manner. Nevertheless, current state-of-the-art ICRL algorithms, such as Algorithm Distillation, Decision Pretrained Transformer, and Decision Importance Transformer, impose stringent requirements on the pretraining dataset concerning the source policies, context information, and action labels. Notably, these algorithms either demand optimal policies or require varying degrees of well-trained behavior policies for all pretraining environments. This significantly hinders the application of ICRL to real-world scenarios, where acquiring optimal or well-trained policies for a substantial volume of real-world training environments can be intractable. To overcome this challenge, we introduce a novel approach, termed State-Action Distillation (SAD), that generates an effective pretraining dataset guided solely by random policies. In particular, SAD selects query states and corresponding action labels by distilling outstanding state-action pairs from the entire state and action spaces, using random policies within a trust horizon, and then inherits the classical autoregressive-supervised mechanism during pretraining. To the best of our knowledge, this is the first work that enables effective ICRL under random policies and random contexts. We also provide a quantitative analysis of the trustworthiness of SAD, along with performance guarantees. Moreover, our empirical results across multiple popular ICRL benchmark environments demonstrate that, on average, SAD outperforms the best baseline by 236.3% in the offline evaluation and by 135.2% in the online evaluation.
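The abstract does not spell out how action labels are distilled, so the following is only a minimal sketch of one plausible reading, not the authors' algorithm. It assumes a hypothetical six-state chain environment, treats the trust horizon simply as a cap on rollout length, and labels a query state with the action whose random-policy rollouts yield the highest average discounted return. All names (`step`, `random_rollout_return`, `distill_action_label`) and all parameter values are illustrative assumptions.

```python
import random

# Hypothetical toy environment (not from the paper): a chain of states 0..5
# with a sparse reward at the right end.
N_STATES = 6
ACTIONS = (0, 1)  # 0 = move left, 1 = move right


def step(state, action):
    """Deterministic chain dynamics; reward 1 only when the next state is the goal."""
    nxt = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward


def random_rollout_return(state, first_action, horizon, gamma, rng):
    """Discounted return of taking first_action, then acting uniformly at random."""
    total, discount, action = 0.0, 1.0, first_action
    for _ in range(horizon):
        state, reward = step(state, action)
        total += discount * reward
        discount *= gamma
        action = rng.choice(ACTIONS)  # random policy after the first step
    return total


def distill_action_label(state, horizon=4, episodes=200, gamma=0.9, seed=0):
    """Assumed labeling rule: pick the action with the best average random-rollout return."""
    rng = random.Random(seed)
    estimates = {}
    for a in ACTIONS:
        returns = [
            random_rollout_return(state, a, horizon, gamma, rng)
            for _ in range(episodes)
        ]
        estimates[a] = sum(returns) / episodes
    label = max(estimates, key=estimates.get)
    return label, estimates


if __name__ == "__main__":
    for s in (4, 0):
        print(s, distill_action_label(s))
```

In this toy setting, the query state next to the goal is labeled with the move toward the goal, while state 0 cannot reach the goal within four steps, so every rollout returns zero and the label carries no information. That only illustrates why the rollout length matters in such a scheme; the paper's formal definition of the trust horizon and its guarantees are not reproduced here.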
