Random Policy Enables In-Context Reinforcement Learning within Trust Horizons

Weiqin Chen, Santiago Paternain

TL;DR

This paper tackles the practical limitations of in-context reinforcement learning (ICRL) by showing that effective ICRL does not require access to optimal or well-trained policies during pretraining. It introduces State-Action Distillation (SAD), a method that builds pretraining data from interactions under random policies within a trust horizon, coupled with an autoregressive supervised pretraining objective. The authors provide formal trustworthiness and performance guarantees for SAD and demonstrate significant offline and online improvements over SOTA baselines across multiple benchmark environments, including bandits and sparse-reward grid-worlds. The work suggests that random-policy SAD greatly enhances the real-world applicability of ICRL, while noting current limitations to discrete actions and pointing to future work on continuous-action settings and more complex environments.

Abstract

Pretrained foundation models (FMs) have exhibited extraordinary in-context learning performance, allowing zero-shot generalization to new tasks not encountered during pretraining. In reinforcement learning (RL), in-context RL (ICRL) emerges when FMs are pretrained on decision-making problems in an autoregressive-supervised manner. Nevertheless, current state-of-the-art ICRL algorithms, such as Algorithm Distillation, Decision Pretrained Transformer, and Decision Importance Transformer, impose stringent requirements on the pretraining dataset concerning the source policies, context information, and action labels. Notably, these algorithms either demand optimal policies or require varying degrees of well-trained behavior policies for all pretraining environments. This significantly hinders the application of ICRL to real-world scenarios, where acquiring optimal or well-trained policies for a substantial volume of real-world training environments can be intractable. To overcome this challenge, we introduce a novel approach, termed State-Action Distillation (SAD), that generates an effective pretraining dataset guided solely by random policies. In particular, SAD selects query states and corresponding action labels by distilling outstanding state-action pairs from the entire state and action spaces using random policies within a trust horizon, and then inherits the classical autoregressive-supervised mechanism during pretraining. To the best of our knowledge, this is the first work that enables effective ICRL under random policies and random contexts. We also establish a quantitative analysis of the trustworthiness of SAD as well as its performance guarantees. Moreover, our empirical results across multiple popular ICRL benchmark environments demonstrate that, on average, SAD outperforms the best baseline by 236.3% in the offline evaluation and by 135.2% in the online evaluation.

Paper Structure

This paper contains 44 sections, 7 theorems, 82 equations, 13 figures, 5 tables, 6 algorithms.

Key Result

Theorem 1

Let the bounded-reward assumption (assumption_bound_reward) hold. The random policy is at least $(1-\delta)$-trustworthy in the sense of Definition 1 (MAB) when the trust horizon $N$ satisfies

Figures (13)

  • Figure 1: Schematic of the State-Action Distillation approach: i) Collecting the context by using the random policy to interact with pretraining environments. ii) Sampling a query state randomly from the state space. iii) Starting from the query state and any action in the action space, rolling out the random policy for the trust horizon, and taking as the action label the action that yields the maximal return. iv) Pretraining foundation models in a supervised manner to predict the action label given the context and query state.
  • Figure 2: A one-dimensional grid world MDP comprising five states $\{s_0, s_1, s_2, s_3, s_4\}$, where $s_0$ represents the goal state (golden star). The environment offers two possible actions: $a_0$ (go left) and $a_1$ (go right). Any transition that would cross the left or right boundary leaves the agent in its current position. The reward structure is sparse, with a value of 1 received solely upon reaching the unique goal state $s_0$ and a value of 0 otherwise. We consider an infinite time horizon with a discount factor $\gamma$.
  • Figure 3: Offline and online evaluations of ICRL algorithms trained under a uniform random policy: AD, DPT, DIT, DPT$^*$, and SAD (ours). Each algorithm contains four independent runs with mean and standard deviation. Gaussian Bandits: (a) and (b), Bernoulli Bandits: (c) and (d).
  • Figure 4: Offline and online evaluations of ICRL algorithms trained under a uniform random policy: AD, DPT, DIT, DPT$^*$, and SAD (ours). Each algorithm contains four independent runs with mean and standard deviation. DarkRoom: (a) and (b). DarkRoom-Large: (c) and (d).
  • Figure 5: Offline and online evaluations of ICRL algorithms trained under a uniform random policy: AD, DPT, DIT, DPT$^*$, and SAD (ours). Each algorithm contains four independent runs with mean and standard deviation. Environment: Miniworld.
  • ...and 8 more figures
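
Step iii) of Figure 1 can be illustrated on the five-state grid world of Figure 2. The following is a minimal sketch, not the authors' implementation. The function names (`step`, `rollout_return`, `distill_action_label`), the value $\gamma = 0.9$, the averaging over `n_rollouts` Monte Carlo rollouts, and the convention that the reward is paid whenever the next state is $s_0$ are all assumptions made for illustration.

```python
import random

N_STATES = 5        # s_0 .. s_4, with s_0 the goal state
ACTIONS = (0, 1)    # 0 = a_0 (go left), 1 = a_1 (go right)


def step(state, action):
    """One transition; a move that would cross a boundary keeps the agent in place."""
    move = -1 if action == 0 else 1
    next_state = min(max(state + move, 0), N_STATES - 1)
    # Assumption: reward 1 is paid whenever the next state is the goal s_0.
    reward = 1.0 if next_state == 0 else 0.0
    return next_state, reward


def rollout_return(state, first_action, horizon, gamma, rng):
    """Discounted return over `horizon` steps: take `first_action`,
    then act uniformly at random for the remaining steps."""
    total, discount, action = 0.0, 1.0, first_action
    for _ in range(horizon):
        state, reward = step(state, action)
        total += discount * reward
        discount *= gamma
        action = rng.choice(ACTIONS)  # uniform random policy
    return total


def distill_action_label(query_state, horizon, gamma=0.9, n_rollouts=200, seed=0):
    """Return (label, mean_returns): the action whose random-policy rollouts
    within the trust horizon give the largest average return."""
    rng = random.Random(seed)
    mean_returns = []
    for action in ACTIONS:
        returns = [
            rollout_return(query_state, action, horizon, gamma, rng)
            for _ in range(n_rollouts)
        ]
        mean_returns.append(sum(returns) / n_rollouts)
    label = max(ACTIONS, key=lambda a: mean_returns[a])  # ties go to action 0
    return label, mean_returns
```

Calling `distill_action_label(s, horizon=N)` for each query state `s` yields one (query state, action label) pair per call. In this sketch, a horizon shorter than the distance to the goal makes every estimate zero: from $s_4$ with `horizon=3`, no rollout can reach $s_0$, so the label is decided only by the tie-break. This shows concretely why the length of the trust horizon matters in a sparse-reward environment.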

Theorems & Definitions (9)

  • Definition 1: MAB
  • Definition 2: MDP
  • Theorem 1: MAB
  • Theorem 2: MDP
  • Corollary 1
  • Corollary 2
  • Lemma 1
  • Proposition 1
  • Lemma 2