In-Context Curiosity: Distilling Exploration for Decision-Pretrained Transformers on Bandit Tasks

Huitao Yang, Guanting Chen

TL;DR

The paper tackles the generalization gap of Decision-Pretrained Transformers (DPTs) when pretraining data is biased or limited. It introduces in-context curiosity, an offline exploration regularizer, and the Prediction-Powered Transformer (PPT), which adds a reward predictor that injects curiosity into pretraining to improve out-of-distribution robustness in bandit tasks. In Gaussian bandit experiments, PPT reduces variance-driven performance degradation and demonstrates improved exploration compared to DPT, especially under tricky pretraining data, while highlighting data quality as a fundamental constraint. Overall, the work presents curiosity-driven pretraining as a practical, scalable direction to enhance in-context RL generalization, with clear pathways for extension to stateful environments and adaptive exploration strategies.

Abstract

As large language models (LLMs) continue to grow in capability, there is increasing interest in incorporating them into decision-making tasks. A common pipeline for this is Decision-Pretrained Transformers (DPTs). However, existing training methods for DPTs often struggle to generalize beyond their pretraining data distribution. To explore mitigation of this limitation, we propose in-context curiosity -- a lightweight, exploration-inspired regularizer for offline pretraining -- and introduce the Prediction-Powered Transformer (PPT) framework. PPT augments DPT with an auxiliary reward predictor, using prediction error as an intrinsic curiosity signal to encourage broader exploration during training. In proof-of-concept experiments on Gaussian multi-armed bandits, PPT shows improved robustness: it moderates the performance degradation observed in DPT when test environments exhibit higher variance in reward, particularly when pretraining data has limited diversity. While the quality of offline data remains fundamental, our preliminary results suggest that curiosity-driven pretraining offers a promising direction for enhancing out-of-distribution generalization in in-context RL agents.
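
The idea of using prediction error as an intrinsic curiosity signal can be sketched as follows. This is a minimal illustration, not the paper's objective: the function names, the squared-error form, the subtraction of the curiosity term from the policy loss, and the `weight` parameter are all assumptions made for the example.

```python
def squared_error(predicted, observed):
    """Squared prediction error for a single reward (assumed error form)."""
    return (predicted - observed) ** 2


def curiosity_regularized_loss(policy_loss, predicted_rewards, observed_rewards, weight=0.1):
    """Combine a policy loss with a curiosity term built from prediction errors.

    Hypothetical sketch: the mean squared error of a reward predictor is
    treated as a curiosity signal and subtracted from the policy loss, so
    that contexts where the predictor is surprised lower the total loss.
    The sign and the weighting are assumptions, not the paper's formula.
    """
    if not predicted_rewards or len(predicted_rewards) != len(observed_rewards):
        raise ValueError("need equally sized, non-empty reward lists")
    errors = [squared_error(p, r) for p, r in zip(predicted_rewards, observed_rewards)]
    curiosity = sum(errors) / len(errors)
    return policy_loss - weight * curiosity, curiosity


# Example: the predictor is exact on the first reward and off by 1.0 on the second.
total, curiosity = curiosity_regularized_loss(1.0, [0.5, 0.5], [0.5, 1.5])
```

In this toy call the curiosity term is the mean of the two squared errors, and a larger `weight` would let it influence the total more strongly; how the paper balances the two terms is not stated in this summary.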

Paper Structure

This paper contains 28 sections, 16 equations, 10 figures, 1 algorithm.

Figures (10)

  • Figure 1: An illustrative diagram of in-context curiosity during pretraining. An additional round of prediction and self-reflection is incorporated to encourage exploration.
  • Figure 2: Average regret across increasing $\sigma^2_{\text{test}}$ for (left) ideal and (right) tricky pretraining data. In both settings, PPT shows a lower variance-induced degradation than DPT; the effect is smaller under ideal data and larger under tricky data.
  • Figure 3: Representative performance of PPT and DPT on test environments with $\sigma^2_{\text{test}}=0.3$ (top), $\sigma^2_{\text{test}}=0.5$ (middle), and $\sigma^2_{\text{test}}=0.9$ (bottom), using "tricky" pretraining data (Definition A.3: more biased towards the expert policy, low variance). PPT models exhibit improved generalization relative to DPT.
  • Figure 4: Average regret as test variance $\sigma^2_{\text{test}}$ increases ($\geq 0.5$). Left: results with ideal pretraining data. Right: results with tricky pretraining data. As test variance increases, the performance gap diminishes, with PPT converging toward DPT’s regret levels.
  • Figure 5: Policy loss shares similar dynamics with predictor loss during training.
  • ...and 5 more figures

Theorems & Definitions (3)

  • Definition A.1: Gaussian Bandit
  • Definition A.2: Ideal dataset
  • Definition A.3: Tricky dataset
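
As a rough illustration of the setting named in Definition A.1 and the regret metric plotted in the figures, the sketch below assumes a standard Gaussian bandit in which each arm has a fixed mean and all arms share one reward variance. The class name `GaussianBandit`, its methods, and the example means are hypothetical and are not taken from the paper; the paper's own definitions may differ in detail.

```python
import random


class GaussianBandit:
    """Multi-armed bandit whose rewards are Gaussian around per-arm means.

    Assumption: all arms share one variance, playing the role of
    sigma^2_test in the figure captions.
    """

    def __init__(self, means, variance, seed=0):
        if variance < 0:
            raise ValueError("variance must be non-negative")
        self.means = list(means)
        self.std = variance ** 0.5
        self.rng = random.Random(seed)  # seeded for reproducible draws

    def pull(self, arm):
        """Sample one noisy reward from the chosen arm."""
        return self.rng.gauss(self.means[arm], self.std)

    def regret(self, arms):
        """Cumulative expected regret of a sequence of arm choices."""
        best = max(self.means)
        return sum(best - self.means[a] for a in arms)


# Example: three arms, reward variance 0.3 (hypothetical values).
bandit = GaussianBandit(means=[0.1, 0.5, 0.9], variance=0.3)
always_best = bandit.regret([2, 2, 2])   # the best arm every round
mixed = bandit.regret([0, 1, 2])         # one pull of each arm
```

Regret here is computed from the true means rather than the sampled rewards, so it does not depend on the noise level directly; a higher variance affects regret only through the arm choices an agent makes after seeing noisier rewards.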