Pretraining Decision Transformers with Reward Prediction for In-Context Multi-task Structured Bandit Learning

Subhojyoti Mukherjee; Josiah P. Hanna; Qiaomin Xie; Robert Nowak

TL;DR

This work addresses learning-to-learn in multi-task structured bandits and the challenge of deriving a near-optimal in-context policy without access to optimal actions. It introduces PreDeToR, a pretraining strategy that trains a transformer to predict per-action rewards from short in-context histories, enabling in-context learning that exploits shared structure across tasks. Empirically, PreDeToR matches or surpasses baseline in-context methods across linear, nonlinear, bilinear, and latent bandits, while theory provides generalization guarantees showing transfer risk decreases as the number of source tasks increases. The approach reduces reliance on privileged data and supports rapid adaptation to unseen tasks, with implications for recommendation, exploration, and offline transfer learning across structured decision problems.
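As a rough illustration of how predicted per-action rewards could drive action selection at deployment, the sketch below picks an arm from a vector of predictions. The function name `select_action`, the optional softmax `temperature`, and the hard-coded prediction vector are illustrative assumptions, not details taken from the paper. In PreDeToR the predictions would come from the pretrained transformer conditioned on the in-context history.

```python
import math
import random


def select_action(predicted_rewards, temperature=None, rng=None):
    """Choose an action index from predicted per-action rewards.

    temperature=None  -> greedy: return the arm with the highest prediction.
    temperature > 0   -> sample from softmax(predicted_rewards / temperature).
    Both modes are assumptions made for this sketch.
    """
    if temperature is None:
        return max(range(len(predicted_rewards)),
                   key=lambda a: predicted_rewards[a])
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if rng is None:
        rng = random.Random()
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(predicted_rewards)
    weights = [math.exp((r - top) / temperature) for r in predicted_rewards]
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


# Hypothetical predictions for three arms (stand-in for a model's output).
predictions = [0.1, 0.9, 0.3]
greedy_arm = select_action(predictions)
sampled_arm = select_action(predictions, temperature=0.5,
                            rng=random.Random(0))
```

The greedy mode exploits the current predictions, while a positive temperature spreads probability over arms with lower predicted reward, which is one simple way to keep exploring.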

Abstract

We study learning to learn for the multi-task structured bandit problem where the goal is to learn a near-optimal algorithm that minimizes cumulative regret. The tasks share a common structure and an algorithm should exploit the shared structure to minimize the cumulative regret for an unseen but related test task. We use a transformer as a decision-making algorithm to learn this shared structure from data collected by a demonstrator on a set of training task instances. Our objective is to devise a training procedure such that the transformer will learn to outperform the demonstrator's learning algorithm on unseen test task instances. Prior work on pretraining decision transformers either requires privileged information like access to optimal arms or cannot outperform the demonstrator. Going beyond these approaches, we introduce a pre-training approach that trains a transformer network to learn a near-optimal policy in-context. This approach leverages the shared structure across tasks, does not require access to optimal actions, and can outperform the demonstrator. We validate these claims over a wide variety of structured bandit problems to show that our proposed solution is general and can quickly identify expected rewards on unseen test tasks to support effective exploration.
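The objective named above, cumulative regret, can be made concrete with a small sketch. It assumes the standard bandit definition (the gap between the best arm's mean reward and the chosen arm's mean reward, summed over rounds). The function name `cumulative_regret` and the toy mean rewards are hypothetical and are not drawn from the paper's experiments.

```python
def cumulative_regret(mean_rewards, chosen_actions):
    """Return the running cumulative regret after each round.

    mean_rewards:   expected reward of every arm in one task.
    chosen_actions: arm index played in each round.
    Assumes the standard definition: per-round regret is the best arm's
    mean reward minus the chosen arm's mean reward.
    """
    best = max(mean_rewards)
    total = 0.0
    curve = []
    for action in chosen_actions:
        total += best - mean_rewards[action]
        curve.append(total)
    return curve


# Toy task with three arms; arm 2 is optimal.
curve = cumulative_regret([0.25, 0.5, 1.0], [0, 2, 1, 2])
print(curve)  # [0.75, 0.75, 1.25, 1.25]
```

A learner that identifies the best arm quickly produces a curve that flattens early, which is the behaviour the regret plots in the figures below are meant to show.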

Paper Structure (38 sections, 3 theorems, 54 equations, 21 figures, 1 table, 2 algorithms)

Introduction
Contributions
Background
Preliminaries
In-Context Learning Model
Related In-context Learning Algorithms
The PreDeToR Algorithm
Pre-training Next Reward Prediction
Deploying PreDeToR
Empirical Study: Non-Linear Structure
Empirical Study: Linear Structure and Understanding PreDeToR's Exploration
Empirical Study: Importance of Shared Structure and Introducing New Actions
Data Collection Analysis
Theoretical Analysis of Generalization
Conclusions, Limitations and Future Works
...and 23 more sections

Key Result

Theorem 8.2

(PreDeToR risk) Suppose the error stability assumption holds and the loss function $\ell(\cdot,\cdot)$ is $C$-Lipschitz for all $r_t \in [0,B]$ and horizon $n \geq 1$. Let $\widehat{\mathrm{TF}}$ be the empirical solution of (ERM) and let $\mathcal{N}(\mathrm{Alg}, \rho, \varepsilon)$ be the covering number of the transformer $\widehat{\mathrm{TF}}$ ...

Figures (21)

Figure 1: Non-linear regime. The horizontal axis is the number of rounds. Confidence bars show one standard error.
Figure 2: Linear experiment. The horizontal axis is the number of rounds. Confidence bars show one standard error.
Figure 3: Exploration Analysis of PreDeToR (-$\tau$)
Figure 4: Linear new action experiments. The horizontal axis is the number of rounds. Confidence bars show one standard error.
Figure 5: New action experiments in the non-linear setting.
...and 16 more figures

Theorems & Definitions (11)

Theorem 8.2
proof
Theorem C.1
proof
Definition C.2
Definition C.3
Remark C.4
Remark C.5
Theorem C.6
proof
...and 1 more
