Prompt Tuning Decision Transformers with Structured and Scalable Bandits

Finn Rietz, Oleg Smirnov, Sara Karimi, Lele Cao

TL;DR

This work tackles offline multi-task reinforcement learning with Prompting Decision Transformers (PDT) by addressing the inefficiency of uniformly sampling trajectory prompts. It introduces a bandit-based prompt-tuning framework that constructs optimized trajectory prompts at inference time, using a structured bandit architecture with one arm per prompt segment, and leverages the pre-trained PDT as a fixed feature extractor to keep inputs compact. The authors prove a regret bound for the bandit, showing $\mathrm{Regret}(K) \leq \frac{1}{J} \sum_{j} \mathrm{Regret}_j(K) + 2K\varepsilon$, and demonstrate sublinear growth under standard assumptions, while empirically validating improvements across MuJoCo, Meta-World, and OOD scenarios. The approach provides a scalable, data-efficient means to adapt PDT to new tasks without backbone finetuning, enhancing robustness and generalization in offline settings with high-dimensional observations.

Abstract

Prompt tuning has emerged as a key technique for adapting large pre-trained Decision Transformers (DTs) in offline Reinforcement Learning (RL), particularly in multi-task and few-shot settings. The Prompting Decision Transformer (PDT) enables task generalization via trajectory prompts sampled uniformly from expert demonstrations -- without accounting for prompt informativeness. In this work, we propose a bandit-based prompt-tuning method that learns to construct optimal trajectory prompts from demonstration data at inference time. We devise a structured bandit architecture operating in the trajectory prompt space, achieving linear rather than combinatorial scaling with prompt size. Additionally, we show that the pre-trained PDT itself can serve as a powerful feature extractor for the bandit, enabling efficient reward modeling across various environments. We theoretically establish regret bounds and demonstrate empirically that our method consistently enhances performance across a wide range of tasks, high-dimensional environments, and out-of-distribution scenarios, outperforming existing baselines in prompt tuning.
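
To make the "linear rather than combinatorial" scaling concrete, the following is a minimal sketch and not the authors' implementation. It keeps one small bandit per prompt slot, so only $J \cdot H$ value estimates are maintained instead of one per joint prompt ($H^J$). Everything in it is an illustrative assumption: the epsilon-greedy rule and tabular sample averages stand in for the paper's PDT-feature-based reward models, and `SlotBandit`, `tune_prompt`, `SEGMENT_VALUE`, and `synthetic_return` are hypothetical names. The synthetic return replaces an actual PDT rollout.

```python
import random


class SlotBandit:
    """Epsilon-greedy bandit over the H candidate segments of one prompt slot."""

    def __init__(self, n_choices, epsilon, rng):
        self.counts = [0] * n_choices
        self.means = [0.0] * n_choices
        self.epsilon = epsilon
        self.rng = rng

    def greedy(self):
        # Index of the segment with the highest estimated return so far.
        return max(range(len(self.means)), key=self.means.__getitem__)

    def select(self):
        # Explore uniformly with probability epsilon, otherwise exploit.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.means))
        return self.greedy()

    def update(self, choice, reward):
        # Incremental sample-average update of the chosen segment's estimate.
        self.counts[choice] += 1
        self.means[choice] += (reward - self.means[choice]) / self.counts[choice]


def tune_prompt(rollout_return, n_slots, n_choices, n_rounds, epsilon=0.2, seed=0):
    """Pick one segment per slot; every slot is credited with the shared return."""
    rng = random.Random(seed)
    bandits = [SlotBandit(n_choices, epsilon, rng) for _ in range(n_slots)]
    for _ in range(n_rounds):
        prompt = tuple(b.select() for b in bandits)
        g = rollout_return(prompt)  # stand-in for the online return of a rollout
        for bandit, choice in zip(bandits, prompt):
            bandit.update(choice, g)
    return tuple(b.greedy() for b in bandits)


# Assumed toy problem: J = 3 slots, H = 4 candidate segments per slot.
# The return is the mean of per-slot values plus small Gaussian noise.
SEGMENT_VALUE = [
    [0.1, 1.0, 0.3, 0.2],
    [0.4, 0.0, 0.2, 1.0],
    [1.0, 0.3, 0.1, 0.4],
]
noise_rng = random.Random(1)


def synthetic_return(prompt):
    mean_value = sum(SEGMENT_VALUE[j][c] for j, c in enumerate(prompt)) / len(prompt)
    return mean_value + noise_rng.gauss(0.0, 0.05)


tuned = tune_prompt(synthetic_return, n_slots=3, n_choices=4, n_rounds=3000)
print("tuned prompt:", tuned)
print("estimates kept:", 3 * 4, "vs. joint prompts:", 4 ** 3)
```

The per-slot exploration is deliberately randomized. If every slot used the same deterministic rule, all slots would see identical rewards, stay in identical states, and always pick the same index.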

Paper Structure

This paper contains 25 sections, 3 theorems, 16 equations, 11 figures, 8 tables, 3 algorithms.

Key Result

Theorem 4.1

Assume that the reward function $G\colon P^J \to \mathbb{R}$ for a prompt $\rho = (\tilde{\tau}_1,\dots,\tilde{\tau}_J)$ decomposes as the mean of $J$ independent reward models $\phi_j(\tilde{\tau}_j)$ and that the interaction term $h$ is uniformly bounded by $|h(\tilde{\tau}_1,\dots,\tilde{\tau}_J)| \leq \varepsilon,\quad \forall\, \tilde{\tau}_j \in P.$ Let $\rho^* = (\tilde{\tau}^*_1,\dots,\til
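
The theorem statement is cut off in this listing. As a hedged sketch only, the bound quoted in the TL;DR can be recovered from the two stated assumptions, under definitions that are assumed here rather than taken from the source: $\mathrm{Regret}(K)$ is the cumulative regret over $K$ rounds against the best prompt $\rho^*$, $\rho_k = (\tilde{\tau}_{k,1},\dots,\tilde{\tau}_{k,J})$ is the prompt played in round $k$, and $\mathrm{Regret}_j(K)$ is the regret of slot $j$ against the maximizer of $\phi_j$.

$$
\begin{aligned}
G(\rho) &= \frac{1}{J}\sum_{j=1}^{J} \phi_j(\tilde{\tau}_j) + h(\tilde{\tau}_1,\dots,\tilde{\tau}_J), \\
\mathrm{Regret}(K) &= \sum_{k=1}^{K} \bigl[ G(\rho^*) - G(\rho_k) \bigr] \\
&= \frac{1}{J}\sum_{j=1}^{J}\sum_{k=1}^{K} \bigl[ \phi_j(\tilde{\tau}^*_j) - \phi_j(\tilde{\tau}_{k,j}) \bigr] + \sum_{k=1}^{K} \bigl[ h(\rho^*) - h(\rho_k) \bigr] \\
&\leq \frac{1}{J}\sum_{j=1}^{J}\sum_{k=1}^{K} \Bigl[ \max_{\tilde{\tau} \in P} \phi_j(\tilde{\tau}) - \phi_j(\tilde{\tau}_{k,j}) \Bigr] + 2K\varepsilon \\
&= \frac{1}{J}\sum_{j=1}^{J} \mathrm{Regret}_j(K) + 2K\varepsilon .
\end{aligned}
$$

The inequality uses $\phi_j(\tilde{\tau}^*_j) \leq \max_{\tilde{\tau} \in P} \phi_j(\tilde{\tau})$ and $|h| \leq \varepsilon$, so each round contributes at most $2\varepsilon$ from the interaction term. The $2K\varepsilon$ term is linear in $K$, so under this reading the total is sublinear only when $\varepsilon$ is zero or negligible and each $\mathrm{Regret}_j(K)$ is itself sublinear.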

Figures (11)

  • Figure 1: Overview of our bandit-based prompt-tuning method for multi-task learning with PDT. Each $z_i$ represents a triplet $(\hat{r}_i, \mathbf{s}_i, \mathbf{a}_i)$, each $\tilde{\tau}$ represents a prompt segment, each $\tau$ represents a demonstration trajectory. The bandit explores the demonstration dataset $\mathcal{P}_i$ for the current task $i$ to find the best prompt $\rho^* = (\tilde{\tau}_1^*, \dots, \tilde{\tau}_J^*)$. The online return $G_k$ achieved by the underlying PDT model at round $k$ and using prompt $\rho_k$ serves as a reward for the bandit.
  • Figure 2: Multiple tasks.
  • Figure 3: Trajectories.
  • Figure 5: (a) Visualization of prompt selection across tasks in the Sparse 2D Point environment. (b) Inference-time performance showing the benefit of prompt tuning over a single-task baseline.
  • Figure 6: Performance of the structured (ours) and standard MAB methods on a synthetic prompt tuning task. Problem instances are generated by sweeping over $J \in \{2, 3, 4, 5\}$ segments and $H \in \{3, 5, 7, 10\}$ choices, with problem size reported as $H^J$. Shaded regions indicate one standard deviation around the mean; results are averaged over three random seeds.
  • ...and 6 more figures

Theorems & Definitions (4)

  • Theorem 4.1
  • Corollary 4.2
  • Theorem
  • proof