Prompt Tuning Decision Transformers with Structured and Scalable Bandits
Finn Rietz, Oleg Smirnov, Sara Karimi, Lele Cao
TL;DR
This work tackles offline multi-task reinforcement learning with the Prompting Decision Transformer (PDT), addressing the inefficiency of sampling trajectory prompts uniformly. It introduces a bandit-based prompt-tuning framework that constructs optimized trajectory prompts at inference time, using a structured bandit architecture with one arm per prompt segment and the pre-trained PDT as a fixed feature extractor to keep the bandit's inputs compact. The authors prove a regret bound for the structured bandit, Regret(K) ≤ (1/J) Σ_j Regret_j(K) + 2Kε, and show that regret grows sublinearly under standard assumptions. Empirically, the method improves performance on MuJoCo, Meta-World, and out-of-distribution (OOD) scenarios. The approach offers a scalable, data-efficient way to adapt PDT to new tasks without fine-tuning the backbone, improving robustness and generalization in offline settings with high-dimensional observations.
Abstract
Prompt tuning has emerged as a key technique for adapting large pre-trained Decision Transformers (DTs) in offline Reinforcement Learning (RL), particularly in multi-task and few-shot settings. The Prompting Decision Transformer (PDT) enables task generalization via trajectory prompts that are sampled uniformly from expert demonstrations, without accounting for how informative each prompt is. In this work, we propose a bandit-based prompt-tuning method that learns to construct optimal trajectory prompts from demonstration data at inference time. We devise a structured bandit architecture that operates in the trajectory prompt space and scales linearly rather than combinatorially with prompt size. Additionally, we show that the pre-trained PDT itself can serve as a powerful feature extractor for the bandit, enabling efficient reward modeling across various environments. We establish regret bounds theoretically and demonstrate empirically that our method consistently improves performance across a wide range of tasks, high-dimensional environments, and out-of-distribution scenarios, outperforming existing prompt-tuning baselines.
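The linear-versus-combinatorial scaling claim can be illustrated with a minimal sketch. It is not the authors' implementation: it swaps in a simple tabular epsilon-greedy learner where the paper describes a reward model over PDT features, and every name in it (`SegmentBandit`, `StructuredPromptBandit`, `toy_return`) is hypothetical. The sketch assumes a prompt has a fixed number of slots, each filled from the same pool of candidate segments, and that every slot's learner receives the same scalar return for the assembled prompt.

```python
import random


class SegmentBandit:
    """Epsilon-greedy learner over candidate segments for ONE prompt slot."""

    def __init__(self, n_candidates, epsilon=0.1, rng=None):
        self.epsilon = epsilon
        self.rng = rng if rng is not None else random.Random()
        self.counts = [0] * n_candidates    # pulls per candidate
        self.values = [0.0] * n_candidates  # running mean return per candidate

    def select(self):
        # Explore with probability epsilon, otherwise pick the best estimate
        # (ties resolve to the lowest index).
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.values))
        return max(range(len(self.values)), key=self.values.__getitem__)

    def update(self, arm, reward):
        # Incremental update of the running mean for the chosen candidate.
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


class StructuredPromptBandit:
    """One SegmentBandit per prompt slot, all trained on a shared return."""

    def __init__(self, n_slots, n_candidates, epsilon=0.1, seed=0):
        rng = random.Random(seed)
        self.bandits = [
            SegmentBandit(n_candidates, epsilon, rng) for _ in range(n_slots)
        ]

    def select_prompt(self):
        # A prompt is a list of candidate indices, one per slot.
        return [b.select() for b in self.bandits]

    def update(self, prompt, reward):
        for bandit, arm in zip(self.bandits, prompt):
            bandit.update(arm, reward)

    def num_estimates(self):
        # Grows as n_slots * n_candidates, not n_candidates ** n_slots.
        return sum(len(b.values) for b in self.bandits)


def toy_return(prompt):
    # Hypothetical stand-in for rolling out the frozen PDT with this prompt:
    # candidate 2 is arbitrarily treated as the informative segment.
    return float(sum(1 for arm in prompt if arm == 2))


if __name__ == "__main__":
    bandit = StructuredPromptBandit(n_slots=3, n_candidates=5, epsilon=0.2)
    for _ in range(500):
        prompt = bandit.select_prompt()
        bandit.update(prompt, toy_return(prompt))
    print(bandit.num_estimates(), "estimates vs", 5 ** 3, "joint prompts")
```

With 3 slots and 5 candidates the structured learner keeps 15 value estimates, whereas a flat bandit over whole prompts would need one arm for each of the 125 combinations. In a real setting, `toy_return` would be replaced by the return obtained from the frozen PDT conditioned on the assembled prompt.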
