In-Context Decision Transformer: Reinforcement Learning via Hierarchical Chain-of-Thought
Sili Huang, Jifeng Hu, Hechang Chen, Lichao Sun, Bo Yang
TL;DR
This work tackles the computational challenge of in-context reinforcement learning over long horizons by introducing the In-context Decision Transformer (IDT), a hierarchical, three-module transformer that operates on high-level decisions rather than raw actions. By constructing a hierarchical chain of experience and generating high-level decisions every c steps to drive multi-step low-level actions, IDT significantly reduces sequence length and self-attention costs while enabling self-improvement at test time without gradient updates. Empirically, IDT achieves state-of-the-art or competitive performance on Grid World and D4RL benchmarks, with substantial online evaluation speedups (e.g., ~36x in D4RL and ~27x in large Grid World). The approach also demonstrates robust self-improvement and can leverage external demonstrations via a Reviewing Decisions module to further accelerate learning. Overall, IDT offers a scalable, efficient path to long-horizon in-context RL by integrating hierarchical decision-making into the sequence modeling framework.
Abstract
In-context learning is a promising way for offline reinforcement learning (RL) to handle online tasks, and it can be achieved by providing task prompts. Recent work has shown that in-context RL can emerge with trial-and-error self-improvement when RL tasks are treated as a cross-episode sequential prediction problem. Although this self-improvement requires no gradient updates, current methods still incur high computational costs because the cross-episode sequence grows with the task horizon. To address this, we propose the In-context Decision Transformer (IDT), which achieves self-improvement through high-level trial and error. Specifically, IDT is inspired by the efficient hierarchical structure of human decision-making and therefore reconstructs the sequence from high-level decisions rather than the low-level actions that interact with the environment. Because one high-level decision can guide multiple low-level action steps, IDT naturally avoids excessively long sequences and solves online tasks more efficiently. Experimental results show that IDT achieves state-of-the-art performance on long-horizon tasks compared with current in-context RL methods. In particular, online evaluation with IDT is 36× faster than baselines on the D4RL benchmark and 27× faster on the Grid World benchmark.
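The sequence-length argument can be illustrated with a little arithmetic. The sketch below is not the authors' implementation: the function names, the token counts per step and per decision, and the example horizon, episode count, and c are all hypothetical values chosen for illustration. It compares a context that holds one token group per environment step with one that holds one token group per high-level decision, and estimates the change in pairwise self-attention scores, which grow quadratically with sequence length.

```python
def flat_sequence_length(horizon, episodes, tokens_per_step=3):
    """Tokens in a context with one token group per environment step.

    tokens_per_step=3 is an assumed layout, not a value from the paper.
    """
    return horizon * episodes * tokens_per_step


def hierarchical_sequence_length(horizon, episodes, c, tokens_per_decision=3):
    """Tokens when one high-level decision covers c environment steps.

    tokens_per_decision=3 is likewise an assumption for illustration.
    """
    decisions_per_episode = -(-horizon // c)  # ceiling division
    return decisions_per_episode * episodes * tokens_per_decision


def attention_cost_ratio(flat_len, hier_len):
    """Ratio of pairwise attention scores (quadratic in sequence length)."""
    return (flat_len ** 2) / (hier_len ** 2)


# Hypothetical setting: 4 episodes of 1000 steps, one decision every 10 steps.
flat = flat_sequence_length(horizon=1000, episodes=4)
hier = hierarchical_sequence_length(horizon=1000, episodes=4, c=10)
print(flat, hier, attention_cost_ratio(flat, hier))  # prints: 12000 1200 100.0
```

This counts only the high-level context. It ignores the cost of generating the low-level actions that each decision drives, so the ratio is an idealized upper bound on the attention savings and should not be read as a reproduction of the reported 36× and 27× speedups.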
