In-Context Decision Transformer: Reinforcement Learning via Hierarchical Chain-of-Thought

Sili Huang, Jifeng Hu, Hechang Chen, Lichao Sun, Bo Yang

TL;DR

This work tackles the computational challenge of in-context reinforcement learning over long horizons by introducing the In-context Decision Transformer (IDT), a hierarchical, three-module transformer that operates on high-level decisions rather than raw actions. By constructing a hierarchical chain of experience and generating high-level decisions every c steps to drive multi-step low-level actions, IDT significantly reduces sequence length and self-attention costs while enabling self-improvement at test time without gradient updates. Empirically, IDT achieves state-of-the-art or competitive performance on Grid World and D4RL benchmarks, with substantial online evaluation speedups (e.g., ~36x in D4RL and ~27x in large Grid World). The approach also demonstrates robust self-improvement and can leverage external demonstrations via a Reviewing Decisions module to further accelerate learning. Overall, IDT offers a scalable, efficient path to long-horizon in-context RL by integrating hierarchical decision-making into the sequence modeling framework.

Abstract

In-context learning is a promising approach for offline reinforcement learning (RL) to handle online tasks, which can be achieved by providing task prompts. Recent works demonstrated that in-context RL could emerge with self-improvement in a trial-and-error manner when treating RL tasks as an across-episodic sequential prediction problem. Although this self-improvement requires no gradient updates, current works still suffer from high computational costs because the across-episodic sequence grows with the task horizon. To this end, we propose an In-context Decision Transformer (IDT) to achieve self-improvement in a high-level trial-and-error manner. Specifically, IDT is inspired by the efficient hierarchical structure of human decision-making and thus reconstructs the sequence to consist of high-level decisions instead of the low-level actions that interact with environments. As one high-level decision can guide multi-step low-level actions, IDT naturally avoids excessively long sequences and solves online tasks more efficiently. Experimental results show that IDT achieves state-of-the-art performance on long-horizon tasks compared with current in-context RL methods. In particular, the online evaluation of IDT is 36× faster than baselines on the D4RL benchmark and 27× faster on the Grid World benchmark.
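
The hierarchy described in the abstract can be sketched as a control loop: one high-level decision is made, it drives up to c low-level actions, and only the reviewed decision is written back to the context. The sketch below is a toy illustration, not the paper's implementation. The three functions are hypothetical stand-ins for the Making Decisions, Decisions to Go, and Reviewing Decisions modules; the 1-D "walk right" environment, the integer decisions, and the hand-written rules are all assumptions chosen so the example runs without a trained model.

```python
from typing import List, Tuple


def make_decision(context: List[int], state: int) -> int:
    """Hypothetical stand-in for the Making Decisions module.

    A real model would attend over the across-episodic context; this toy
    rule ignores it and simply heads right until position 10.
    """
    return 1 if state < 10 else 0


def decision_to_action(decision: int, state: int) -> int:
    """Hypothetical stand-in for the Decisions to Go module.

    Maps the current high-level decision to one low-level action.
    """
    return 1 if decision == 1 else 0


def review_decision(actions: List[int]) -> int:
    """Hypothetical stand-in for the Reviewing Decisions module.

    Re-encodes the executed actions as a high-level decision.
    """
    return 1 if sum(actions) > 0 else 0


def run_episode(horizon: int, c: int) -> Tuple[List[int], int]:
    """Run one toy episode; return (high-level context, final state)."""
    state = 0
    context: List[int] = []  # stores high-level decisions only
    for start in range(0, horizon, c):
        decision = make_decision(context, state)
        actions: List[int] = []
        # One decision drives up to c low-level actions.
        for _ in range(min(c, horizon - start)):
            action = decision_to_action(decision, state)
            state += action  # toy dynamics: the action is the displacement
            actions.append(action)
        # Only the reviewed decision enters the context.
        context.append(review_decision(actions))
    return context, state


if __name__ == "__main__":
    context, final_state = run_episode(horizon=20, c=5)
    print(len(context), final_state)
```

With `horizon=20` and `c=5`, the loop takes 20 low-level actions but appends only 4 entries to the context, which is the sequence-shortening effect the abstract attributes to high-level decisions.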

Paper Structure

This paper contains 16 sections, 5 equations, 7 figures, 3 tables, 1 algorithm.

Figures (7)

  • Figure 1: Trial-and-error comparison of minimal actions and high-level decisions, where * denotes better results. (a) In the trial-and-error process, the memory consists of the smallest actions from experiences and serves as context to search for a better action. (b) In the high-level trial-and-error process, the memory and search act on high-level decisions. Since one high-level decision controls multiple actions, a smaller memory suffices to preserve experiences, and better decisions can be found at lower computational cost.
  • Figure 2: The architecture of IDT comprises three modules that simulate the high-level trial-and-error process. First, the (1) Making Decisions module predicts a high-level decision given across-episodic contexts, which contain multiple trajectories arranged in ascending order of total reward. Then, the (2) Decisions to Go module predicts actions for $c$ steps conditioned on the predicted high-level decision. Finally, the (3) Reviewing Decisions module reviews the executed actions, which serve as experience for the next cycle. Note that the Reviewing Decisions module encodes the true high-level decision labels from offline data during training, whereas at test time it encodes them from the executed actions.
  • Figure 3: Results for (a) testing and (b) training times. We report the training time per 10k gradient updates and the testing time for 50 episodes on Grid World and 10 episodes on D4RL. Note that context size is measured in steps here; the number of tokens per step may vary depending on the algorithm. Each step in AD contains 4 tokens: observation, action, reward, and completion. IDT's Making Decisions module and AT have an extra return-to-go token. As the task length increases, the context length is forced to grow exponentially, resulting in a quadratic increase in computational costs. In contrast, IDT reconstructs the sequence to consist of high-level decisions. Therefore, the context is smaller than one episode length, significantly reducing computational costs.
  • Figure 4: Results for Grid World. An agent is expected to solve a new task by interacting with the environments for 50 episodes without online model updates. Based on high-level decisions, our method outperforms both AT and AD, which rely on across-episodic contexts with the smallest actions. In particular, IDT has significant advantages in handling long-horizon tasks.
  • Figure 5: Results for IDT conditioned on partial demonstrations. IDT can accelerate self-improvement by using the Reviewing Decisions module to encode external data prompts.
  • ...and 2 more figures
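
The token counts in the Figure 3 caption can be turned into a rough back-of-the-envelope comparison. The sketch below is an illustration under stated assumptions, not a reproduction of the paper's measurements. It assumes the context holds a fixed number of full episodes, that IDT's Making Decisions module spends 5 tokens per high-level step (the 4 AD tokens plus return-to-go, as the caption is read here), and that self-attention cost scales with the square of the token count. The function names and the example values of horizon, episode count, and c are hypothetical.

```python
import math

TOKENS_PER_STEP_AD = 4   # observation, action, reward, completion (Figure 3)
TOKENS_PER_STEP_RTG = 5  # assumed: the same 4 tokens plus return-to-go


def low_level_tokens(horizon: int, episodes: int,
                     tokens_per_step: int = TOKENS_PER_STEP_AD) -> int:
    """Context length in tokens when every low-level step is stored."""
    return horizon * episodes * tokens_per_step


def high_level_tokens(horizon: int, episodes: int, c: int,
                      tokens_per_step: int = TOKENS_PER_STEP_RTG) -> int:
    """Context length in tokens when one decision is stored every c steps."""
    return math.ceil(horizon / c) * episodes * tokens_per_step


def attention_cost_ratio(n_low: int, n_high: int) -> float:
    """Ratio of pairwise attention interactions, assuming quadratic cost."""
    return (n_low / n_high) ** 2


if __name__ == "__main__":
    # Hypothetical setting: 1000-step episodes, 4 episodes in context, c = 20.
    n_low = low_level_tokens(horizon=1000, episodes=4)
    n_high = high_level_tokens(horizon=1000, episodes=4, c=20)
    print(n_low, n_high, attention_cost_ratio(n_low, n_high))
```

In this hypothetical setting the low-level context is 16,000 tokens and the high-level context is 1,000 tokens, a 16× reduction in length and a 256× reduction in pairwise attention interactions. This counts only the Making Decisions context and ignores the cost of the other two modules, so it should not be read as a prediction of the measured 36× and 27× evaluation speedups.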