Decision Mamba: Reinforcement Learning via Hybrid Selective Sequence Modeling

Sili Huang, Jifeng Hu, Zhejian Yang, Liwei Yang, Tao Luo, Hechang Chen, Lichao Sun, Bo Yang

TL;DR

This work tackles the efficiency bottleneck of long-horizon RL with transformers by introducing Decision Mamba-Hybrid (DM-H), a hybrid architecture that uses Mamba's linear-scaling state-space modeling to generate long-term sub-goals and a transformer for high-quality action prediction conditioned on these sub-goals. DM-H bridges long-term memory and precise short-term decisions by reconstructing short-term sequences around valuable sub-goals, guided by cross-episode contexts sorted by return. Empirical results on Grid World, Tmaze, and D4RL show that DM-H achieves state-of-the-art performance on long- and short-horizon tasks while delivering a substantial speedup in online testing (about $28\times$ faster than transformer baselines). The approach demonstrates robust self-improvement in online settings without gradient updates during deployment, highlighting a scalable path for in-context RL with long-term memory.

Abstract

Recent works have shown the remarkable superiority of transformer models in reinforcement learning (RL), where the decision-making problem is formulated as sequence generation. Given task contexts such as multiple trajectories, transformer-based agents can exhibit self-improvement in online environments, a capability called in-context RL. However, due to the quadratic computational complexity of attention in transformers, current in-context RL methods incur huge computational costs as the task horizon increases. In contrast, the Mamba model is known for processing long-term dependencies efficiently, which gives in-context RL an opportunity to solve tasks that require long-term memory. To this end, we first implement Decision Mamba (DM) by replacing the backbone of Decision Transformer (DT). Then, we propose Decision Mamba-Hybrid (DM-H), which combines the merits of transformers and Mamba in high-quality prediction and long-term memory. Specifically, DM-H first generates high-value sub-goals from long-term memory through the Mamba model. Then, we use the sub-goals to prompt the transformer, which yields high-quality predictions. Experimental results demonstrate that DM-H achieves state-of-the-art performance on long- and short-term tasks across the D4RL, Grid World, and Tmaze benchmarks. Regarding efficiency, online testing of DM-H on the long-term task is $28\times$ faster than the transformer-based baselines.
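
To make the scaling argument concrete, the following minimal sketch compares how the amount of work grows with context length for full self-attention (one score per query-key pair) and for a linear-time recurrent scan (one state update per token). The function names are hypothetical, and constant factors, model width, and hardware effects are deliberately ignored, so the printed ratios only illustrate asymptotic growth. They are not a derivation of the measured $28\times$ speedup.

```python
def attention_pair_count(n: int) -> int:
    """Query-key pairs scored by full self-attention over n tokens."""
    return n * n


def scan_step_count(n: int) -> int:
    """State updates performed by a linear-time recurrent scan over n tokens."""
    return n


def growth_ratio(n: int) -> float:
    """How many times more pairwise scores than scan steps at length n."""
    return attention_pair_count(n) / scan_step_count(n)


if __name__ == "__main__":
    # Context lengths chosen only for illustration.
    for n in (1_000, 10_000, 20_000):
        print(n, attention_pair_count(n), scan_step_count(n), growth_ratio(n))
```

Doubling the context length doubles the scan work but quadruples the attention work, which is why the gap widens as the task horizon grows.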

Paper Structure (18 sections, 7 equations, 6 figures, 4 tables, 1 algorithm)

Figures (6)

  • Figure 1: The architecture of DM-H. During offline training, the Mamba module generates sub-goals from long-term experience, which consists of multiple historical trajectories arranged in ascending order of total reward. Conditioned on the generated sub-goals, the transformer learns to predict better actions under supervision from expert behaviors. Meanwhile, a linear layer feeds the valuable sub-goals into the transformer module and associates them with the generated actions. During online testing, DM-H can automatically improve its performance in a trial-and-error manner without requiring gradient updates.
  • Figure 2: Results for Grid World. An agent is expected to solve a new task by interacting with the environment for 20 episodes without online model updates. Our DM-H significantly outperforms baselines on long-term tasks with sparse rewards because it inherits the merits of transformers and Mamba in high-quality prediction and long-term memory.
  • Figure 3: Results for (a) performance and (b) online testing time on Tmaze tasks. We train each method on Tmaze tasks of increasing horizon until GPU memory runs out, which happens at a context length of 10k (DT, DM) or 20k (our DM-H). We report the online testing time for 20 episodes of Tmaze tasks.
  • Figure 4: (a) The ablation study on DM-H with or without valuable sub-goals. (b) The parameter sensitivity analysis of $c$.
  • Figure 5: Results for offline training time on Tmaze tasks. We train each method on Tmaze tasks of increasing horizon until GPU memory runs out, which happens at a context length of 10k (DT, DM) or 20k (DM-H). We report the training time for 10k gradient updates on Tmaze tasks.
  • ...and 1 more figure
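
The Figure 1 caption describes the long-term experience as multiple historical trajectories arranged in ascending order of total reward. A minimal sketch of that ordering step might look like the following. The `Trajectory` class, the `(state, action, reward)` step layout, and the `build_long_term_context` helper are hypothetical names introduced for illustration only, and the paper's actual tokenization and data pipeline may differ.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical step layout: (state, action, reward).
Step = Tuple[int, int, float]


@dataclass
class Trajectory:
    steps: List[Step]

    @property
    def total_reward(self) -> float:
        """Sum of the rewards collected over the whole trajectory."""
        return sum(reward for _, _, reward in self.steps)


def build_long_term_context(trajectories: List[Trajectory]) -> List[Step]:
    """Concatenate trajectories from lowest to highest total reward."""
    ordered = sorted(trajectories, key=lambda t: t.total_reward)
    context: List[Step] = []
    for trajectory in ordered:
        context.extend(trajectory.steps)
    return context


# Toy data, not taken from the paper.
episodes = [
    Trajectory([(0, 1, 0.0), (1, 0, 1.0)]),  # total reward 1.0
    Trajectory([(0, 0, 0.0), (2, 1, 0.0)]),  # total reward 0.0
    Trajectory([(0, 1, 1.0), (3, 1, 2.0)]),  # total reward 3.0
]
context = build_long_term_context(episodes)
```

With this ordering, the best-performing trajectory sits at the end of the context, closest to the position where the next prediction is made.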