Mamba as Decision Maker: Exploring Multi-scale Sequence Modeling in Offline Reinforcement Learning

Jiahang Cao, Qiang Zhang, Ziqing Wang, Jingkai Sun, Jiaxu Wang, Hao Cheng, Yecheng Shao, Wen Zhao, Gang Han, Yijie Guo, Renjing Xu

TL;DR

The paper tackles offline reinforcement learning by addressing the multi-scale structure of RL trajectories. It introduces MambaDM, a decision maker built on the Global-local Fusion Mamba (GLoMa) mixer, which jointly models local sub-sequences and global context under return-to-go conditioning. MambaDM achieves state-of-the-art performance on Atari and OpenAI Gym datasets. A scaling analysis finds that increasing model size does not improve performance, whereas doubling the dataset size yields up to a 33.7% score improvement on Atari. An eigenvalue analysis of the Mamba core matrices gives insight into how the global and local branches balance long-range and short-range dependencies. Overall, the work presents efficient sequence modeling for offline RL and points to data collection as a key factor for performance gains.
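The return-to-go conditioning mentioned above can be illustrated with a minimal sketch. It assumes the undiscounted convention used by Decision Transformer, where the return-to-go at step t is the sum of rewards from t to the end of the trajectory, and it assumes tokens are interleaved as (return-to-go, state, action). The function names `returns_to_go` and `interleave` are hypothetical and are not taken from the paper's code.

```python
from typing import Any, List, Sequence, Tuple


def returns_to_go(rewards: Sequence[float]) -> List[float]:
    """Suffix sums of rewards: rtg[t] = r[t] + r[t+1] + ... + r[T-1]."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the sum of all later rewards.
    for t in range(len(rewards) - 1, -1, -1):
        running += rewards[t]
        rtg[t] = running
    return rtg


def interleave(
    rtg: Sequence[float], states: Sequence[Any], actions: Sequence[Any]
) -> List[Tuple[str, Any]]:
    """Build the (rtg, state, action) token order assumed in this sketch."""
    tokens: List[Tuple[str, Any]] = []
    for g, s, a in zip(rtg, states, actions):
        tokens.extend([("rtg", g), ("state", s), ("action", a)])
    return tokens


rewards = [1.0, 0.0, 2.0]
rtg = returns_to_go(rewards)
print(rtg)  # [3.0, 2.0, 2.0]
print(interleave(rtg, ["s0", "s1", "s2"], ["a0", "a1", "a2"])[:3])
```

At evaluation time, a sequence model conditioned this way is typically given a target return as the first return-to-go token, and the target is reduced by each observed reward as the episode proceeds.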

Abstract

Sequential modeling has demonstrated remarkable capabilities in offline reinforcement learning (RL), with Decision Transformer (DT) being one of the most notable and successful representatives. However, RL trajectories possess unique properties that distinguish them from conventional sequences (e.g., text or audio): (1) local correlation, where the next state is theoretically determined solely by the current state and action under the Markov Decision Process (MDP), and (2) global correlation, where each step's features are related to long-term historical information due to the time-continuous nature of trajectories. In this paper, we propose a novel action sequence predictor, named Mamba Decision Maker (MambaDM), where Mamba is expected to be a promising alternative for sequence modeling paradigms, owing to its efficient modeling of multi-scale dependencies. In particular, we introduce a novel mixer module that extracts and integrates both global and local features of the input sequence, effectively capturing interrelationships in RL datasets. Extensive experiments demonstrate that MambaDM achieves state-of-the-art performance on Atari and OpenAI Gym datasets. Furthermore, we empirically investigate the scaling laws of MambaDM, finding that increasing model size does not bring performance improvement, whereas doubling the dataset size yields up to a 33.7% score improvement on the Atari dataset. This paper delves into the sequence modeling capabilities of MambaDM in the RL domain, paving the way for future advancements in robust and efficient decision-making systems.

Paper Structure (19 sections, 7 equations, 4 figures, 7 tables)


Figures (4)

  • Figure 1: Overview of our proposed MambaDM. The input RL trajectory is first processed by the embedding layers, and these embeddings are then passed through both local and global branches to extract multi-scale features. Subsequently, the combined information from these branches is fed into a feed-forward network (FFN). After passing through $N$ layers, the final action sequence is obtained by the action predictor.
  • Figure 2: Visualization of the impact of scaling factors on MambaDM's performance in reinforcement learning tasks, where red and blue denote different RL domains. We find that MambaDM does not demonstrate a definitive scaling law when scaling the model size, but increasing the dataset size can significantly improve the model's performance.
  • Figure 3: Visualization of RL benchmarks, including Atari Games (a)-(c) and D4RL domains (d)-(f).
  • Figure 4: Visualization of the eigenvalues in the core matrices $\bm{A}$ of our proposed GLoMa, including eigenvalues in (a) global Mamba and (b) local Mamba matrices.
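
The global and local branches described in the Figure 1 caption can be sketched with a toy example. This is not the authors' implementation: it replaces Mamba's selective state-space layer with a scalar linear recurrence h_t = a·h_{t-1} + b·x_t, y_t = c·h_t, assumes the local branch processes fixed-size sub-sequences with the state reset at each window boundary, and assumes the two branches are fused by addition. The names `ssm_scan`, `local_scan`, and `global_local_fuse` are hypothetical.

```python
from typing import List, Sequence


def ssm_scan(x: Sequence[float], a: float, b: float = 1.0, c: float = 1.0) -> List[float]:
    """Scalar linear recurrence: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t."""
    h = 0.0
    y: List[float] = []
    for x_t in x:
        h = a * h + b * x_t
        y.append(c * h)
    return y


def local_scan(x: Sequence[float], window: int, a: float) -> List[float]:
    """Run the recurrence per sub-sequence, resetting the state each window."""
    y: List[float] = []
    for start in range(0, len(x), window):
        y.extend(ssm_scan(x[start:start + window], a))
    return y


def global_local_fuse(
    x: Sequence[float], window: int, a_global: float, a_local: float
) -> List[float]:
    """Fuse the two branches by element-wise sum (an assumption of this sketch)."""
    g = ssm_scan(x, a_global)           # sees the whole trajectory
    l = local_scan(x, window, a_local)  # sees only the current window
    return [gi + li for gi, li in zip(g, l)]


# A unit impulse at t=0 shows how far each branch carries information.
x = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
print(ssm_scan(x, 0.5))                 # [1.0, 0.5, 0.25, 0.125, 0.0625, 0.03125]
print(local_scan(x, 3, 0.5))            # [1.0, 0.5, 0.25, 0.0, 0.0, 0.0]
print(global_local_fuse(x, 3, 0.5, 0.5))
```

In this scalar sketch, `a` plays the role of the single eigenvalue of the discretized state matrix: a magnitude close to 1 lets the state retain past inputs for many steps, while a smaller magnitude makes them decay quickly. That is the general intuition behind reading the eigenvalues of $\bm{A}$ as an indicator of how much history each branch keeps.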