Is Mamba Compatible with Trajectory Optimization in Offline Reinforcement Learning?

Yang Dai; Oubo Ma; Longfei Zhang; Xingxing Liang; Shengchao Hu; Mengzhu Wang; Shouling Ji; Jincai Huang; Li Shen

TL;DR

This work introduces a Transformer-like Decision Mamba (DeMa) that is compatible with trajectory optimization and surpasses previous methods: it outperforms Decision Transformer (DT) while using 30% fewer parameters in Atari, and exceeds DT with only a quarter of the parameters in MuJoCo.

Abstract

Transformer-based trajectory optimization methods have demonstrated exceptional performance in offline Reinforcement Learning (offline RL). Yet these methods pose challenges due to their substantial parameter size and limited scalability, which is particularly critical in sequential decision-making scenarios where resources are constrained, such as robots and drones with limited computational power. Mamba, a promising new linear-time sequence model, offers performance on par with transformers while requiring substantially fewer parameters on long sequences. As it remains unclear whether Mamba is compatible with trajectory optimization, this work conducts comprehensive experiments to explore the potential of Decision Mamba (dubbed DeMa) in offline RL from the aspects of data structures and essential components, with the following insights: (1) Long sequences impose a significant computational burden without improving performance, since DeMa's focus on past tokens diminishes approximately exponentially. Consequently, we introduce a Transformer-like DeMa as opposed to an RNN-like DeMa. (2) Among the components of DeMa, we identify the hidden attention mechanism as a critical factor in its success; it also works well with other residual structures and does not require position embedding. Extensive evaluations demonstrate that our specially designed DeMa is compatible with trajectory optimization and surpasses previous methods, outperforming Decision Transformer (DT) while using 30% fewer parameters in Atari, and exceeding DT with only a quarter of the parameters in MuJoCo.
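The claim in insight (1), that the model's focus on earlier tokens fades roughly exponentially, can be illustrated with a toy scalar state space recurrence. This is a minimal sketch and not the paper's formulation: Mamba uses matrix-valued, discretized, input-dependent parameters, whereas the scalar coefficients `a`, `b`, `c` and the function names below are assumptions made for illustration only. Unrolling the recurrence gives an implicit weight on every past input, and when each `a_t` lies in (0, 1) those weights shrink geometrically with distance.

```python
def ssm_recurrence(a, b, c, x):
    """Run the scalar recurrence h_t = a_t*h_{t-1} + b_t*x_t, y_t = c_t*h_t."""
    h, ys = 0.0, []
    for a_t, b_t, c_t, x_t in zip(a, b, c, x):
        h = a_t * h + b_t * x_t
        ys.append(c_t * h)
    return ys


def hidden_attention(a, b, c):
    """Unroll the recurrence into weights w[t][j] with y_t = sum_j w[t][j]*x_j.

    w[t][j] = c_t * (a_{j+1} * ... * a_t) * b_j for j <= t, and 0 for j > t.
    """
    n = len(a)
    w = [[0.0] * n for _ in range(n)]
    for t in range(n):
        decay = 1.0  # running product of a_k for k in (j, t]
        for j in range(t, -1, -1):
            w[t][j] = c[t] * decay * b[j]
            decay *= a[j]
    return w


if __name__ == "__main__":
    n = 8
    # Hypothetical constant coefficients; a real selective SSM computes them from the input.
    weights = hidden_attention([0.5] * n, [1.0] * n, [1.0] * n)
    # Weights that the last step assigns to each earlier token, oldest first.
    print(weights[-1])
```

With a constant decay of 0.5, each step further into the past halves the weight, so tokens far from the current decision contribute almost nothing. That is consistent with the abstract's argument that long input sequences add cost without adding much usable context.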

Paper Structure (43 sections, 6 equations, 9 figures, 15 tables)

Introduction
Related Work
Offline RL.
Sequence Modeling in Offline RL.
Preliminaries
Offline RL with Trajectory Optimization
State Space Model and Mamba
Hidden Attention in Mamba
The Analysis of DeMa
Input Data Structures
How does sequence length affect the computational load?
How does sequence length affect the performance of DeMa?
Why does DeMa require merely short input sequences?
Which type of concatenation is suitable for DeMa?
The Essential Components of DeMa
...and 28 more sections

Figures (9)

Figure 1: Variant designs of DeMa in trajectory optimization. In the left portion, (I) represents the RNN-like DeMa (B3LD), which requires hidden state inputs at each decision step; (II) indicates the Transformer-like DeMa (B3LD); and (III) refers to the Transformer-like DeMa (BL3D). The right portion illustrates that both types of DeMa can incorporate two distinct residual structures, i.e., the post up-projection residual block and the pre up-projection residual block.
Figure 2: The impact of sequence length on single-step forward computation time, single-step training time, and GPU memory usage. The sequence length of RNN-like DeMa is 1000.
Figure 3: Comparison of Transformer-like DeMa's performance on Atari and MuJoCo tasks. We report mean values averaged over 3 seeds; shaded areas represent deviations.
Figure 4: Hidden attention scores of DeMa from the 300th to the 600th timestep in Hopper-medium-replay. The X-axis represents timesteps from 300 to 600, the Y-axis represents the past $K$ tokens, and the Z-axis indicates the attention scores given to the $K$ tokens at the time of the current decision. More can be seen in the appendix.
Figure 5: Normalized return after swapping the hidden attention of a single layer from another DeMa at a time. The black dashed line represents the evaluation results of the original model. "1", "2", and "3" represent the index of swap layers respectively, and "all" represents the result after swapping all parameters of the hidden attention. It can be seen that swapping the hidden attention has a significant impact on the results.
...and 4 more figures
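
The B3LD and BL3D labels in Figure 1 appear to name two ways of arranging return-to-go, state, and action embeddings before they enter the model. The sketch below assumes that B3LD means a tensor of shape (batch, 3L, D), with the three embeddings of each step placed one after another along the time axis, and that BL3D means shape (batch, L, 3D), with the three embeddings joined along the feature axis. That reading, the function names, and the omission of the batch axis are assumptions for illustration, not details confirmed by the text above.

```python
def temporal_concat(rtg, state, action):
    """Interleave per-step embeddings: L steps of D dims become 3L tokens of D dims."""
    tokens = []
    for r, s, a in zip(rtg, state, action):
        tokens.extend([r, s, a])
    return tokens


def embedding_concat(rtg, state, action):
    """Join per-step embeddings along the feature axis: L tokens of 3D dims."""
    return [r + s + a for r, s, a in zip(rtg, state, action)]


if __name__ == "__main__":
    # Hypothetical embeddings for L = 2 steps with D = 2 features each.
    rtg = [[1.0, 1.0], [2.0, 2.0]]
    state = [[3.0, 3.0], [4.0, 4.0]]
    action = [[5.0, 5.0], [6.0, 6.0]]

    seq_time = temporal_concat(rtg, state, action)
    seq_feat = embedding_concat(rtg, state, action)
    print(len(seq_time), len(seq_time[0]))  # sequence length 3L, width D
    print(len(seq_feat), len(seq_feat[0]))  # sequence length L, width 3D
```

Under this reading, the feature-axis layout shortens the sequence by a factor of three at the cost of a wider token, which is one way the input structure can change the computational load for a fixed number of decision steps.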
