Decision Mamba Architectures

André Correia; Luís A. Alexandre

TL;DR

This work introduces two novel methods, Decision Mamba (DM) and Hierarchical Decision Mamba (HDM), which aim to improve on the performance of their Transformer-based counterparts, and shows that the Mamba models outperform those counterparts in a majority of tasks.

Abstract

Recent advancements in imitation learning have been largely fueled by the integration of sequence models, which provide a structured flow of information to effectively mimic task behaviours. The Decision Transformer (DT) and, subsequently, the Hierarchical Decision Transformer (HDT) introduced Transformer-based approaches to learning task policies. More recently, the Mamba architecture has been shown to outperform Transformers across various task domains. In this work, we introduce two novel methods, Decision Mamba (DM) and Hierarchical Decision Mamba (HDM), which aim to improve on the performance of these Transformer-based models. Through extensive experimentation across diverse environments such as OpenAI Gym and D4RL, using varying demonstration data sets, we show that the Mamba models outperform their Transformer counterparts in a majority of tasks. Results show that DM outperforms the other methods in most settings. The code can be found at https://github.com/meowatthemoon/DecisionMamba.

Paper Structure (14 sections, 5 equations, 5 figures, 2 tables)

Introduction
Related Work
Preliminaries
Reinforcement Learning
Offline RL
Decision Transformer
Hierarchical Decision Transformer
Structured State Space Sequence Models
Methodology
Decision Mamba
Hierarchical Decision Mamba
Experiments
Time Comparison
Conclusion

Figures (5)

Figure 1: The DM architecture on the left and the HDM architecture on the right side. The DM is conditioned on the sequence of past states and actions to predict the correct action. The HDM is composed of two modules. The high-level mechanism guides the low-level controller through the task by selecting sub-goal states, based on the history of sub-goals and states. The low-level controller is conditioned on the history of past states, sub-goals, and actions to select the appropriate action.
Figure 2: Comparison of the performance of the HDM under varying architecture configurations across the 7 D4RL tasks, for different demonstration data sets. The scale of the bar graphs is the maximum reward present in the respective data set. L is the number of layers, D is the embedding size, and K is the context length.
Figure 3: Comparison of the 5 methods across the 7 D4RL tasks, for different demonstration data sets. The scale of the bar graphs is the maximum reward present in the respective data set. All models have 6 layers, an embedding size of 128, and a context length of 20. The values of DM and DT are obtained by using the maximum reward of the data set as the desired reward.
Figure 4: Comparison of the performance of the DM under varying architecture configurations across the 7 D4RL tasks, for different demonstration data sets. The scale of the bar graphs is the maximum reward present in the respective data set. L is the number of layers, D is the embedding size, and K is the context length.
Figure 5: Comparison of the performance of the DM conditioned on the sequence of returns-to-go (RTG), under varying architecture configurations across the 7 D4RL tasks, for different demonstration data sets. The scale of the bar graphs is the maximum reward present in the respective data set. L is the number of layers, D is the embedding size, and K is the context length. The values are obtained by using the maximum reward of the data set as the desired reward.
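
The two-level control flow described in the Figure 1 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the function `run_hdm_episode`, the callables `high_level`, `low_level`, and `step`, and the toy policies are all hypothetical names introduced here. The sketch also assumes that the high-level module is queried at every time step and that each history is truncated to the last K entries (K = 20, as in Figure 3), neither of which is stated in the captions.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence, Tuple

State = Tuple[float, ...]
Action = float


def run_hdm_episode(
    high_level: Callable[[Sequence[State], Sequence[State]], State],
    low_level: Callable[[Sequence[State], Sequence[State], Sequence[Action]], Action],
    step: Callable[[State, Action], State],
    initial_state: State,
    horizon: int,
    context_length: int = 20,
) -> List[Action]:
    """Roll out a two-level controller, keeping the last K entries of each history."""
    states: Deque[State] = deque([initial_state], maxlen=context_length)
    sub_goals: Deque[State] = deque(maxlen=context_length)
    actions: Deque[Action] = deque(maxlen=context_length)
    taken: List[Action] = []
    for _ in range(horizon):
        # High level: pick a sub-goal state from the history of sub-goals and states.
        # (Querying it at every step is an assumption of this sketch.)
        sub_goals.append(high_level(list(sub_goals), list(states)))
        # Low level: pick an action from the history of states, sub-goals, and actions.
        action = low_level(list(states), list(sub_goals), list(actions))
        actions.append(action)
        taken.append(action)
        states.append(step(states[-1], action))
    return taken


# Hypothetical toy stand-ins for the learned modules: a 1-D state, a sub-goal
# one unit ahead of the current state, and an action that moves onto the sub-goal.
def toy_high(sub_goals: Sequence[State], states: Sequence[State]) -> State:
    return (states[-1][0] + 1.0,)


def toy_low(
    states: Sequence[State], sub_goals: Sequence[State], actions: Sequence[Action]
) -> Action:
    return sub_goals[-1][0] - states[-1][0]


def toy_step(state: State, action: Action) -> State:
    return (state[0] + action,)


toy_actions = run_hdm_episode(toy_high, toy_low, toy_step, (0.0,), horizon=3)
```

In this reading, the DM corresponds to the low-level call alone, conditioned on past states and actions without any sub-goal input. In the actual methods, both callables would be learned sequence models rather than hand-written functions.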
