Sable: a Performant, Efficient and Scalable Sequence Model for MARL

Omayma Mahjoub; Sasha Abramowitz; Ruan de Kock; Wiem Khlifi; Simon du Toit; Jemma Daniel; Louay Ben Nessir; Louise Beyers; Claude Formanek; Liam Clark; Arnu Pretorius

TL;DR

Sable addresses the need for strong performance, memory efficiency, and scalability in cooperative MARL. It replaces attention with a retention mechanism adapted from Retentive Networks (RetNet) for the multi-agent setting, which allows temporal reasoning over full episodes with constant memory at inference time and efficient training via chunking. Across 45 tasks in six environments, Sable ranks best on 34, scales to more than a thousand agents with linear memory growth, and processes up to 6.5 times more steps per second than the Multi-Agent Transformer (MAT) at 512 agents while using GPU memory more efficiently. The work also provides extensive experimental data, a new scalable evaluation environment, and public code to support research in scalable, memory-efficient MARL.

Abstract

As multi-agent reinforcement learning (MARL) progresses towards solving larger and more complex problems, it becomes increasingly important that algorithms exhibit the key properties of (1) strong performance, (2) memory efficiency, and (3) scalability. In this work, we introduce Sable, a performant, memory-efficient, and scalable sequence modeling approach to MARL. Sable works by adapting the retention mechanism in Retentive Networks (Sun et al., 2023) to achieve computationally efficient processing of multi-agent observations with long context memory for temporal reasoning. Through extensive evaluations across six diverse environments, we demonstrate how Sable is able to significantly outperform existing state-of-the-art methods in a large number of diverse tasks (34 out of 45 tested). Furthermore, Sable maintains performance as we scale the number of agents, handling environments with more than a thousand agents while exhibiting a linear increase in memory usage. Finally, we conduct ablation studies to isolate the source of Sable's performance gains and confirm its efficient computational memory usage.
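The constant-memory inference and chunked training mentioned above both rest on the fact that retention can be computed in two equivalent ways. The sketch below illustrates this for the plain, single-head retention of Retentive Networks, not Sable's MARL-adapted decay matrix, and it omits normalization, gating, and positional rotations. The function names, shapes, and the decay value `gamma=0.9` are illustrative assumptions rather than anything taken from the Sable code base.

```python
import numpy as np

def retention_parallel(q, k, v, gamma):
    """Parallel form: (Q K^T * D) V with D[n, m] = gamma^(n - m) for n >= m, else 0."""
    n = q.shape[0]
    idx = np.arange(n)
    diff = idx[:, None] - idx[None, :]
    # Causal decay mask: older positions are discounted, future positions are zeroed.
    decay = np.where(diff >= 0, gamma ** np.maximum(diff, 0), 0.0)
    return ((q @ k.T) * decay) @ v

def retention_recurrent(q, k, v, gamma):
    """Recurrent form: S_n = gamma * S_{n-1} + k_n^T v_n, o_n = q_n S_n."""
    state = np.zeros((k.shape[1], v.shape[1]))  # fixed size, independent of sequence length
    outputs = []
    for q_t, k_t, v_t in zip(q, k, v):
        state = gamma * state + np.outer(k_t, v_t)
        outputs.append(q_t @ state)
    return np.stack(outputs)

# Hypothetical toy inputs: 6 sequence positions, feature dimension 4.
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(6, 4)) for _ in range(3))
out_parallel = retention_parallel(q, k, v, gamma=0.9)
out_recurrent = retention_recurrent(q, k, v, gamma=0.9)
```

Both functions return the same outputs up to floating-point error. The recurrent form carries only a fixed-size state between steps, which is why memory stays constant at inference time, while the parallel form processes a whole block of positions at once, which is what makes training on chunks efficient.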

Paper Structure (84 sections, 18 equations, 16 figures, 16 tables, 1 algorithm)

Introduction
Background
Problem Formulation
Retention
Method
Execution
Training
Adapting the decay matrix for MARL
Scaling and efficient memory usage
Code
Experiments
Evaluation protocol
Environments
Hyperparameters
Performance
...and 69 more sections

Figures (16)

Figure 1: Performance, memory, and scaling properties of Sable compared to the Multi-Agent Transformer (MAT; Wen et al., 2022), the previous state-of-the-art, aggregated over 45 cooperative MARL tasks. Left: Sable ranks best in 34 out of 45 tasks, outperforming all other MARL algorithms tested across 6 environments: RWARE, LBF, MABrax, SMAX, Connector, and MPE. MAT ranks best in 3 out of 45 tasks. Middle: Sable exhibits superior throughput, processing up to 6.5 times more steps per second than MAT as we scale to 512 agents. Right: Sable scales efficiently to thousands of agents, maintaining stable performance while using GPU memory significantly more efficiently than MAT.
Figure 2: Sable architecture and execution. The encoder receives all agent observations $o^1_t, ..., o^N_t$ from the current timestep $t$ along with a hidden state $h^{enc}_{t-1}$ representing past timesteps and produces encoded observations $\hat{o}^1_t, ..., \hat{o}^N_t$, observation-values $v(\hat{o}^1_t), ..., v(\hat{o}^N_t)$ and a new hidden state $h^{enc}_{t}$. The decoder performs recurrent retention over the current action $a_t^{m-1}$, followed by cross attention with the encoded observations, producing the next action $a_t^{m}$. The initial hidden states for recurrence over agents in the decoder at the current timestep are $(h^{dec_1}_{t-1}, h^{dec_2}_{t-1})$ and by the end of the decoding process, it generates the updated hidden states $(h^{dec_1}_{t}, h^{dec_2}_{t})$.
Figure 3: Sample efficiency curves and probability of improvement scores aggregated per environment suite. For each environment, results are aggregated over all tasks and the min-max normalized inter-quartile mean with 95% stratified bootstrap confidence intervals is shown. Inset plots indicate the overall aggregated probability of improvement for Sable compared to other baselines for that specific environment. A score above 0.5 whose confidence interval also lies entirely above 0.5 indicates a statistically significant improvement over a baseline for a given environment (Agarwal et al., 2021).
Figure 4: Memory usage and agent scalability. When scaling to many agents, Sable achieves superior converged performance while maintaining memory efficiency. MAT runs out of memory on Neom with 1024 agents and thus its curve is omitted from (h).
Figure 5: Ablation studies on RWARE and SMAX. (a) Comparing Sable with MAT with modifications from Sable's implementation details. (b) Showing the relationship between chunk size, performance and memory usage on RWARE.
...and 11 more figures
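
The aggregate metric named in the Figure 3 caption, a min-max normalized inter-quartile mean (IQM) with stratified bootstrap confidence intervals, can be sketched as follows. This is a minimal illustration under stated assumptions and not the evaluation code used in the paper: the helper names, the toy scores, the per-task normalization bounds, and the number of resamples are all hypothetical.

```python
import numpy as np

def min_max_normalize(scores, low, high):
    """Rescale raw returns to [0, 1] per task, given per-task lower and upper bounds."""
    return (scores - low) / (high - low)

def iqm(scores):
    """Inter-quartile mean: mean of the middle 50% of all (run, task) scores."""
    flat = np.sort(np.asarray(scores, dtype=float).ravel())
    cut = int(0.25 * flat.size)
    return flat[cut:flat.size - cut].mean()

def stratified_bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile interval for the IQM, resampling runs independently within each task."""
    rng = np.random.default_rng(seed)
    n_runs, n_tasks = scores.shape
    estimates = np.empty(n_resamples)
    for i in range(n_resamples):
        # One set of resampled run indices per task column (the strata are the tasks).
        idx = rng.integers(0, n_runs, size=(n_runs, n_tasks))
        estimates[i] = iqm(np.take_along_axis(scores, idx, axis=0))
    return np.quantile(estimates, [alpha / 2, 1 - alpha / 2])

# Hypothetical raw returns: 4 independent runs (rows) on 2 tasks (columns).
raw = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
# Assumption: bounds are taken from the observed minimum and maximum on each task.
normalized = min_max_normalize(raw, raw.min(axis=0), raw.max(axis=0))
point_estimate = iqm(normalized)
ci_low, ci_high = stratified_bootstrap_ci(normalized)
```

For these toy scores the point estimate is 0.5, since the middle half of the eight normalized values is two copies each of 1/3 and 2/3. Resampling within each task separately keeps every task equally represented in each bootstrap sample, which is the sense in which the bootstrap is stratified.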
