Oryx: a Scalable Sequence Model for Many-Agent Coordination in Offline MARL

Claude Formanek, Omayma Mahjoub, Louay Ben Nessir, Sasha Abramowitz, Ruan de Kock, Wiem Khlifi, Daniel Rajaonarivonivelomanantsoa, Simon Du Toit, Arnol Fokam, Siddarth Singh, Ulrich Mbou Sob, Felix Chalumeau, Arnu Pretorius

TL;DR

Oryx tackles two core challenges of offline multi-agent reinforcement learning, extrapolation error and miscoordination, by combining a scalable, retention-based sequence model (Sable) with implicit constraint Q-learning (ICQ) in an autoregressive multi-agent policy. It introduces a dual-head decoder and a sequential ICQ update that conditions each agent's policy on the actions of preceding agents, enabling stable long-horizon coordination from logged data. Empirically, Oryx achieves state-of-the-art performance on more than 80% of the 65 tested datasets across SMAC, RWARE, and Multi-Agent MuJoCo (MAMuJoCo), and remains robust as the number of agents grows (up to 50 in Connector), outperforming both non-autoregressive and competing sequence-model baselines. The work also provides extensive datasets and code to support future research, and points to promising directions in offline-online hybrid settings and broader domain applicability for autoregressive policies.

Abstract

A key challenge in offline multi-agent reinforcement learning (MARL) is achieving effective many-agent multi-step coordination in complex environments. In this work, we propose Oryx, a novel algorithm for offline cooperative MARL to directly address this challenge. Oryx adapts the recently proposed retention-based architecture Sable and combines it with a sequential form of implicit constraint Q-learning (ICQ), to develop a novel offline autoregressive policy update scheme. This allows Oryx to solve complex coordination challenges while maintaining temporal coherence over long trajectories. We evaluate Oryx across a diverse set of benchmarks from prior works -- SMAC, RWARE, and Multi-Agent MuJoCo -- covering tasks of both discrete and continuous control, varying in scale and difficulty. Oryx achieves state-of-the-art performance on more than 80% of the 65 tested datasets, outperforming prior offline MARL methods and demonstrating robust generalisation across domains with many agents and long horizons. Finally, we introduce new datasets to push the limits of many-agent coordination in offline MARL, and demonstrate Oryx's superior ability to scale effectively in such settings.
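To make the idea of a sequential, autoregressive policy update more concrete, the following is a minimal sketch and not the authors' implementation. It factorises the joint policy so that each agent's action distribution is conditioned on the actions of earlier agents, and weights the resulting log-likelihood with softmax-normalised advantages in the spirit of implicit constraint Q-learning. The function names, the toy logit function, the advantage values, and the temperature `beta` are all assumptions made for illustration.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def toy_policy_logits(obs, prev_actions, n_actions):
    """Hypothetical stand-in for a learned decoder.

    Returns logits for one agent given its (scalar) observation and the
    actions already chosen by the agents that precede it in the ordering.
    """
    logits = [0.1 * obs * a for a in range(n_actions)]
    if prev_actions:
        # Toy coordination bias: favour the previous agent's action.
        logits[prev_actions[-1]] += 1.0
    return logits


def joint_log_prob(observations, actions, n_actions):
    """Autoregressive factorisation:
    log pi(a_1..a_n | o) = sum_i log pi_i(a_i | o_i, a_1..a_{i-1}).
    """
    total = 0.0
    for i, (obs, act) in enumerate(zip(observations, actions)):
        probs = softmax(toy_policy_logits(obs, actions[:i], n_actions))
        total += math.log(probs[act])
    return total


def icq_style_weights(advantages, beta):
    """Softmax of advantage / beta over the batch (assumed weighting scheme)."""
    return softmax([a / beta for a in advantages])


def weighted_policy_loss(batch, advantages, n_actions, beta=1.0):
    """Advantage-weighted negative log-likelihood of dataset joint actions."""
    weights = icq_style_weights(advantages, beta)
    return -sum(
        w * joint_log_prob(obs, acts, n_actions)
        for w, (obs, acts) in zip(weights, batch)
    )


# Two logged transitions for a two-agent task with three discrete actions.
batch = [([0.5, 1.0], [1, 1]), ([0.2, 0.3], [0, 2])]
advantages = [2.0, -1.0]  # made-up values for illustration
weights = icq_style_weights(advantages, beta=1.0)
loss = weighted_policy_loss(batch, advantages, n_actions=3)
```

Because the weights are computed only from actions that appear in the dataset, the policy is never asked to evaluate out-of-distribution actions, which is the motivation for this family of updates in the offline setting.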

Paper Structure

This paper contains 34 sections, 14 equations, 10 figures, 16 tables, 1 algorithm.

Figures (10)

  • Figure 1: Oryx's model architecture. The green blocks indicate the inputs to the model (in yellow), sourced from the dataset of online experiences (in blue). First, a sequence of agent observations from timestep $t$ to $t+k$ is passed through the encoder. Inside each retention block, the network performs joint reasoning over the agents $(a_1, \dots, a_n)$ and temporal context $(t, \dots, t+k)$, producing encoded representations at each timestep. These encoded observations, along with the actions from the dataset, are passed to the decoder, which has two heads. One head returns Q-values, while the second returns a policy distribution for each agent for the full sequence.
  • Figure 2: Evaluating long-horizon coordination. To isolate the importance of Oryx's individual components, we designed T-Maze, a minimal two-agent environment. The target states are revealed only at the first timestep, so agents must retain goal information throughout the episode and coordinate carefully at the end. Oryx solves the task only when all components are present, while baseline methods fail on both the replay and expert datasets.
  • Figure 3: Evaluating Oryx in many-agent settings. We compare Oryx, with its autoregressive ICQ loss and sequence model architecture, to MAICQ, a non-autoregressive CTDE algorithm. Both algorithms are trained on datasets from Jumanji's Connector scenarios with increasing numbers of agents. While MAICQ's performance degrades dramatically on scenarios with large numbers of agents, Oryx's performance remains robust.
  • Figure 4: Performance of Oryx across diverse benchmark datasets from prior literature. Scores are normalised relative to the current state-of-the-art, with values above 1 indicating that Oryx surpasses previous best-known results. Unnormalised scores are provided in the appendix. Gold stars indicate instances where Oryx matches or exceeds state-of-the-art performance, while black stars denote otherwise.
  • Figure 5: Environment visualisation for Connector.
  • ...and 5 more figures
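The two-headed decoder described in the Figure 1 caption can be illustrated with a small sketch. This is a hedged toy example under stated assumptions, not the architecture from the paper: the class name `DualHeadDecoder`, the linear heads, the random initialisation, and the choice to concatenate the encoded observation with a one-hot conditioning action are all hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, bias, x):
    """Dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


class DualHeadDecoder:
    """Toy decoder with a Q-value head and a policy head on shared inputs."""

    def __init__(self, obs_dim, n_actions, seed=0):
        rng = random.Random(seed)
        in_dim = obs_dim + n_actions  # encoded obs + one-hot action

        def init_matrix():
            return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
                    for _ in range(n_actions)]

        self.n_actions = n_actions
        self.q_weights = init_matrix()
        self.pi_weights = init_matrix()
        self.q_bias = [0.0] * n_actions
        self.pi_bias = [0.0] * n_actions

    def forward(self, encoded_obs, conditioning_action):
        # Assumption: the decoder input is the encoded observation
        # concatenated with a one-hot action taken from the dataset.
        one_hot = [1.0 if a == conditioning_action else 0.0
                   for a in range(self.n_actions)]
        features = list(encoded_obs) + one_hot
        q_values = linear(self.q_weights, self.q_bias, features)
        policy = softmax(linear(self.pi_weights, self.pi_bias, features))
        return q_values, policy


decoder = DualHeadDecoder(obs_dim=3, n_actions=4)
q_values, policy = decoder.forward([0.2, -0.1, 0.5], conditioning_action=2)
```

Both heads read the same features, so the Q-value estimates and the policy distribution are produced in one forward pass per agent and timestep.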

Theorems & Definitions (1)

  • proof