Multi-agent In-context Coordination via Decentralized Memory Retrieval

Tao Jiang, Zichuan Lin, Lihe Li, Yi-Chen Li, Cong Guan, Lei Yuan, Zongzhang Zhang, Yang Yu, Deheng Ye

TL;DR

The paper tackles rapid coordination in cooperative Dec-POMDPs by leveraging in-context learning. It introduces MAICC, a framework that trains a centralized embedding model (CEM) and decentralized embedding models (DEMs) to retrieve task-relevant trajectories as context, complemented by a memory mechanism that blends offline and online data and a hybrid utility score for credit assignment. Theoretical regret guarantees accompany empirical results on the Level-Based Foraging and SMAC/SMACv2 benchmarks, where MAICC adapts faster than baselines and ablations. The work advances sample-efficient adaptation without parameter updates in decentralized multi-agent systems and opens avenues for applying in-context techniques to complex multi-agent tasks.

Abstract

Large transformer models, trained on diverse datasets, have demonstrated impressive few-shot performance on previously unseen tasks without requiring parameter updates. This capability has also been explored in Reinforcement Learning (RL), where agents interact with the environment to retrieve context and maximize cumulative rewards, showcasing strong adaptability in complex settings. However, in cooperative Multi-Agent Reinforcement Learning (MARL), where agents must coordinate toward a shared goal, decentralized policy deployment can lead to mismatches in task alignment and reward assignment, limiting the efficiency of policy adaptation. To address this challenge, we introduce Multi-agent In-context Coordination via Decentralized Memory Retrieval (MAICC), a novel approach designed to enhance coordination by fast adaptation. Our method involves training a centralized embedding model to capture fine-grained trajectory representations, followed by decentralized models that approximate the centralized one to obtain team-level task information. Based on the learned embeddings, relevant trajectories are retrieved as context, which, combined with the agents' current sub-trajectories, inform decision-making. During decentralized execution, we introduce a novel memory mechanism that effectively balances test-time online data with offline memory. Based on the constructed memory, we propose a hybrid utility score that incorporates both individual- and team-level returns, ensuring credit assignment across agents. Extensive experiments on cooperative MARL benchmarks, including Level-Based Foraging (LBF) and SMAC (v1/v2), show that MAICC enables faster adaptation to unseen tasks compared to existing methods. Code is available at https://github.com/LAMDA-RL/MAICC.
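The retrieval step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `retrieve_context`, the memory layout, the min-max normalization, and the weights `alpha` and `beta` are all assumptions made for illustration, since the exact scoring rule is not given on this page. The sketch ranks stored trajectories by embedding similarity to the current sub-trajectory plus a hybrid utility that mixes individual and team returns.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def min_max(values):
    """Rescale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def retrieve_context(query_emb, memory, k=2, alpha=0.5, beta=1.0):
    """Return indices of the k highest-scoring trajectories in memory.

    Hypothetical scoring rule (an assumption, not taken from the paper):
        score = similarity + beta * (alpha * individual + (1 - alpha) * team)
    where both returns are min-max normalized over the memory.
    """
    if not memory:
        return []
    ind = min_max([m["ind_return"] for m in memory])
    team = min_max([m["team_return"] for m in memory])
    scored = []
    for idx, m in enumerate(memory):
        similarity = cosine(query_emb, m["emb"])
        utility = alpha * ind[idx] + (1 - alpha) * team[idx]
        scored.append((similarity + beta * utility, idx))
    # Highest score first; ties broken by the lower index.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [idx for _, idx in scored[:k]]


# Toy memory; in practice it could hold both offline and test-time trajectories.
memory = [
    {"emb": [1.0, 0.0], "ind_return": 1.0, "team_return": 2.0},
    {"emb": [0.9, 0.1], "ind_return": 3.0, "team_return": 6.0},
    {"emb": [0.2, 1.0], "ind_return": 5.0, "team_return": 10.0},
]
similar_only = retrieve_context([1.0, 0.0], memory, k=2, beta=0.0)
with_utility = retrieve_context([1.0, 0.0], memory, k=2, beta=1.0)
```

With `beta=0.0` the ranking depends on similarity alone; raising `beta` lets high-return trajectories displace ones that are merely close in embedding space, which is the trade-off a hybrid score is meant to control.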

Paper Structure

This paper contains 31 sections, 3 theorems, 10 equations, 8 figures, 4 tables, 1 algorithm.

Key Result

Theorem 1

Suppose $\sup_\mathcal{M} P(\mathcal{M})/P_{\mathcal{D}}(\mathcal{M}) \le C$ for some $C > 0$, where $P_{\mathcal{D}}(\mathcal{M})$ denotes the training task distribution. Then the expected online cumulative regret of MAICC satisfies $\mathbb{E}_{P(\mathcal{M})}[\mathbf{Reg}_\mathcal{M}] \le \tilde{

Figures (8)

  • Figure 1: The conceptual workflow of MAICC. Dashed lines show data flow during centralized training, where CEM samples offline trajectories for training and distills team information to DEMs. Solid lines show data flow during decentralized execution, where sub-trajectories retrieve trajectories from the constructed memory based on similarity and hybrid utility score. Blue $\circ$ and purple $\triangledown$ denote different embeddings output by CEM and DEMs, respectively. $\oplus$ denotes concatenation of retrieved trajectories with the current sequence, which helps decision models adapt quickly.
  • Figure 2: The illustration of CEM. Intra-team visibility enables observation and action tokens within the same team to attend to each other at each time step. The causal transformer predicts individual actions and rewards, while the post-step information token, concatenated with the previous individual observation, is used to predict the next observation.
  • Figure 3: In-context adaptation performance across different scenarios. Each scenario is evaluated over 50 test runs on randomly sampled tasks, with results reported as the mean return and 95% confidence interval.
  • Figure 4: Visualization results illustrating the effects of different embedding model training settings. Each point in the figure represents the embedding of a trajectory from the dataset, with points of the same color corresponding to trajectories from the same task.
  • Figure 5: Illustration of LBF: 9x9-20s. The agents are required to cooperate within a limited number of time steps to concurrently collect the food based on their local observations. The blue areas indicate the agents’ local fields of view, the yellow areas represent possible spawn locations for the food (each corresponding to a specific task), and the red apples denote the food positions included in the training tasks.
  • ...and 3 more figures
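
The intra-team visibility mentioned in the Figure 2 caption can be read as a block-causal attention mask: tokens within one time step see each other, while later steps stay hidden. The sketch below shows that one reading only. The token ordering (all of a step's observation and action tokens stored contiguously) and the helper name `block_causal_mask` are assumptions, and the actual CEM may restrict attention further.

```python
def block_causal_mask(num_steps, tokens_per_step):
    """Build a boolean mask where mask[q][k] is True if query q may attend to key k.

    Assumed layout: tokens are ordered step by step, with each step holding
    `tokens_per_step` tokens (e.g. one observation and one action per agent).
    A token may attend to every token of its own step and of earlier steps.
    """
    step_of = [t for t in range(num_steps) for _ in range(tokens_per_step)]
    n = len(step_of)
    return [[step_of[k] <= step_of[q] for k in range(n)] for q in range(n)]


# Two time steps, two agents, one observation and one action token per agent.
mask = block_causal_mask(num_steps=2, tokens_per_step=4)
```

Compared with a standard token-level causal mask, this variant lets the first token of a step attend to the last token of the same step, which is what allows information to be shared across teammates within a step.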

Theorems & Definitions (5)

  • Theorem 1
  • Lemma 1
  • proof
  • Theorem 2
  • proof