Adaptive Context Length Optimization with Low-Frequency Truncation for Multi-Agent Reinforcement Learning

Wenchang Duan, Yaoliang Yu, Jiwan He, Yi Shi

TL;DR

This work addresses the drawbacks of large fixed context lengths in MARL and proposes ACL-LFT, which uses a central agent to choose the context length adaptively and a Fourier-based low-frequency truncation to reduce redundancy. The central agent selects the truncation length from a discrete set and uses multi-head attention to shape the reward signal, while the decentralized agents operate on filtered temporal information. The method is supported by a theoretical result showing a long-term advantage of adaptive context length and by extensive experiments on PettingZoo, MiniGrid, GRF, and SMACv2, where it achieves state-of-the-art performance. The aim is scalable, robust MARL that captures long-range dependencies without prohibitive computation.

Abstract

Recently, deep multi-agent reinforcement learning (MARL) has demonstrated promising performance on challenging tasks, such as those involving long-term dependencies and non-Markovian environments. Its success is partly attributed to conditioning policies on a large fixed context length. However, such large fixed context lengths may lead to limited exploration efficiency and redundant information. In this paper, we propose a novel MARL framework to obtain adaptive and effective contextual information. Specifically, we design a central agent that dynamically optimizes the context length via temporal gradient analysis, enhancing exploration to facilitate convergence to global optima in MARL. Furthermore, to enhance the adaptive optimization capability of the context length, we present an efficient input representation for the central agent, which effectively filters redundant information. By leveraging a Fourier-based low-frequency truncation method, we extract global temporal trends across decentralized agents, providing an effective and efficient representation of the MARL environment. Extensive experiments demonstrate that the proposed method achieves state-of-the-art (SOTA) performance on long-term dependency tasks in PettingZoo, MiniGrid, Google Research Football (GRF), and StarCraft Multi-Agent Challenge v2 (SMACv2).
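
As a rough illustration of what a Fourier-based low-frequency truncation does, the sketch below transforms a one-dimensional temporal signal, zeroes every component above a cutoff, and transforms back. It is not the authors' implementation: the function name `low_frequency_truncate`, the `keep` parameter, and the use of a naive discrete Fourier transform on a scalar series are all assumptions made for the example.

```python
import cmath
import math


def low_frequency_truncate(series, keep):
    """Keep only the `keep` lowest-frequency DFT components of a real series.

    Hypothetical helper for illustration; not taken from the paper.
    """
    n = len(series)
    # Forward DFT (naive O(n^2), which is fine for a short illustration).
    spectrum = [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(series))
        for k in range(n)
    ]
    # For real input, bins k and n - k carry the same frequency, so both are
    # tested together. Bins at or above the cutoff are zeroed.
    filtered = [c if min(k, n - k) < keep else 0.0 for k, c in enumerate(spectrum)]
    # Inverse DFT; the imaginary part is only rounding noise for real input.
    return [
        sum(c * cmath.exp(2j * math.pi * k * t / n) for k, c in enumerate(filtered)).real / n
        for t in range(n)
    ]


# Usage: a slow trend (frequency bin 1) plus a fast oscillation (bin 16).
n = 64
slow = [math.sin(2 * math.pi * t / n) for t in range(n)]
noisy = [s + 0.5 * math.sin(2 * math.pi * 16 * t / n) for t, s in enumerate(slow)]
smooth = low_frequency_truncate(noisy, keep=4)
# The fast component is removed, so only rounding error separates the two.
max_error = max(abs(a - b) for a, b in zip(smooth, slow))
```

In this toy case the cutoff of 4 sits between the two frequencies, so the slow trend is recovered and the fast oscillation is discarded, which is the sense in which truncation keeps "global temporal trends".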

Paper Structure

This paper contains 25 sections, 1 theorem, 62 equations, 5 figures, 9 tables, 1 algorithm.

Key Result

Theorem 1

At time $t$, let $L_{\text{adap}}$ be the adaptive context length, $L_{\text{fix}}$ be the fixed context length, and the mutual information loss of $L$ be denoted as $\mathcal{L}_t(L)$. The expected cumulative reward difference between adaptive and fixed context length satisfies the following regr... where $0 \leq \alpha < 1$, with $\alpha$ being a non-deterministic parameter whose formal definitio...

Figures (5)

  • Figure 1: Schematic of ACL-LFT. At each time $t$, the historical state $s_t^{-1}$ is first processed by the Fourier-based low-frequency truncation module. The central agent takes the truncated information $s^c_t$ as input and adaptively optimizes the context length. The decentralized agents then integrate the optimized contextual information $s_t^{-opt}$ with the current state to make decisions.
  • Figure 2: Simple Spread (a) is a search game where agents learn to cover all the landmarks while avoiding collisions. MiniGrid Soccer Game (b) is a 15×15 environment where agents (triangles) earn rewards by kicking the ball (circle) into same-colored goalmouths (squares). Academy 3 vs 1 with Keeper (c) is a scenario where three offensive agents attempt to score against one defender and a goalkeeper. Academy Counterattack-Hard (d) is a scenario where four agents must execute a rapid counterattack while avoiding defenders.
  • Figure 3: Performance Comparison with Sequence Processing Methods in Four Environments
  • Figure 4: Performance Comparison with Different Fixed Lengths on GRF
  • Figure 5: Comparison of ablation studies: (a) 3 vs 1 with Keeper; (b) Counterattack-Hard.
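
The data flow described in the Figure 1 caption can be sketched roughly as follows: a central component picks one context length from a discrete set, and each decentralized agent concatenates that many recent historical states with its current state. The epsilon-greedy selection over precomputed scores is a stand-in chosen for brevity, not the temporal-gradient-based optimization the paper describes. The names `select_context_length` and `build_agent_input`, the candidate lengths, and the score values are all hypothetical.

```python
import random


def select_context_length(scores, epsilon=0.0, rng=random):
    """Pick a context length from a discrete candidate set.

    `scores` maps each candidate length to an estimated value. This
    epsilon-greedy rule is an assumption for illustration only.
    """
    if rng.random() < epsilon:
        return rng.choice(list(scores))  # explore: uniform over candidates
    return max(scores, key=scores.get)   # exploit: highest estimated value


def build_agent_input(history, current_state, length):
    """Concatenate the most recent `length` historical states with the current state."""
    context = history[-length:] if length > 0 else []
    flat_context = [x for state in context for x in state]
    return flat_context + list(current_state)


# Usage with made-up numbers: four past states of dimension 2.
history = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1]]
scores = {2: 0.3, 4: 0.9, 8: 0.1}  # hypothetical value estimates per length
chosen = select_context_length(scores, epsilon=0.0)
agent_input = build_agent_input(history, [9.0, 9.1], chosen)
```

With `epsilon=0.0` the choice is purely greedy, so the length with the highest score is used and the agent input grows or shrinks with it. A fixed-length baseline corresponds to always passing the same `length`.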

Theorems & Definitions (1)

  • Theorem 1: Long-Term Advantage Lower Bound of Adaptive Length