Adaptive Value Decomposition: Coordinating a Varying Number of Agents in Urban Systems

Yexin Li, Jinjin Guo, Haoyu Zhang, Yuhan Zhao, Yiwen Sun, Zihao Jiao

TL;DR

Adaptive Value Decomposition (AVD) targets two challenges of urban multi-agent systems (MAS): a fluctuating number of active agents and actions with heterogeneous durations. It combines a hypernetwork-driven value decomposition that handles a variable number of agents with a monotonic mixing network. The approach supports centralized training with decentralized execution (CTDE) in a semi-MARL setting and adds a lightweight module that mitigates action homogenization, preserving coordination while retaining policy sharing. Evaluation on real-world bike-sharing data from London and Washington, D.C. shows that AVD outperforms state-of-the-art baselines under both static and dynamic agent populations and generalizes zero-shot to unseen agent counts. The work also contributes a scalable benchmark and a training-execution strategy tailored to semi-MARL in urban systems, with practical implications for real-time resource redistribution and multi-agent coordination in dynamic environments.

Abstract

Multi-agent reinforcement learning (MARL) provides a promising paradigm for coordinating multi-agent systems (MAS). However, most existing methods rely on restrictive assumptions, such as a fixed number of agents and fully synchronous action execution. These assumptions are often violated in urban systems, where the number of active agents varies over time, and actions may have heterogeneous durations, resulting in a semi-MARL setting. Moreover, while sharing policy parameters among agents is commonly adopted to improve learning efficiency, it can lead to highly homogeneous actions when a subset of agents make decisions concurrently under similar observations, potentially degrading coordination quality. To address these challenges, we propose Adaptive Value Decomposition (AVD), a cooperative MARL framework that adapts to a dynamically changing agent population. AVD further incorporates a lightweight mechanism to mitigate action homogenization induced by shared policies, thereby encouraging behavioral diversity and maintaining effective cooperation among agents. In addition, we design a training-execution strategy tailored to the semi-MARL setting that accommodates asynchronous decision-making when some agents act at different times. Experiments on real-world bike-sharing redistribution tasks in two major cities, London and Washington, D.C., demonstrate that AVD outperforms state-of-the-art baselines, confirming its effectiveness and generalizability.

Paper Structure (29 sections, 13 equations, 6 figures, 7 tables, 1 algorithm)

Figures (6)

  • Figure 1: Illustration of AVD, where $v, j, k \in \mathcal{V}_\tau$ denote active agents. Panel (b) illustrates the overall architecture of AVD. The current state $s_{\tau}$ is input to the shared agent network, as depicted in panel (c), which outputs the action-values $Q_{v}(s_{\tau}, a_{\tau}^{v}), Q_{j}(s_{\tau}, a_{\tau}^{j}), ..., Q_{k}(s_{\tau}, a_{\tau}^{k})$ for each agent under their selected actions $a_{\tau}^{v}, a_{\tau}^{j}, ..., a_{\tau}^{k}$. These action-values, along with the state $s_{\tau}$, are then fed into the mixing network, as shown in panel (a). The mixing network generates the weights $w_{v}, w_{j}, ..., w_{k}$ and biases $b_{v}, b_{j}, ..., b_{k}$ for the agents, and computes the total action-value using Eq. \ref{eq:total_value}.
  • Figure 2: Daily cumulative net outflows remain minimal relative to cumulative trip volumes. Both bike rentals and net outflows are normalized by the maximum rental count across the regions to highlight the scale difference between the two measures.
  • Figure 3: Zero-shot performance of AVD trained with $|\mathcal{V}_\tau| \equiv 4$ and directly deployed with $|\mathcal{V}_\tau| \equiv 3$ without retraining, compared to AVD and VDN+ trained with $|\mathcal{V}_\tau| \equiv 3$ and OPT with $|\mathcal{V}_\tau| \equiv 3$. In this setting, the initial inventory is set to 30%, and the observed conclusions generalize to other configurations.
  • Figure 4: Ablation study of the lightweight mechanism for mitigating action homogenization. AVD w/o H denotes the variant of AVD without this mechanism. In this experiment, the number of agents is fixed at $|\mathcal{V}_\tau| \equiv 4$, and the observed conclusions generalize to other settings.
  • Figure 5: Map of London partitioned into 5 regions.
  • ...and 1 more figure
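
The mixing step described in the Figure 1 caption can be illustrated with a small NumPy sketch. It is not the authors' implementation: the equation the caption refers to is not reproduced on this page, so the weighted-sum form $Q_{tot} = \sum_{v} (|w_v| Q_v + b_v)$ is an assumption, chosen because non-negative weights keep the total monotonic in each agent's action-value. The per-agent feature vectors, the layer sizes, and the names `init_hypernet` and `mix` are hypothetical. Sharing one small hypernetwork across agents is what lets the same parameters serve any number of active agents.

```python
import numpy as np


def init_hypernet(state_dim, agent_feat_dim, hidden_dim, seed=0):
    """Create parameters for a small hypernetwork shared by all agents (hypothetical sizes)."""
    rng = np.random.default_rng(seed)
    in_dim = state_dim + agent_feat_dim
    return {
        "W1": rng.normal(0.0, 0.1, size=(in_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "W2": rng.normal(0.0, 0.1, size=(hidden_dim, 2)),  # outputs (weight, bias)
        "b2": np.zeros(2),
    }


def mix(params, state, agent_feats, agent_qs):
    """Combine per-agent action-values into one total value.

    state:       (state_dim,) global state
    agent_feats: (n, agent_feat_dim) per-agent features (an assumption of this sketch)
    agent_qs:    (n,) action-values of the currently active agents
    """
    n = agent_qs.shape[0]
    # The same hypernetwork is applied to every active agent, so n may vary freely.
    x = np.concatenate([np.tile(state, (n, 1)), agent_feats], axis=1)
    h = np.tanh(x @ params["W1"] + params["b1"])
    out = h @ params["W2"] + params["b2"]
    w = np.abs(out[:, 0])  # non-negative weights make the total monotonic in each Q
    b = out[:, 1]
    q_tot = float(np.sum(w * agent_qs + b))
    return q_tot, w, b


if __name__ == "__main__":
    params = init_hypernet(state_dim=4, agent_feat_dim=2, hidden_dim=8)
    rng = np.random.default_rng(1)
    state = rng.normal(size=4)
    for n in (3, 4):  # same parameters, different numbers of active agents
        q_tot, w, b = mix(params, state, rng.normal(size=(n, 2)), rng.normal(size=n))
        print(n, q_tot, w.shape)
```

Because the weights are non-negative, raising any single agent's action-value cannot lower the total, so each agent's greedy choice under its own action-value is consistent with the greedy choice under the total. That consistency is the usual motivation for monotonic mixing in CTDE methods.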