Beyond Monotonicity: Revisiting Factorization Principles in Multi-Agent Q-Learning
Tianmeng Hu, Yongzheng Cui, Rui Tang, Biao Luo, Ke Li
TL;DR
This work studies IGM consistency in multi-agent value function factorization by applying a dynamical-systems analysis to non-monotonic value decomposition under centralized training with decentralized execution (CTDE). A continuous-time gradient-flow analysis shows that, under approximately greedy exploration, only IGM-consistent fixed points are stable, while IGM-inconsistent ones are unstable saddle points; experiments on matrix games and MARL benchmarks support this result. The method combines a non-monotonic mixing function, SARSA-style TD($\lambda$) targets, and intrinsic rewards based on Random Network Distillation, and it outperforms monotonic baselines on SMAC and GRF. These results suggest that relaxing monotonicity constraints, when paired with appropriate exploration and learning dynamics, can yield more expressive and effective value-based MARL algorithms.
Abstract
Value decomposition is a central approach in multi-agent reinforcement learning (MARL), enabling centralized training with decentralized execution by factorizing the global value function into local values. To ensure individual-global-max (IGM) consistency, existing methods either enforce monotonicity constraints, which limit expressive power, or adopt softer surrogates at the cost of algorithmic complexity. In this work, we present a dynamical systems analysis of non-monotonic value decomposition, modeling learning dynamics as continuous-time gradient flow. We prove that, under approximately greedy exploration, all zero-loss equilibria violating IGM consistency are unstable saddle points, while only IGM-consistent solutions are stable attractors of the learning dynamics. Extensive experiments on both synthetic matrix games and challenging MARL benchmarks demonstrate that unconstrained, non-monotonic factorization reliably recovers IGM-optimal solutions and consistently outperforms monotonic baselines. Additionally, we investigate the influence of temporal-difference targets and exploration strategies, providing actionable insights for the design of future value-based MARL algorithms.
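To make the IGM condition concrete, the following minimal sketch checks whether the per-agent greedy actions agree with the greedy joint action in a one-step, two-agent matrix game. The payoff matrix, the local value vectors, and the helper names (`greedy_joint_action`, `is_igm_consistent`) are hypothetical and chosen for illustration only; they are not taken from the paper's experiments. The sketch also assumes each argmax is unique.

```python
import numpy as np


def greedy_joint_action(q_tot):
    """Return the joint action (one index per agent) that maximizes q_tot."""
    flat_index = np.argmax(q_tot)
    return tuple(int(i) for i in np.unravel_index(flat_index, q_tot.shape))


def is_igm_consistent(q_tot, local_qs):
    """True if each agent's greedy local action matches the greedy joint action.

    Assumes unique maximizers; np.argmax breaks ties by taking the first index.
    """
    local_greedy = tuple(int(np.argmax(q)) for q in local_qs)
    return local_greedy == greedy_joint_action(q_tot)


# Hypothetical non-monotonic payoff table: rows index agent 1's action,
# columns index agent 2's action. The optimum (0, 0) is surrounded by penalties.
payoff = np.array([
    [8.0, -12.0, -12.0],
    [-12.0, 0.0, 0.0],
    [-12.0, 0.0, 0.0],
])

# Hypothetical local values whose greedy actions recover the joint optimum.
consistent_locals = [np.array([1.0, 0.0, 0.0]), np.array([1.0, 0.0, 0.0])]

# Hypothetical local values that settle on the "safe" suboptimal action (1, 1).
inconsistent_locals = [np.array([0.0, 1.0, 0.5]), np.array([0.0, 1.0, 0.5])]

print(greedy_joint_action(payoff))
print(is_igm_consistent(payoff, consistent_locals))
print(is_igm_consistent(payoff, inconsistent_locals))
```

In this toy setting, the second pair of local values corresponds to the kind of IGM-inconsistent solution that the abstract's analysis characterizes as unstable under approximately greedy exploration; the sketch only checks consistency and does not simulate the learning dynamics.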
