A Multi-Agent, Policy-Gradient approach to Network Routing
Nigel Tao, Jonathan Baxter, Lex Weaver
TL;DR
This paper reframes network routing as a multi-agent POMDP and applies OLPOMDP, an online policy-gradient algorithm, with Gibbs policies so that routers learn cooperative, distributed routing policies using only local information and a global reward signal. Through a series of simulations, the authors show that the approach yields both deterministic and mixed strategies that optimize long-term average travel time, and that reward shaping (e.g., penalizing cycles) accelerates convergence. They demonstrate robustness across simple and Braess-type networks, including resolving paradoxical scenarios where additional paths degrade performance. The work highlights the potential of policy-gradient reinforcement learning for scalable, model-free coordination in distributed routing and other multi-agent systems.
Abstract
Network routing is a distributed decision problem which naturally admits numerical performance measures, such as the average time for a packet to travel from source to destination. OLPOMDP, a policy-gradient reinforcement learning algorithm, was successfully applied to simulated network routing under a number of network models. Multiple distributed agents (routers) learned co-operative behavior without explicit inter-agent communication, and they avoided behavior which was individually desirable, but detrimental to the group's overall performance. Furthermore, shaping the reward signal by explicitly penalizing certain patterns of sub-optimal behavior was found to dramatically improve the convergence rate.
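
To make the learning rule concrete, the following is a minimal sketch of an OLPOMDP-style update with Gibbs (softmax) policies. It is not the authors' simulator. The two-router toy network, the link delays, the discount `beta`, the step size, and the names `gibbs`, `Router`, and `train` are all assumptions made for illustration. Each router keeps its own parameters and eligibility trace, sees only its own choice, and receives the same global reward (the negative total delay of the packet).

```python
import math
import random


def gibbs(theta):
    """Gibbs (softmax) distribution over actions for parameters theta."""
    m = max(theta)  # subtract the max for numerical stability
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]


class Router:
    """Hypothetical agent: one parameter and one trace entry per outgoing link."""

    def __init__(self, n_links):
        self.theta = [0.0] * n_links
        self.trace = [0.0] * n_links
        self.last_grad = [0.0] * n_links

    def choose(self, rng):
        probs = gibbs(self.theta)
        link = rng.choices(range(len(probs)), weights=probs)[0]
        # Gradient of log pi(link) for a Gibbs policy: indicator minus probabilities.
        self.last_grad = [(1.0 if i == link else 0.0) - p
                          for i, p in enumerate(probs)]
        return link

    def learn(self, reward, beta, step):
        # OLPOMDP-style update: discount the trace, add the new gradient,
        # then move the parameters along reward * trace.
        for i, g in enumerate(self.last_grad):
            self.trace[i] = beta * self.trace[i] + g
            self.theta[i] += step * reward * self.trace[i]


def train(delays, n_packets=20000, beta=0.5, step=0.01, seed=0):
    """delays[k][j] is the assumed delay of link j at router k (toy values)."""
    rng = random.Random(seed)
    routers = [Router(len(d)) for d in delays]
    for _ in range(n_packets):
        # Each packet passes through every router in sequence.
        total_delay = sum(d[r.choose(rng)] for r, d in zip(routers, delays))
        # The same global reward is broadcast to all routers.
        for r in routers:
            r.learn(-total_delay, beta, step)
    return [gibbs(r.theta) for r in routers]


policies = train(delays=[[1.0, 3.0], [1.0, 2.0]])
for k, p in enumerate(policies):
    print(f"router {k}: link probabilities {p}")
```

In this toy setting the faster link at each router is link 0, so both learned distributions should concentrate there. Reward shaping of the kind the abstract describes would correspond to adding an extra penalty term to the reward passed to `learn`.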
