A Multi-Agent, Policy-Gradient approach to Network Routing

Nigel Tao, Jonathan Baxter, Lex Weaver

TL;DR

This paper reframes network routing as a multi-agent POMDP and applies OLPOMDP, an online policy-gradient algorithm, with Gibbs policies so that routers learn cooperative, distributed routing policies using only local information and a global reward signal. Through a series of simulations, the authors show that the approach yields both deterministic and mixed strategies that optimize long-term average travel time, and that reward shaping (e.g., penalizing cycles) accelerates convergence. They demonstrate robustness across simple and Braess-type networks, including resolving paradoxical scenarios where additional paths degrade performance. The work highlights the potential of policy-gradient reinforcement learning for scalable, model-free coordination in distributed routing and other multi-agent systems.

Abstract

Network routing is a distributed decision problem which naturally admits numerical performance measures, such as the average time for a packet to travel from source to destination. OLPOMDP, a policy-gradient reinforcement learning algorithm, was successfully applied to simulated network routing under a number of network models. Multiple distributed agents (routers) learned co-operative behavior without explicit inter-agent communication, and they avoided behavior which was individually desirable, but detrimental to the group's overall performance. Furthermore, shaping the reward signal by explicitly penalizing certain patterns of sub-optimal behavior was found to dramatically improve the convergence rate.
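
The summary above does not spell out the update rule, so the following is only a minimal sketch of how a single router might run an OLPOMDP-style update, assuming the commonly cited form of the algorithm: an eligibility trace accumulates the gradient of the log-probability of each chosen action, and the parameters move along the trace scaled by the global reward. The function names, the one-parameter-per-outgoing-link parameterization, and the values of `beta` and `step` are illustrative assumptions, not taken from the paper.

```python
import math

def gibbs_probs(theta):
    """Gibbs (softmax) distribution over outgoing links.

    Assumption: one real-valued parameter per link, no observation features.
    """
    m = max(theta)  # subtract the max for numerical stability
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]

def olpomdp_step(theta, trace, action, reward, beta=0.9, step=0.01):
    """One online update for a single router, modifying theta and trace in place.

    trace <- beta * trace + grad log mu(action)
    theta <- theta + step * reward * trace
    For a Gibbs policy, grad log mu(action) w.r.t. theta[i] is
    (1 if i == action else 0) - mu(i).
    """
    probs = gibbs_probs(theta)  # probabilities under the pre-update parameters
    for i in range(len(theta)):
        grad_log = (1.0 if i == action else 0.0) - probs[i]
        trace[i] = beta * trace[i] + grad_log
        theta[i] += step * reward * trace[i]

# Hypothetical usage: a router with two outgoing links picks link 0 and then
# receives a positive global reward, so link 0 becomes slightly more likely.
theta = [0.0, 0.0]
trace = [0.0, 0.0]
olpomdp_step(theta, trace, action=0, reward=1.0, beta=0.9, step=0.1)
print(gibbs_probs(theta))
```

In a multi-agent setting of the kind described, each router would keep its own `theta` and `trace` and call the update with the same broadcast reward, which is one way to read "local information and a global reward signal".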

Paper Structure

This paper contains 10 sections, 8 equations, 8 figures.

Figures (8)

  • Figure 1: The triangle network.
  • Figure 2: Reward signal $r_t$ and probability $\mu^{\mathsf{A}}_{\mathsf{AB}}({\mathsf{C}})$ for OLPOMDP on the triangle network.
  • Figure 3: The contention network.
  • Figure 4: Reward signal $r_t$ and probability $\mu^{\mathsf{A}}_{\mathit{top\_link}}({\mathsf{B}})$ for OLPOMDP on the 2-node contention network.
  • Figure 5: The complete six-node network. All links have delay 1 and unlimited capacity.
  • ...and 3 more figures