Q-adaptive: A Multi-Agent Reinforcement Learning Based Routing on Dragonfly Network

Yao Kang; Xin Wang; Zhiling Lan

TL;DR

This work tackles suboptimal Dragonfly routing arising from reliance on local congestion signals by introducing Q-adaptive routing, a fully distributed multi-agent reinforcement learning scheme. It uses a novel two-level Q-table to capture source-destination and intermediate-path information, enabling efficient learning with half the memory of traditional Q-routing and guaranteeing delivery within five hops. Implemented in SST/Merlin, Q-adaptive outperforms existing adaptive routing (UGAL and PAR) and, in some cases, even VALn under adversarial traffic, achieving up to 10.5% higher throughput and up to 5.2× lower average latency on 1k–2.5k node Dragonfly systems, with convergence within 500 μs. The results indicate strong scalability and practical potential for MARL-based routing on high-radix interconnects, motivating future investigations into application-driven behavior and inter-job interference mitigation.

Abstract

High-radix interconnects such as Dragonfly and its variants rely on adaptive routing to balance network traffic for optimum performance. Ideally, adaptive routing chooses between minimal and non-minimal paths so that each packet is forwarded along the least congested one. In practice, current adaptive routing algorithms estimate routing path congestion based on local information such as output queue occupancy. Using local information to estimate global path congestion is inevitably inaccurate because a router has no precise knowledge of link states a few hops away. This inaccuracy can lead to interconnect congestion. In this study, we present Q-adaptive routing, a multi-agent reinforcement learning routing scheme for Dragonfly systems. Q-adaptive routing enables routers to learn to route autonomously by leveraging advanced reinforcement learning technology. The proposed Q-adaptive routing is highly scalable thanks to its fully distributed nature: routers share no information with one another. Furthermore, a new two-level Q-table is designed for Q-adaptive that makes it computationally lightweight and saves 50% of router memory compared with the previous Q-routing. We implement the proposed Q-adaptive routing in the SST/Merlin simulator. Our evaluation results show that Q-adaptive routing achieves up to 10.5% system throughput improvement and 5.2x average packet latency reduction compared with adaptive routing algorithms. Remarkably, Q-adaptive can even outperform the optimal VALn non-minimal routing under the ADV+1 adversarial traffic pattern, with up to 3% system throughput improvement and 75% average packet latency reduction.
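
For context, the "previous Q-routing" mentioned in the abstract is usually described by a per-router update in which each router keeps, for every destination and neighbor, an estimate of the remaining delivery time and refines it from feedback returned by the chosen neighbor (the classic formulation by Boyan and Littman). The sketch below illustrates only that baseline update under this assumption; it is not the two-level Q-table proposed in this paper, and all names (`QTable`, `choose_next_hop`, `q_routing_update`, `alpha`) are hypothetical.

```python
from typing import Dict, Hashable

# Hypothetical layout: q[dest][neighbor] = estimated delivery time to dest via neighbor.
QTable = Dict[Hashable, Dict[Hashable, float]]


def best_estimate(q: QTable, dest: Hashable) -> float:
    """Smallest estimated remaining delivery time to dest over all neighbors."""
    return min(q[dest].values())


def choose_next_hop(q: QTable, dest: Hashable) -> Hashable:
    """Greedy choice: the neighbor with the lowest estimated delivery time."""
    return min(q[dest], key=q[dest].get)


def q_routing_update(
    q: QTable,
    dest: Hashable,
    neighbor: Hashable,
    queue_delay: float,
    link_delay: float,
    neighbor_estimate: float,
    alpha: float = 0.5,  # learning rate (assumed value, for illustration only)
) -> float:
    """Move q[dest][neighbor] toward the newly observed estimate.

    neighbor_estimate is the neighbor's own best estimate to dest,
    i.e. best_estimate(neighbor_table, dest), sent back as feedback.
    """
    target = queue_delay + link_delay + neighbor_estimate
    q[dest][neighbor] += alpha * (target - q[dest][neighbor])
    return q[dest][neighbor]


# Usage: a router with two neighbors, A and B, routing to destination D.
q: QTable = {"D": {"A": 10.0, "B": 12.0}}
first_choice = choose_next_hop(q, "D")  # A currently looks faster
# Feedback shows the path through A is slower than believed.
updated = q_routing_update(q, "D", "A", queue_delay=6.0, link_delay=1.0,
                           neighbor_estimate=9.0)
second_choice = choose_next_hop(q, "D")  # the router now prefers B
```

Because every router updates only its own table from its neighbors' replies, the scheme needs no global state, which is the property the abstract refers to as "fully distributed".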

Paper Structure (18 sections, 3 equations, 9 figures, 3 tables)

Introduction
Background and related work
Dragonfly Topology
Dragonfly Routing
Reinforcement Learning based Routing
Q-routing
Issues of Q-Routing on Dragonfly
Technical Challenges
Q-adaptive routing
Evaluation
Experimental Setup
Evaluation metric
Routing under Different Loads
Tail Latency
Convergence
...and 3 more sections

Figures (9)

Figure 1: Issue of existing adaptive routing on Dragonfly, where local information fails to estimate global path congestion. Although path2 is the least congested path, existing adaptive routing methods typically choose path1 over path2.
Figure 2: Dragonfly Topology
Figure 3: Local link congestion under ADV+4. In this case, G1 sends packets to G5, G2 to G6, etc. When the packets are routed non-minimally, the local link between router 1 and router 2 in the intermediate group becomes the bottleneck (red arrow).
Figure 4: Flow chart for Q-adaptive routing. Dest.G and Int.G stand for destination group and intermediate group respectively.
Figure 5: Q-adaptive on the 1056-node Dragonfly
...and 4 more figures
