Implementing Reinforcement Learning Datacenter Congestion Control in NVIDIA NICs

Benjamin Fuhrer; Yuval Shpigelman; Chen Tessler; Shie Mannor; Gal Chechik; Eitan Zahavi; Gal Dalal

TL;DR

This work tackles congestion control (CC) in high-speed datacenter networks by replacing handcrafted heuristics with a reinforcement-learning CC policy that is distilled into a lightweight tree-based representation suitable for NIC deployment. The authors report a 500-fold reduction in inference time when moving from neural networks to decision trees, which enables real-time operation on NVIDIA ConnectX NICs and a live 64-host deployment. The distilled policy maintains or improves key metrics (goodput, latency, and packet loss) across diverse traffic patterns, beating DCQCN and Swift in many scenarios. The study also provides insight into the learned decision logic and confirms the feasibility of data-driven CC on programmable NICs, challenging the prior view that hand-tuned heuristics are necessary.

Abstract

As communication protocols evolve, datacenter network utilization increases. As a result, congestion is more frequent, causing higher latency and packet loss. Combined with the increasing complexity of workloads, manual design of congestion control (CC) algorithms becomes extremely difficult. This calls for the development of AI approaches to replace the human effort. Unfortunately, it is currently not possible to deploy AI models on network devices due to their limited computational capabilities. Here, we offer a solution to this problem by building a computationally light solution based on a recent reinforcement learning CC algorithm [arXiv:2207.02295]. We reduce the inference time of RL-CC by a factor of 500 by distilling its complex neural network into decision trees. This transformation enables real-time inference within the $\mu$-sec decision-time requirement, with a negligible effect on quality. We deploy the transformed policy on NVIDIA NICs in a live cluster. Compared to popular CC algorithms used in production, RL-CC is the only method that performs well on all benchmarks tested over a wide range of flow counts. It balances multiple metrics simultaneously: bandwidth, latency, and packet drops. These results suggest that data-driven methods for CC are feasible, challenging the prior belief that handcrafted heuristics are necessary to achieve optimal performance.
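To make the distillation step concrete, here is a minimal sketch of fitting a tree to mimic a fixed teacher policy and scoring it with a root-mean-square loss of the kind given in the Figure 5 caption. Everything in it is an assumption made for illustration: the `teacher` function is a toy stand-in for the neural-network policy, the input is one-dimensional, and a single greedy regression tree replaces whatever tree ensemble the authors actually trained.

```python
import math
import random

def teacher(x):
    """Toy stand-in for the NN policy f (hypothetical, not the paper's model)."""
    return math.tanh(3.0 * (x - 1.0))

def fit_tree(xs, ys, depth):
    """Greedily fit a 1-D regression tree; returns a leaf value or (threshold, left, right)."""
    n = len(xs)
    mean = sum(ys) / n
    if depth == 0 or n < 4:
        return mean
    pairs = sorted(zip(xs, ys))
    total = sum(y for _, y in pairs)
    total_sq = sum(y * y for _, y in pairs)
    left_sum = left_sq = 0.0
    best = None  # (sum of squared errors, split index)
    for i in range(1, n):
        y = pairs[i - 1][1]
        left_sum += y
        left_sq += y * y
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical inputs
        right_sum = total - left_sum
        right_sq = total_sq - left_sq
        sse = (left_sq - left_sum ** 2 / i) + (right_sq - right_sum ** 2 / (n - i))
        if best is None or sse < best[0]:
            best = (sse, i)
    if best is None:
        return mean
    i = best[1]
    threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
    lx, ly = zip(*pairs[:i])
    rx, ry = zip(*pairs[i:])
    return (threshold, fit_tree(lx, ly, depth - 1), fit_tree(rx, ry, depth - 1))

def predict(tree, x):
    """Walk from the root to a leaf using only threshold comparisons."""
    while isinstance(tree, tuple):
        threshold, left, right = tree
        tree = left if x <= threshold else right
    return tree

def rmse(tree, xs, ys):
    """Root-mean-square error between teacher labels ys and student predictions."""
    return math.sqrt(sum((y - predict(tree, x)) ** 2 for x, y in zip(xs, ys)) / len(xs))

random.seed(0)
xs = [random.uniform(0.0, 2.0) for _ in range(2000)]  # sampled inputs
ys = [teacher(x) for x in xs]                         # teacher labels y_i = f(x_i)
shallow = fit_tree(xs, ys, depth=1)
deep = fit_tree(xs, ys, depth=5)
print("RMSE depth 1:", rmse(shallow, xs, ys))
print("RMSE depth 5:", rmse(deep, xs, ys))
```

Inference in `predict` needs only a handful of comparisons per decision, which is the kind of property that makes tree-based students attractive on constrained hardware; the deeper tree tracks the teacher more closely at the cost of more nodes.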

Paper Structure (21 sections, 6 equations, 11 figures, 5 tables)

Introduction
Background and Problem Setup
Congestion Control
Existing State-of-the-Art for CC
Transmission Rate Modulation
Networking solutions based on AI
RL for CC
Design considerations for RL-CC
Deploying RL-CC
Limitations of Neural Networks in NICs
Network Quantization
Boosting Trees
Experiments
Live Cluster Setup
Many-to-One
...and 6 more sections

Figures (11)

Figure 1: An overview of the deployment process of RL-CC (reinforcement learning congestion control) in the real world. From left to right: (1) an RL policy is trained in simulation; (2) the neural network policy is distilled into a compute- and memory-efficient tree-based representation; and (3) the tree policy is deployed on ConnectX-6Dx NIC firmware and tested in a live datacenter with standard benchmark traffic patterns.
Figure 2: RL-CC training loop. Each flow is controlled by a separate copy of the same agent, so all flows share the same logic but each keeps its own local history. The agent interacts with the environment by multiplicatively increasing or decreasing the flow transmission rate (for visualization only, a single flow per NIC is drawn). The environment feedback is the RTT measurement per flow.
Figure 3: RL-CC and Swift, theory vs. practice: RTT inflation as a function of the number of flows. Curved lines show theoretical curves on the order of $\mathcal{O}(\sqrt{N})$. We plot the average RTT inflation per flow with 99% confidence intervals (vertical error bars). Error bars are small initially and grow as the number of flows increases.
Figure 4: Influence of RL-CC parameters on the bandwidth/latency tradeoff. The top plot shows the tradeoff when varying $\beta$; the bottom plot shows the effect of varying target. Lower $\beta$ correlates with lower latency, but the agent fails when $\beta$ is set too low. Similarly, the agent fails when target is set too low, while a value that is too high dramatically increases latency. We found that the optimal values are $\beta=1.5$ and $\text{target}=0.064$.
Figure 5: Model Distillation: An illustration of how we train the tree-based student policy $g$ to mimic the fixed NN-based policy $f$ by minimizing $L(y, g(x)) = \sqrt{\frac{1}{N}\sum_{i=1}^N(y_i - g(x_i))^2}$.
...and 6 more figures
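
The per-flow control loop described in the Figure 2 caption (one agent copy per flow, RTT feedback, multiplicative rate changes) might be sketched as follows. This is an illustrative sketch only: the `FlowAgent` class, the `toy_policy` threshold rule, the step size, the rate limits, and the exact update `rate *= 1 + step * action` are all assumptions, not the paper's policy or parameters.

```python
from collections import deque

class FlowAgent:
    """One agent copy per flow: shared policy logic, private RTT history (hypothetical)."""

    def __init__(self, policy, rate_gbps, history_len=4, step=0.1,
                 min_rate=0.1, max_rate=100.0):
        self.policy = policy                      # shared by all flows
        self.rate_gbps = rate_gbps                # current transmission rate
        self.history = deque(maxlen=history_len)  # local to this flow
        self.step = step                          # assumed maximum relative change
        self.min_rate = min_rate
        self.max_rate = max_rate

    def on_rtt(self, rtt_us, base_rtt_us):
        """Consume one RTT sample and apply a multiplicative rate update."""
        inflation = rtt_us / base_rtt_us
        self.history.append(inflation)
        # Clamp the policy output to [-1, 1] before scaling the rate.
        action = max(-1.0, min(1.0, self.policy(list(self.history))))
        self.rate_gbps *= 1.0 + self.step * action
        self.rate_gbps = max(self.min_rate, min(self.max_rate, self.rate_gbps))
        return self.rate_gbps

def toy_policy(history):
    """Hypothetical rule: back off when the latest RTT inflation is high, else probe."""
    return -1.0 if history[-1] > 1.2 else 1.0

# Two flows share the same policy function but keep separate histories and rates.
agents = [FlowAgent(toy_policy, rate_gbps=10.0) for _ in range(2)]
agents[0].on_rtt(rtt_us=20.0, base_rtt_us=10.0)  # congested flow decreases its rate
agents[1].on_rtt(rtt_us=11.0, base_rtt_us=10.0)  # uncongested flow increases its rate
print([round(a.rate_gbps, 3) for a in agents])
```

A multiplicative update changes the rate in proportion to its current value, so the same action produces a proportionally equal adjustment for slow and fast flows.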
