A Deep Reinforcement Learning Framework for Optimizing Congestion Control in Data Centers

Shiva Ketabi; Hongkai Chen; Haiwei Dong; Yashar Ganjali

TL;DR

This work uses multi-agent reinforcement learning to dynamically tune congestion control parameters at end-hosts in a data center, a design with the potential to mitigate the problems of static parameters.

Abstract

Various congestion control protocols have been designed to achieve high performance in different network environments. Modern online learning solutions that delegate congestion control actions to a machine cannot properly converge within the stringent time scales of data centers. We leverage multi-agent reinforcement learning to design a system for dynamic tuning of congestion control parameters at end-hosts in a data center. The system includes agents at the end-hosts that monitor and report the network and traffic states, and agents that run the reinforcement learning algorithm given those states. Based on the state of the environment, the system generates congestion control parameters that optimize network performance metrics such as throughput and latency. As a case study, we examine BBR, a prominent, recently developed congestion control protocol. Our experiments demonstrate that the proposed system has the potential to mitigate the problems of static parameters.

Paper Structure (15 sections, 5 equations, 4 figures, 2 algorithms)

Introduction
Related Work
Traditional Congestion Control
Reinforcement Learning for Congestion Control
System Architecture
Design Principles and Considerations
System Design
Problem Formulation
Optimization Problem
Deep Reinforcement Learning Controller
System Prototyping and Evaluation
Case Study: BBR
Proof-of-Concept
Experimental Results
Conclusion

Figures (4)

Figure 1: The system architecture and components. On the top left, an example of a data center topology is illustrated as our environment. In the proposed architecture, host servers are added to the end-hosts and are connected to a number of RL agents via a communication channel. On the right side, we focus on the communication between the components of the host servers with the RL agents, and the details of the RL formulation. On the bottom left, the figure zooms in on the network between the multiple RL agents.
Figure 2: BBR case study.
Figure 3: Comparison of the vanilla BBR, PPO-BBR, DQN-BBR, and A2C-BBR in terms of (a) estimated RTT, (b) CDF of the error between the estimated RTT and latency, (c) throughput, and (d) CDF of the throughput.
Figure 4: Multiple flows joining and leaving. Convergence of throughput and RTT for (a) the vanilla BBR and (b) PPO-BBR.
