FNCC: Fast Notification Congestion Control in Data Center Networks

Jing Xu, Zhan Wang, Fan Yang, Ning Kang, Zhenlong Ma, Guojun Yuan, Guangming Tan, Ninghui Sun

TL;DR

FNCC addresses slow congestion response in data-center networks by delivering in-network telemetry via return-path ACKs, achieving sub-RTT notification and faster rate adjustment. It adds a last-hop concurrency signal so receivers inform senders of concurrent congested flows, enabling quick convergence to a fair rate. The design comprises CP at switches, RP at senders, and ACK generation at receivers, with practical hardware feasibility demonstrated and significant improvements in queue depth, pause-frame reduction, utilization, and flow completion time across 100–400 Gbps tests and large-scale simulations. This approach offers a practical, scalable enhancement to HPCC/DCQCN for RoCEv2-based data centers, reducing tail latency and improving fairness under diverse congestion scenarios.

Abstract

Congestion control plays a pivotal role in large-scale data centers, facilitating ultra-low latency, high bandwidth, and optimal utilization. Even with the deployment of data center congestion control mechanisms such as DCQCN and HPCC, these algorithms often respond to congestion sluggishly. This sluggishness is primarily due to the slow notification of congestion. It takes almost one round-trip time (RTT) for the congestion information to reach the sender. In this paper, we introduce the Fast Notification Congestion Control (FNCC) mechanism, which achieves sub-RTT notification. FNCC leverages the acknowledgment packet (ACK) from the return path to carry in-network telemetry (INT) information of the request path, offering the sender more timely and accurate INT. To further accelerate the responsiveness of last-hop congestion control, we propose that the receiver notifies the sender of the number of concurrent congested flows, which can be used to adjust the congested flows to a fair rate quickly. Our experimental results demonstrate that FNCC reduces flow completion time by 27.4% and 88.9% compared to HPCC and DCQCN, respectively. Moreover, FNCC triggers minimal pause frames and maintains high utilization even at 400Gbps.

Paper Structure (28 sections, 6 equations, 15 figures, 3 algorithms)

Figures (15)

  • Figure 1: (a) Hardware trends in NVIDIA's top data center switches. Switch capacity and link speeds have grown rapidly, but buffer sizes cannot keep up with switch capacity growth. (b) to (d) Across different link rates, HPCC and DCQCN produce deeper queues than FNCC.
  • Figure 2: Notification scheme of HPCC and FNCC. Assuming congestion arises at switch1 at time t, HPCC's congestion information is not directly relayed to the sender.
  • Figure 3: The count of pause frames at the congestion point. HPCC and DCQCN generate more pause frames than FNCC at both 200 Gbps and 400 Gbps.
  • Figure 4: The framework of the HPCC and FNCC designs. In HPCC, In-Network Telemetry (INT) must be added to each data packet; the target end host generates an ACK containing all the INT information and sends it back to the sender. In FNCC, the switch adds INT only to the ACK, so the ACK reaches the sender faster and provides more timely INT information. To further speed up last-hop congestion control, FNCC's receiver notifies the sender of the number of concurrent congested flows (N), which lets FNCC's sender quickly adjust the flow rate to a fair rate.
  • Figure 5: Symmetric route table. When the SDN controller builds the routing table, it should ensure that the data packet forwarding paths are in the same order as the ACK packet forwarding paths. Since the data packet and its corresponding ACK packet share the same five-tuple values, they will possess identical hash values. With a symmetric routing table in place, these packets will select the same path.
  • ...and 10 more figures
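
The path symmetry described in the Figure 5 caption can be illustrated with a small sketch. It is not the paper's implementation: the `FiveTuple` type, the `symmetric_key` and `pick_uplink` helpers, the SHA-256 hash, and the example addresses are all hypothetical. It assumes the ACK carries the mirrored addresses and ports of its data packet, and shows one way a switch could still compute identical hash values for both directions by ordering the two endpoints before hashing.

```python
import hashlib
from typing import NamedTuple


class FiveTuple(NamedTuple):
    """Hypothetical flow identifier; field names are illustrative."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: int


def symmetric_key(ft: FiveTuple) -> bytes:
    """Build a direction-independent key by sorting the two endpoints."""
    endpoints = sorted([(ft.src_ip, ft.src_port), (ft.dst_ip, ft.dst_port)])
    (ip_a, port_a), (ip_b, port_b) = endpoints
    return f"{ip_a}:{port_a}|{ip_b}:{port_b}|{ft.protocol}".encode()


def pick_uplink(ft: FiveTuple, num_uplinks: int) -> int:
    """Map a flow to one of num_uplinks equal-cost ports (assumed ECMP-style)."""
    digest = hashlib.sha256(symmetric_key(ft)).digest()
    return int.from_bytes(digest[:4], "big") % num_uplinks


def reverse(ft: FiveTuple) -> FiveTuple:
    """Five-tuple of the return direction (assumed to be the ACK's)."""
    return FiveTuple(ft.dst_ip, ft.src_ip, ft.dst_port, ft.src_port, ft.protocol)


data_pkt = FiveTuple("10.0.0.1", "10.0.1.2", 49152, 4791, 17)
ack_pkt = reverse(data_pkt)

# Both directions select the same uplink index.
same_choice = pick_uplink(data_pkt, 8) == pick_uplink(ack_pkt, 8)
print("same uplink for data and ACK:", same_choice)
```

Matching hash values alone do not guarantee a shared path. As the caption notes, the routing tables must also list the candidate next hops in a consistent order, so that the same index refers to the same physical link in both directions.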