Improving Injection-Throttling Mechanisms for Congestion Control for Data-center and Supercomputer Interconnects

Cristina Olmedilla, Jesus Escudero-Sahuquillo, Pedro J. Garcia, Francisco J. Quiles, Jose Duato

TL;DR

The paper addresses congestion control in high-performance interconnects for data centers and supercomputers, where growing AI/HPC workloads stress network buffers. It revisits the DCQCN mechanism and proposes three enhancements: Enhanced Congestion Point (ECP), Enhanced NP (ENP, notification point), and Enhanced RP (ERP, reaction point). Together they aim to improve detection accuracy, signaling efficiency, and throttling precision. The authors implement these mechanisms in a simulation framework and report higher throughput and less control traffic on a 64-node, 3-stage CLOS topology with 100 Gbps links. The results indicate better protection of victim flows and fair sharing among congesting flows, which mitigates head-of-line blocking and avoids unnecessary throttling.

Abstract

Over the past decade, supercomputers and data centers have evolved dramatically to cope with the increasing performance requirements of applications and services such as scientific computing, generative AI, social networks, and cloud services. This evolution has led these systems to incorporate high-speed networks with faster links, end nodes with multiple dedicated accelerators, and advances in memory technologies that address the memory bottleneck. The interconnection network is a key element in these systems, and it must be carefully designed so that it does not become the bottleneck of the entire system, bearing in mind the countless communication operations that current applications and services generate. Congestion is a serious threat that degrades interconnection network performance, and its effects are even more dramatic given the traffic dynamics and bottlenecks generated by those communication operations. In this vein, numerous congestion control (CC) techniques have been developed to address the negative effects of congestion. One popular example is Data Center Quantized Congestion Notification (DCQCN), which detects congestion at network switch buffers, marks the congesting packets, and notifies the sources of the congestion; the sources then throttle the injection of the packets that contribute to it. While DCQCN has been widely studied and improved, its main principles for congestion detection, notification, and reaction remain largely unchanged, which is an important shortcoming considering the congestion dynamics in current high-performance interconnection networks. In this paper, we revisit the DCQCN closed-loop mechanism and refine its design to achieve more accurate congestion detection, signaling, and injection throttling, reducing control traffic overhead and avoiding unnecessary throttling of non-congesting flows.
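To make the closed loop described in the abstract concrete, the following is a minimal sketch of the baseline DCQCN behavior as it is commonly described in the literature: RED-style ECN marking at the switch and multiplicative rate reduction at the source. It does not model the enhancements proposed in this paper. All names (`ecn_mark_probability`, `ReactionPoint`) and the threshold values are illustrative assumptions, not values taken from the paper; only the 100 Gbps line rate comes from the evaluation setup mentioned above.

```python
def ecn_mark_probability(queue_kb, k_min=5.0, k_max=200.0, p_max=0.01):
    """Congestion point (switch): probability of ECN-marking a packet.

    Thresholds are illustrative assumptions. Below k_min nothing is marked,
    above k_max everything is marked, and in between the probability grows
    linearly up to p_max.
    """
    if queue_kb <= k_min:
        return 0.0
    if queue_kb > k_max:
        return 1.0
    return p_max * (queue_kb - k_min) / (k_max - k_min)


class ReactionPoint:
    """Source-side rate limiter for one flow (baseline DCQCN, simplified)."""

    def __init__(self, line_rate_gbps=100.0, g=1.0 / 256.0):
        self.g = g                          # gain for the alpha estimate
        self.alpha = 1.0                    # estimated congestion severity
        self.current_rate = line_rate_gbps  # rate the flow injects at
        self.target_rate = line_rate_gbps   # rate to recover toward

    def on_cnp(self):
        """A congestion notification packet arrived: cut the rate."""
        self.target_rate = self.current_rate
        self.current_rate *= 1.0 - self.alpha / 2.0
        self.alpha = (1.0 - self.g) * self.alpha + self.g

    def on_alpha_timer(self):
        """No notification during the last timer period: decay alpha."""
        self.alpha *= 1.0 - self.g

    def on_fast_recovery(self):
        """Recovery step: move halfway back toward the target rate."""
        self.current_rate = (self.target_rate + self.current_rate) / 2.0


if __name__ == "__main__":
    rp = ReactionPoint()
    rp.on_cnp()
    print(rp.current_rate)   # rate after one notification
    rp.on_fast_recovery()
    print(rp.current_rate)   # rate after one recovery step
```

In this simplified model every flow whose packets get marked is throttled in the same way, whether or not it actually contributes to the congestion. That is the kind of unnecessary throttling of non-congesting flows the abstract says the refined design aims to avoid.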

Paper Structure

This paper contains 5 sections, 3 figures.

Figures (3)

  • Figure 1: 64-node interconnection network, CLOS topology with 3 stages.
  • Figure 2: Network throughput for the 64-node scenario.
  • Figure 3: Network bandwidth per flow for the 64-node scenario.