
SIRD: A Sender-Informed, Receiver-Driven Datacenter Transport Protocol

Konstantinos Prasopoulos, Ryan Kosta, Edouard Bugnion, Marios Kogias

TL;DR

SIRD tackles the datacenter congestion-control challenge under limited fabric buffering by separating the management of exclusive receiver downlinks from that of shared links, and by using congestion feedback to coordinate credit across senders. The method combines proactive downlink scheduling with reactive handling of sender uplinks and core bottlenecks via two AIMD control loops and a two-bucket credit system, all implemented end-to-end on the Caladan stack and capable of 100Gbps in software. Key contributions include the concept of informed overcommitment, a detailed receiver-side credit and pacing algorithm, and an extensive evaluation showing higher throughput, minimal buffering, and low tail latency compared with state-of-the-art receiver-driven and reactive protocols. The results suggest that SIRD can deliver practical, high-performance datacenter transport without specialized switch features or in-network QoS, improving efficiency in modern, heterogeneous fabrics.
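The receiver-side interplay between a global credit budget and two per-sender AIMD limits can be sketched in Python. This is a minimal illustration of one plausible reading of the two-bucket scheme, not SIRD's implementation: the names (`Receiver`, `SenderState`, `try_grant`, `on_data`), the 1.5 × BDP receiver budget, the bucket floor and ceiling, and the AIMD step sizes are all hypothetical, and credit is measured in BDP units. One per-sender limit reacts to the sender's own congestion flag (its uplink), the other to ECN-style marks from the core, and the receiver grants credit only while both its total budget and the tighter of the two limits allow it.

```python
from dataclasses import dataclass
from typing import Dict

# All credit quantities are in units of one bandwidth-delay product (BDP).
# Every constant below is an illustrative assumption, not a published SIRD parameter.
RECEIVER_BUDGET = 1.5   # total credit a receiver may have outstanding
BUCKET_MAX = 1.0        # ceiling for each per-sender bucket
BUCKET_MIN = 0.05       # floor so a congested sender is never fully starved
ADD_STEP = 0.05         # additive increase per uncongested update
MULT_FACTOR = 0.5       # multiplicative decrease on a congestion signal


def aimd(limit: float, congested: bool) -> float:
    """One AIMD step: shrink multiplicatively on congestion, else grow additively."""
    if congested:
        return max(BUCKET_MIN, limit * MULT_FACTOR)
    return min(BUCKET_MAX, limit + ADD_STEP)


@dataclass
class SenderState:
    outstanding: float = 0.0            # credit granted but not yet returned as data
    sender_bucket: float = BUCKET_MAX   # loop 1: driven by the sender's uplink feedback
    network_bucket: float = BUCKET_MAX  # loop 2: driven by ECN-style core feedback

    def limit(self) -> float:
        """The tighter of the two per-sender buckets bounds outstanding credit."""
        return min(self.sender_bucket, self.network_bucket)


class Receiver:
    def __init__(self, budget: float = RECEIVER_BUDGET):
        self.budget = budget
        self.senders: Dict[str, SenderState] = {}

    def total_outstanding(self) -> float:
        return sum(s.outstanding for s in self.senders.values())

    def try_grant(self, sender: str, chunk: float) -> bool:
        """Grant `chunk` credit only if the receiver budget and the sender's limit allow it."""
        state = self.senders.setdefault(sender, SenderState())
        if self.total_outstanding() + chunk > self.budget:
            return False  # receiver-wide overcommitment cap reached
        if state.outstanding + chunk > state.limit():
            return False  # this sender, or the path to it, signals congestion
        state.outstanding += chunk
        return True

    def on_data(self, sender: str, amount: float,
                sender_congested: bool, ecn_marked: bool) -> None:
        """A data arrival returns credit and feeds both AIMD loops."""
        state = self.senders.setdefault(sender, SenderState())
        state.outstanding = max(0.0, state.outstanding - amount)
        state.sender_bucket = aimd(state.sender_bucket, sender_congested)
        state.network_bucket = aimd(state.network_bucket, ecn_marked)
```

With these made-up constants, a sender that flags congestion has its limit halved on the next data arrival, so budget freed at the receiver is steered toward an uncongested sender instead. This roughly conveys the intuition of overcommitting only where feedback says the credit can actually be used.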

Abstract

Datacenter congestion control protocols are challenged to navigate the throughput-buffering trade-off while relative packet buffer capacity is trending lower year-over-year. In this context, receiver-driven protocols -- which schedule packet transmissions instead of reacting to congestion -- excel when the bottleneck lies at the ToR-to-receiver link. However, when multiple receivers must use a shared link (e.g., ToR to Spine), their independent schedules can conflict. We present SIRD, a receiver-driven congestion control protocol designed around the simple insight that single-owner links should be scheduled, while shared links should be managed with reactive control algorithms. The approach allows receivers to both precisely schedule their downlinks and to coordinate over shared bottlenecks. Critically, SIRD also treats sender uplinks as shared links, enabling the flow of congestion feedback from senders to receivers, which then adapt their scheduling to each sender's real-time capacity. This results in tight scheduling, enabling high bandwidth utilization with little contention, and thus minimal latency-inducing buffering in the fabric. We implement SIRD on top of the Caladan stack and show that SIRD's asymmetric design can deliver 100Gbps in software while keeping network queuing minimal. We further compare SIRD to state-of-the-art receiver-driven protocols (Homa, dcPIM, and ExpressPass) and production-grade reactive protocols (Swift and DCTCP) and show that SIRD is uniquely able to simultaneously maximize link utilization, minimize queuing, and obtain near-optimal latency.
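The sender half of this feedback path can be illustrated with a small Python sketch. It assumes, hypothetically, that a sender tracks unspent credit per receiver and flags its outgoing data whenever the total exceeds a fixed threshold (0.5 BDP here, an arbitrary value): credit piling up at a sender means its uplink cannot drain what receivers have collectively granted, which is the quantity Figure 4 plots for a congested sender. The class name, method names, and threshold are invented for illustration; the paper's exact signal may differ.

```python
from collections import defaultdict
from typing import Tuple

# Hypothetical threshold, in BDP units: above it, credit is arriving faster
# than the sender's uplink can spend it.
SENDER_THRESHOLD = 0.5


class Sender:
    def __init__(self, threshold: float = SENDER_THRESHOLD):
        self.threshold = threshold
        self.credit = defaultdict(float)  # unspent credit, keyed by receiver

    def on_credit(self, receiver: str, amount: float) -> None:
        """Record credit granted by a receiver."""
        self.credit[receiver] += amount

    def congested(self) -> bool:
        """True if total unspent credit across all receivers exceeds the threshold."""
        return sum(self.credit.values()) > self.threshold

    def send(self, receiver: str, amount: float) -> Tuple[float, bool]:
        """Spend up to `amount` of this receiver's credit on one data packet.

        Returns (amount actually sent, congestion flag carried by the packet);
        the flag reflects the backlog left after this packet is sent.
        """
        spend = min(amount, self.credit[receiver])
        self.credit[receiver] -= spend
        return spend, self.congested()
```

A receiver that sees the flag can then shrink the credit it extends to that sender, in line with the abstract's description of receivers adapting their scheduling to each sender's real-time capacity.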

Paper Structure

This paper contains 23 sections, 3 equations, 31 figures, 5 tables, and 2 algorithms.

Figures (31)

  • Figure 1: Homa queuing CDFs under various network loads for the Websearch (pFabric) workload. The dotted lines represent the switch buffer size adjusted to the actual radix of our simulated ToR; see the evaluation methodology section.
  • Figure 2: Mean buffering at ToRs versus maximum achieved goodput when sweeping the overcommitment parameter for SIRD (informed overcommitment) and Homa (controlled overcommitment). Results obtained in simulation by running the Websearch workload at 95Gbps on 100Gbps links across 144 servers; see the evaluation methodology section.
  • Figure 3: Incast: CDF of message latency under incast compared to an unloaded baseline for 8B requests (left) and 500KB requests (right). The incast is formed by six senders transmitting 10MB messages in an open loop.
  • Figure 4: Left: credit accumulated at the congested sender. Right: sum of credit available at the three receivers (initial total: $3\times 1.5 = 4.5\times$ BDP). 100ms moving average. The circled numbers indicate the number of receivers at each stage of the experiment.
  • Figure 5: Normalized goodput, queuing, and slowdown across all 9 configurations. Each metric is normalized based on the best-performing protocol for the given metric and configuration. For queuing and slowdown, lower is better. For goodput, higher is better. Normalized slowdown and buffering are capped at $10\times$ and $200\times$, respectively, and higher values are plotted in the overflow area. The numbers in parentheses show the number of unstable configurations for each protocol, which are not plotted. X-axis jitter is added for visibility. The underlying data are listed in the normalized-results table.
  • ...and 26 more figures
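The per-metric normalization described in the Figure 5 caption can be expressed as a short Python function. The caps (10× for slowdown, 200× for queuing) and the lower-/higher-is-better directions come from the caption; the function name, the use of `None` for unstable runs, clamping as a stand-in for the overflow area, and the requirement that raw values be positive are assumptions of this sketch. The protocol labels and numbers in the example are made up.

```python
from typing import Dict, Optional

# Caps and directions taken from the Figure 5 caption.
CAPS = {"slowdown": 10.0, "queuing": 200.0}
LOWER_IS_BETTER = {"slowdown", "queuing"}


def normalize(metric: str, values: Dict[str, Optional[float]]) -> Dict[str, float]:
    """Normalize one metric in one configuration against the best protocol.

    `values` maps protocol -> raw value; None marks an unstable run, which is
    dropped (not plotted). Raw values are assumed to be positive.
    """
    stable = {p: v for p, v in values.items() if v is not None}
    if metric in LOWER_IS_BETTER:
        best = min(stable.values())
        cap = CAPS[metric]
        # Ratios are >= 1; anything above the cap is clamped (the "overflow area").
        return {p: min(v / best, cap) for p, v in stable.items()}
    # Higher is better (goodput): ratios are <= 1, with the best protocol at 1.0.
    best = max(stable.values())
    return {p: v / best for p, v in stable.items()}


# Hypothetical example: P3 was unstable in this configuration.
print(normalize("goodput", {"P1": 80.0, "P2": 40.0, "P3": None}))  # {'P1': 1.0, 'P2': 0.5}
print(normalize("slowdown", {"P1": 2.0, "P2": 50.0}))  # {'P1': 1.0, 'P2': 10.0}
```

Normalizing per configuration rather than globally lets protocols be compared across workloads whose absolute latencies and buffer occupancies differ by orders of magnitude.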