Credence: Augmenting Datacenter Switch Buffer Sharing with ML Predictions

Vamsi Addanki, Maciej Pacut, Stefan Schmid

TL;DR

Credence addresses the pressure of shrinking buffer sizes in datacenter switches by augmenting a practical drop-tail buffer sharing algorithm with machine-learned predictions. It combines queue-length thresholds with predictions of the packets that a push-out algorithm would drop. With perfect predictions, Credence emulates the performance of Longest Queue Drop (LQD); under arbitrarily poor predictions it falls back to the performance of the Complete Sharing baseline, and it degrades smoothly in between as prediction error grows. The paper gives formal guarantees in the form of a bound of min(1.707 · η, N) on the throughput competitive ratio, where η is the prediction error and N is the number of ports. It also reports empirical gains (up to 1.5x throughput and up to 95% improvement in flow completion times) on realistic datacenter workloads, using ns-3 simulations and a lightweight random forest (RF) predictor. The work offers a practical, hardware-conscious path toward using predictions in network dataplanes and outlines future directions in both systems and theory for buffer sharing in congested datacenter environments.
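To make the quoted bound concrete, the following minimal Python sketch evaluates min(1.707 · η, N) for a given prediction error and port count. The function names `ratio_bound` and `crossover_eta` are hypothetical and chosen here for illustration only. The sketch assumes that η = 1 corresponds to perfect predictions, which is consistent with the bound reducing to 1.707 in that case but is not stated in this summary.

```python
LQD_FACTOR = 1.707  # constant quoted in the TL;DR bound


def ratio_bound(eta: float, num_ports: int) -> float:
    """Evaluate the quoted bound min(1.707 * eta, N).

    eta: prediction error (assumption: eta = 1 means perfect predictions).
    num_ports: number of switch ports N.
    """
    return min(LQD_FACTOR * eta, float(num_ports))


def crossover_eta(num_ports: int) -> float:
    """Error level at which the two terms of the bound are equal."""
    return num_ports / LQD_FACTOR


if __name__ == "__main__":
    for eta in (1.0, 2.0, 4.0, 8.0):
        print(eta, ratio_bound(eta, num_ports=8))
```

Under these assumptions, the prediction-dependent term is the smaller one only while η stays below N / 1.707; beyond that point the bound is capped at N.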

Abstract

Packet buffers in datacenter switches are shared across all the switch ports in order to improve the overall throughput. The trend of shrinking buffer sizes in datacenter switches makes buffer sharing extremely challenging and a critical performance issue. Literature suggests that push-out buffer sharing algorithms have significantly better performance guarantees compared to drop-tail algorithms. Unfortunately, switches are unable to benefit from these algorithms due to lack of support for push-out operations in hardware. Our key observation is that drop-tail buffers can emulate push-out buffers if the future packet arrivals are known ahead of time. This suggests that augmenting drop-tail algorithms with predictions about the future arrivals has the potential to significantly improve performance. This paper is the first research attempt in this direction. We propose Credence, a drop-tail buffer sharing algorithm augmented with machine-learned predictions. Credence can unlock the performance only attainable by push-out algorithms so far. Its performance hinges on the accuracy of predictions. Specifically, Credence achieves near-optimal performance of the best known push-out algorithm LQD (Longest Queue Drop) with perfect predictions, but gracefully degrades to the performance of the simplest drop-tail algorithm Complete Sharing when the prediction error gets arbitrarily worse. Our evaluations show that Credence improves throughput by $1.5$x compared to traditional approaches. In terms of flow completion times, we show that Credence improves upon the state-of-the-art approaches by up to $95\%$ using off-the-shelf machine learning techniques that are also practical in today's hardware. We believe this work opens several interesting future work opportunities both in systems and theory that we discuss at the end of this paper.
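The gap between drop-tail and push-out that the abstract refers to can be illustrated with a toy simulation of the two baselines it names, Complete Sharing and LQD. This is a minimal sketch and not the paper's Credence algorithm or evaluation setup. It assumes unit-size packets, slotted time in which arrivals are processed before each non-empty queue transmits one packet, and arbitrary tie-breaking among equally long queues. The function name `simulate` and the arrival pattern are hypothetical.

```python
from typing import Sequence


def simulate(arrivals: Sequence[Sequence[int]], num_ports: int,
             buffer_size: int, push_out: bool) -> int:
    """Return the number of packets eventually transmitted.

    arrivals[t] lists the destination port of each packet arriving in slot t.
    push_out=False models Complete Sharing (accept whenever the shared
    buffer has free space, otherwise drop the arriving packet).
    push_out=True models LQD (always admit, then evict one packet from the
    longest queue if the buffer was already full).
    """
    queues = [0] * num_ports  # per-port queue lengths
    transmitted = 0
    for slot in arrivals:
        # Arrival phase.
        for port in slot:
            if sum(queues) < buffer_size:
                queues[port] += 1
            elif push_out:
                queues[port] += 1
                longest = max(range(num_ports), key=lambda p: queues[p])
                queues[longest] -= 1  # push out from the longest queue
            # else: drop-tail, the arriving packet is dropped
        # Departure phase: each non-empty queue sends one packet.
        for p in range(num_ports):
            if queues[p] > 0:
                queues[p] -= 1
                transmitted += 1
    # Packets still buffered at the end are eventually transmitted.
    return transmitted + sum(queues)


if __name__ == "__main__":
    # A burst of 4 packets to port 0, then one packet per port per slot.
    arrivals = [[0, 0, 0, 0], [0, 1], [0, 1], [0, 1]]
    print(simulate(arrivals, num_ports=2, buffer_size=4, push_out=False))  # 7
    print(simulate(arrivals, num_ports=2, buffer_size=4, push_out=True))   # 9
```

In this example Complete Sharing lets the initial burst fill the buffer and then starves port 1, while LQD pushes out a single packet of the burst. A drop-tail policy that knew the later arrivals could have dropped that one burst packet on arrival and reached the same throughput, which is the intuition behind the key observation above.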

Paper Structure (24 sections, 7 theorems, 15 equations, 16 figures, 1 table, 2 algorithms)

Key Result

Lemma 1

The total number of packets transmitted by Credence for an arrival sequence $\sigma$, a drop sequence $\phi$ by LQD and the predicted drop sequence $\phi^\prime$ is given by

Figures (16)

  • Figure 1: Augmenting drop-tail buffer sharing with ML predictions has the potential to significantly improve throughput compared to the best possible drop-tail algorithm (without predictions), and unlock the performance that was only attainable by push-out so far.
  • Figure 2: The switch has a buffer size of $B$ shared across $N$ output ports. Each color indicates the packets residing in the shared buffer corresponding to each port. A buffer sharing algorithm takes decisions (accept or drop) for each input packet.
  • Figure 3: Upon a large burst arrival, a typical drop-tail algorithm (ALG) proactively drops the incoming packets in anticipation of future bursts and significantly under-utilizes the buffer. In this case, an optimal offline algorithm accepts the entire burst without any packet drops.
  • Figure 4: In pursuit of high burst absorption, a drop-tail algorithm ALG may absorb bursts but this results in excessive reactive drops for the future packet arrivals. In this case, an optimal offline algorithm OPT drops few packets such that the overall throughput is maximized.
  • Figure 5: Confusion matrix for our prediction model.
  • ...and 11 more figures

Theorems & Definitions (17)

  • Definition 1: Error function
  • Lemma 1
  • Lemma 2
  • Theorem 1
  • Definition 2: Preemptive buffer sharing
  • Definition 3: Non-preemptive buffer sharing
  • Definition 4: Competitive ratio
  • proof
  • Definition 4: Error function
  • Theorem 1
  • ...and 7 more