Characterizing TCP's Performance for Low-Priority Flows Inside a Cloud

Hafiz Mohsin Bashir, Abdullah Bin Faisal, Fahad R. Dogar

TL;DR

This paper investigates whether TCP is suitable for low-priority traffic in cloud environments that implement network prioritization. It is an empirical study that combines a small-scale testbed with NS3 simulations and compares TCP against a near-optimal baseline (Near-Opt) across key use-cases, workloads, and configurations. For several popular use-cases, such as network scheduling, TCP stays within 2x of Near-Opt, but its performance degrades severely when high-priority traffic exhibits on-off behavior. The authors identify spurious timeouts and feedback-convergence issues as the primary failure modes for low-priority flows, and show that two simple mitigations, weighted fair queuing (WFQ) and cross-queue congestion notification (CQCN), can substantially improve low-priority completion times. The work also discusses broader implications for data center transport design and suggests integrating stateful network information into end-host congestion control to improve performance under prioritization.

Abstract

Many cloud systems utilize low-priority flows to achieve various performance objectives (e.g., low latency, high utilization), relying on TCP as their preferred transport protocol. However, the suitability of TCP for such low-priority flows is relatively unexplored. In particular, it is unclear how prioritization-induced delays in packet transmission can cause spurious timeouts and low utilization. In this paper, we conduct an empirical study to investigate the performance of TCP for low-priority flows under a wide range of realistic scenarios: use-cases (with accompanying workloads) where the performance of low-priority flows is crucial to the functioning of the overall system, as well as various network loads and other network parameters. Our findings yield two key insights: 1) for several popular use-cases (e.g., network scheduling), TCP's performance for low-priority flows is within 2x of a near-optimal scheme; 2) for emerging workloads that exhibit an on-off behavior in the high-priority queue (e.g., distributed ML model training), TCP's performance for low-priority flows is poor. Finally, we discuss and conduct a preliminary evaluation to show that two simple strategies, weighted fair queuing (WFQ) and cross-queue congestion notification, can substantially improve TCP's performance for low-priority flows.
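
The spurious-timeout failure mode mentioned in the abstract can be illustrated with a toy slot-based model of a single link. This is only a sketch under stated assumptions, not the paper's testbed or simulation setup: the link sends one packet per slot, a high-priority burst keeps its queue backlogged for `burst_slots` slots, and one low-priority packet arrives just as the burst starts. WFQ is approximated here by weighted round robin (labelled "wrr"), and the function names, weights, and the RTO value are all hypothetical.

```python
def low_priority_wait(burst_slots, scheduler, w_hi=3, w_lo=1):
    """Slots a low-priority packet waits if it arrives as a high-priority burst begins.

    scheduler: "strict" for strict priority queuing, or "wrr" for weighted
    round robin (a simple stand-in for WFQ; an assumption of this sketch).
    """
    if scheduler not in ("strict", "wrr"):
        raise ValueError("scheduler must be 'strict' or 'wrr'")
    hi_backlog = burst_slots  # high-priority packets still queued
    slot = 0
    while True:
        # Once the high-priority queue drains, the low-priority packet is sent.
        if hi_backlog == 0:
            return slot
        # Under WRR, the last w_lo slots of every (w_hi + w_lo)-slot cycle
        # are reserved for the low-priority queue, even during the burst.
        if scheduler == "wrr" and slot % (w_hi + w_lo) >= w_hi:
            return slot
        hi_backlog -= 1  # this slot is spent on a high-priority packet
        slot += 1


def is_spurious_timeout(wait_slots, rto_slots):
    """True if the packet is merely delayed (not lost) for longer than the RTO."""
    return wait_slots > rto_slots


# Illustrative values only; none of these numbers come from the paper.
strict_wait = low_priority_wait(1000, "strict")
wrr_wait = low_priority_wait(1000, "wrr")
```

With these illustrative values, strict priority holds the low-priority packet for the whole 1000-slot burst, while the round-robin stand-in serves it after 3 slots. For a hypothetical RTO of 200 slots, `is_spurious_timeout` is true only in the strict-priority case: the sender would retransmit a packet that was never dropped.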

Paper Structure

This paper contains 43 sections, 11 figures, 2 tables.

Figures (11)

  • Figure 1: TCP assumes fair scheduling inside the network, an assumption that fails under priority scheduling. (a) depicts TCP's view of the network, in which all packets from low-priority (blue) and high-priority (red) senders share the same queue. (b) shows that, in reality, packets from different priority classes are stored in their respective priority queues and serviced in priority order.
  • Figure 2: Priority queuing has a substantial impact on TCP's performance for low-priority flows. (a) shows a significant increase in the retransmission rate experienced by low-priority flows under priority queues compared to fair share. (b) shows the impact of priority queuing on the flow completion times (FCT) of low-priority flows under TCP compared with Near-Opt (see the main evaluation section).
  • Figure 3: TCP's performance for low-priority flows under the network scheduling scenario. (a) shows that, for the DAS workload, TCP's performance for flows is insensitive to flow size. (b) and (c) show TCP's performance for long flows under the shortest job first (SJF) policy for the web-search and data-mining workloads. Under SJF, long flows (larger than 1 MB) are assigned low priority.
  • Figure 4: TCP's performance under the workload co-location scenario across different loads, flow sizes, and ML-model gradient update sizes in the high-priority queue. (a) shows TCP's performance for low-priority flows across different loads when high-priority traffic exhibits on-off behavior caused by 4 MB gradient updates. (b) evaluates TCP across different flow sizes for the same workload at high load. (c) compares TCP's performance across different gradient update sizes.
  • Figure 5: TCP's performance for the Hybrid Services use-case at 80% total load (high-priority plus low-priority). (a) shows TCP's performance at 30% load in the low-priority queue across different flow sizes; here, TCP's performance improves as the flow size decreases. (b) shows TCP's performance for 128 KB flows across different load settings for the low-priority queue; as load in the low-priority queue increases, the performance gap between TCP and Near-Opt widens.
  • ...and 6 more figures