Characterizing TCP's Performance for Low-Priority Flows Inside a Cloud
Hafiz Mohsin Bashir, Abdullah Bin Faisal, Fahad R. Dogar
TL;DR
This paper investigates whether TCP is suitable for low-priority traffic in cloud environments that implement network prioritization. It uses an empirical study, combining a small-scale testbed and NS3 simulations, to compare TCP against a near-optimal baseline (Near-Opt) across key use-cases, workloads, and configurations. The study finds that TCP is within 2x of Near-Opt for several popular use-cases such as network scheduling, but that its performance degrades severely when high-priority traffic follows an on-off pattern. The authors identify spurious timeouts and feedback-convergence issues as the primary failure modes for low-priority flows and demonstrate that two simple mitigations, weighted fair queuing (WFQ) and cross-queue congestion notification (CQCN), can substantially improve low-priority completion times. The work highlights broader implications for data center transport design and suggests directions for integrating stateful network information into end-host congestion control to improve performance under prioritization.
Abstract
Many cloud systems use low-priority flows to achieve various performance objectives (e.g., low latency, high utilization), relying on TCP as their preferred transport protocol. However, the suitability of TCP for such low-priority flows is relatively unexplored. Specifically, it is not well understood how prioritization-induced delays in packet transmission can cause spurious timeouts and low utilization. In this paper, we conduct an empirical study of TCP's performance for low-priority flows under a wide range of realistic scenarios: use-cases (with accompanying workloads) in which the performance of low-priority flows is crucial to the functioning of the overall system, as well as various network loads and other network parameters. Our findings yield two key insights: 1) for several popular use-cases (e.g., network scheduling), TCP's performance for low-priority flows is within 2x of a near-optimal scheme; and 2) for emerging workloads that exhibit on-off behavior in the high-priority queue (e.g., distributed ML model training), TCP's performance for low-priority flows is poor. Finally, we discuss and conduct a preliminary evaluation of two simple strategies, weighted fair queuing (WFQ) and cross-queue congestion notification, and show that they can substantially improve TCP's performance for low-priority flows.
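To make the spurious-timeout mechanism concrete, the following is a minimal sketch, not the paper's evaluation code. It assumes the standard retransmission-timer update rules from RFC 6298 (with the clock-granularity term omitted) and uses illustrative values that are assumptions rather than measurements from the study: a 100 µs base RTT and a 200 ms minimum RTO. The names `RtoEstimator` and `is_spurious_timeout` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RtoEstimator:
    """Simplified RTO estimator using the RFC 6298 smoothing rules."""
    min_rto: float = 0.2          # assumption: 200 ms floor (illustrative)
    alpha: float = 1 / 8          # SRTT gain from RFC 6298
    beta: float = 1 / 4           # RTTVAR gain from RFC 6298
    srtt: Optional[float] = None  # smoothed RTT, unset until the first sample
    rttvar: float = 0.0           # RTT variation estimate

    def update(self, rtt: float) -> float:
        """Fold one RTT sample (seconds) into the estimate; return the RTO."""
        if self.srtt is None:
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            # RTTVAR is updated with the old SRTT, then SRTT is updated.
            self.rttvar = (1 - self.beta) * self.rttvar + self.beta * abs(self.srtt - rtt)
            self.srtt = (1 - self.alpha) * self.srtt + self.alpha * rtt
        return self.rto

    @property
    def rto(self) -> float:
        if self.srtt is None:
            return 1.0  # initial RTO before any sample, per RFC 6298
        return max(self.min_rto, self.srtt + 4 * self.rttvar)


def is_spurious_timeout(est: RtoEstimator, base_rtt: float, queueing_delay: float) -> bool:
    """True if a packet that is only delayed (not lost) outlives the RTO.

    queueing_delay models the time a low-priority packet waits while the
    high-priority queue is being served.
    """
    return base_rtt + queueing_delay > est.rto


# Train the estimator on an uncongested path (assumed 100 us RTT).
est = RtoEstimator()
for _ in range(20):
    est.update(100e-6)

short_burst = is_spurious_timeout(est, 100e-6, 0.05)  # 50 ms high-priority burst
long_burst = is_spurious_timeout(est, 100e-6, 0.30)   # 300 ms high-priority burst
```

Under these assumed values, the 50 ms burst stays below the RTO while the 300 ms burst exceeds it, so the sender would retransmit a packet that is still waiting in the low-priority queue. This is the kind of prioritization-induced delay that an on-off high-priority workload can produce.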
