Measurement-based Resource Allocation and Control in Data Centers: A Survey
Diana Andreea Popescu
TL;DR
This survey addresses how measurement-driven resource allocation and control can yield predictable cloud performance in data centers despite network interference. It systematically reviews network measurement fundamentals, data-center architectures, traffic characteristics, workload traces, monitoring systems, and scheduling frameworks, placing a special emphasis on network-aware and ML-assisted approaches. The work highlights the evolution from SDN-based to programmable dataplane monitoring, evaluates bandwidth and tail-latency guarantees, and surveys ML-driven schedulers that leverage telemetry to optimize performance. Its holistic perspective identifies gaps between trace availability, network-state integration, and scheduling decisions, and it argues that combining flow-level measurement with topology-aware optimization is key to practical, scalable SLA attainment in modern data centers.
Abstract
Data centers have become ubiquitous for today's businesses. Companies from banks to startups rely on cloud infrastructure to deploy user applications. In this context, it is vital to provide users with application performance guarantees. Network interference is one of the causes of unpredictable application performance, and many solutions have been proposed over the years. The main objective of this survey is to familiarize the reader with research into network measurement-based resource allocation and control in data centers, focusing on network resources in order to provide cloud performance guarantees. We start with a primer on general network measurement techniques and on data center networks and applications to give the reader context. We then summarize the characteristics of network traffic and cluster workloads in data centers, which are pivotal for measurement-based allocation and control. We study and compare network monitoring systems in data centers, giving an overview of their evolution from Software-Defined Networking (SDN)-based to programmable dataplane-based designs. The network monitoring information can serve as input to cluster allocation and scheduling decisions. We next categorize cluster scheduling frameworks and analyze those that provide network guarantees in data centers. We also look at emerging Machine Learning-driven resource allocation and control. We conclude with a discussion of future research directions.
