Do Data Center Network Metrics Predict Application-Facing Performance?

Brian Chang, Jeffrey C. Mogul, Rui Wang, Mingyang Zhang, Aditya Akella

TL;DR

Simple linear models often have the lowest prediction error, queueing-based models do better in a few cases, and no single network metric is universally the best predictor.

Abstract

Applications that run in large-scale data center networks (DCNs) rely on the DCN's ability to deliver application requests in a performant manner. DCNs expose a complex design and operational space, and network designers and operators care how different options in this space affect application performance. One might run controlled experiments and measure the corresponding application-facing performance, but such experiments become progressively less feasible at large scale, and simulations risk yielding inaccurate or incomplete results. Instead, we show that we can predict application-facing performance from more easily measured network metrics. For example, network telemetry metrics (e.g., link utilization) can predict application-facing metrics (e.g., transfer latency). Through large-scale measurements of production networks, we study the correlation between the two types of metrics and construct predictive, interpretable models that serve as suggestive guidelines for network designers and operators. We show that no single network metric is universally the best predictor (even though some prior work has focused on a single predictor). We find that simple linear models often have the lowest error, while queueing-based models are better in a few cases.
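
To make the comparison between the two model families concrete, here is a minimal sketch in Python of fitting both to the same utilization and latency samples and comparing their error. It is not the authors' method: the abstract does not specify the queueing-based form, so the M/M/1-style term u / (1 - u) is an assumption, as are the helper names (`fit_line`, `queue_feature`, `compare_models`) and the use of RMSE as the error measure. The sample data is synthetic and made up for illustration only.

```python
import math

def fit_line(x, y):
    """Ordinary least squares for y ~ a + b*x; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def rmse(pred, actual):
    """Root-mean-square error between predictions and observations."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, actual)) / len(actual))

def queue_feature(u):
    """Assumed M/M/1-style congestion term u / (1 - u); needs 0 <= u < 1."""
    return u / (1.0 - u)

def compare_models(util, latency):
    """Fit latency ~ a + b*u and latency ~ a + b*u/(1-u); return RMSE of each."""
    a_lin, b_lin = fit_line(util, latency)
    lin_pred = [a_lin + b_lin * u for u in util]

    q = [queue_feature(u) for u in util]
    a_q, b_q = fit_line(q, latency)
    q_pred = [a_q + b_q * qi for qi in q]

    return {"linear": rmse(lin_pred, latency), "queueing": rmse(q_pred, latency)}

# Synthetic samples (hypothetical, not measurements from the paper).
util = [0.1 * i for i in range(1, 10)]                        # utilization 0.1 .. 0.9
latency_linear = [1.0 + 2.0 * u for u in util]                # grows linearly with load
latency_queue = [1.0 + 0.5 * queue_feature(u) for u in util]  # blows up near saturation

errors_on_linear = compare_models(util, latency_linear)
errors_on_queue = compare_models(util, latency_queue)
```

On the linearly generated samples the linear fit has the lower error, and on the samples generated with the congestion term the queueing-style fit does. Which form fits real telemetry better is an empirical question, which is the point the abstract makes.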

Paper Structure

This paper contains 43 sections, 7 equations, 26 figures, 6 tables.

Figures (26)

  • Figure 1: Hypothetical latency vs. link utilization.
  • Figure 2: A typical Clos data center topology.
  • Figure 3: Overview of the approach
  • Figure 4: Example scatterplot of latency vs. utilization
  • Figure 5: Links used for inter-block NLMs
  • ...and 21 more figures