PANDA: Noise-Resilient Antagonist Identification in Production Datacenters

Sixiang Zhou, Nan Deng, Krzysiek Rzadca, Xiaojun Lin, Y. Charlie Hu

TL;DR

PANDA tackles performance interference in production datacenters by identifying antagonistic jobs that degrade co-located tasks, addressing the limitations of offline profiling and noisy production sampling. It introduces a global, noise-resilient framework that uses CPI as the primary metric, a machine-level CPI metric called $mnCPI$, and a per-job antagonism coefficient $a_j$ learned offline to enable robust online detection. Evaluation on GoogleTraceV3 shows that PANDA ranks true antagonists far higher on average (≈82.6th percentile, versus ≈54–56th percentile for baselines) and achieves perfect consistency in multi-victim scenarios, all with negligible overhead. The approach offers a scalable path toward interference-aware scheduling in large production datacenters.

Abstract

Modern warehouse-scale datacenters commonly collocate multiple jobs on shared machines to improve resource utilization. However, such collocation often leads to performance interference caused by antagonistic jobs that overconsume shared resources. Existing antagonist-detection approaches either rely on offline profiling, which is costly and unscalable, or use a sample-from-production approach, which suffers from noisy measurements and fails under multi-victim scenarios. We present PANDA, a noise-resilient antagonist identification framework for production-scale datacenters. Like prior correlation-based methods, PANDA uses cycles per instruction (CPI) as its performance metric, but it differs by (i) leveraging global historical knowledge across all machines to suppress sampling noise and (ii) introducing a machine-level CPI metric that captures shared-resource contention among multiple co-located tasks. Evaluation on a recent Google production trace shows that PANDA ranks true antagonists far more accurately than prior methods -- improving average suspicion percentile from 50-55% to 82.6% -- and achieves consistent antagonist identification under multi-victim scenarios, all with negligible runtime overhead.

Paper Structure

This paper contains 23 sections, 4 equations, 1 figure, and 3 tables.

Figures (1)

  • Figure 1: PDF of the absolute CPI sample difference between twin tasks and between random task pairs, shown for the four jobs with the highest number of twin-task occurrences.