Investigating the Impact of Isolation on Synchronized Benchmarks

Nils Japke, Furat Hamdan, Diana Baumann, David Bermbach

TL;DR

This work tackles cloud performance variability by using duet benchmarking to compare two versions of a system under test (SUT) on the same VM, which mitigates external interference. It introduces a noise generator and evaluates three isolation strategies (cgroups with CPU pinning, Docker containers, and Firecracker MicroVMs) against an unisolated baseline, using bootstrap confidence intervals and Wilcoxon tests to detect latency changes. The findings show that Docker containers are more susceptible to noise and produce more false positives, while cgroups with CPU pinning and Firecracker MicroVMs provide better isolation, with MicroVMs performing best overall. The study gives guidance for selecting isolation techniques in synchronized benchmarks and provides replication artifacts to support adoption and validation in cloud performance benchmarking.
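To illustrate how a bootstrap confidence interval can flag a latency change between two duet versions, here is a minimal sketch. It is not the authors' analysis code: the function names, the percentile bootstrap variant, the choice of the median as statistic, the 95% confidence level, and the number of resamples are all assumptions made for illustration.

```python
import random
import statistics


def relative_changes(latencies_a, latencies_b):
    """Per-pair relative latency change of version B against version A.

    Assumes the two lists hold paired measurements from a duet run.
    """
    return [(b - a) / a for a, b in zip(latencies_a, latencies_b)]


def bootstrap_median_ci(samples, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the median of samples.

    All defaults are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    n = len(samples)
    # Resample with replacement and record the median of each resample.
    medians = sorted(
        statistics.median(rng.choices(samples, k=n)) for _ in range(n_resamples)
    )
    alpha = 1.0 - confidence
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return medians[lo_idx], medians[hi_idx]


def detects_change(ci):
    """Report a change only if the interval excludes zero."""
    lo, hi = ci
    return lo > 0 or hi < 0
```

When both versions are identical, any detected change is a false positive, which is the kind of error the isolation strategies are compared on. A Wilcoxon signed-rank test could be applied to the same paired changes, for example with `scipy.stats.wilcoxon`.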

Abstract

Benchmarking in cloud environments suffers from performance variability caused by multi-tenant resource contention. Duet benchmarking mitigates this by running two workload versions concurrently on the same VM, exposing them to identical external interference. However, intra-VM contention between the synchronized workloads necessitates additional isolation mechanisms. This work evaluates three such strategies: cgroups and CPU pinning, Docker containers, and Firecracker MicroVMs. We compare all strategies with an unisolated baseline experiment by running benchmarks in a duet setup alongside a noise generator. This noise generator "steals" compute resources to degrade performance measurements. Latency distributions differed under noise generation in all experiments, but process isolation generally lowered false positives, except in our experiments with Docker containers. Even though Docker containers rely internally on cgroups and CPU pinning, they were more susceptible to performance degradation due to noise. Therefore, we recommend using process isolation for synchronized workloads, with the exception of Docker containers.
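As a rough illustration of the CPU pinning part of the first strategy, the sketch below splits the available cores into disjoint sets, one per duet version, and pins a process to its set. The group names and the equal split are hypothetical, the paper's actual core layout is not stated here, and cgroup configuration is not covered. `os.sched_setaffinity` is available on Linux only.

```python
import os


def partition_cores(available, groups):
    """Split core IDs into disjoint, equally sized sets, one per group.

    The equal split is an assumption made for illustration.
    """
    cores = sorted(available)
    per_group = len(cores) // len(groups)
    if per_group == 0:
        raise ValueError("not enough cores for one core per group")
    return {
        name: set(cores[i * per_group:(i + 1) * per_group])
        for i, name in enumerate(groups)
    }


def pin_process(pid, cores):
    """Pin a process to the given cores (Linux only; pid 0 is the caller)."""
    os.sched_setaffinity(pid, cores)
```

For example, `partition_cores(range(4), ["version_a", "version_b"])` assigns cores 0 and 1 to one version and cores 2 and 3 to the other, so the two synchronized workloads do not compete for the same cores.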

Paper Structure

This paper contains 14 sections, 5 figures, 1 table.

Figures (5)

  • Figure 1: An illustration of the general experiment setup. The different groups on both VMs represent the different isolation strategies, i.e., using cgroups and CPU pinning, using Docker with similar functionality, and using Firecracker MicroVMs with similar functionality. In the baseline experiment, we use no isolation strategy, which means that this blue box does not exist for that particular setup.
  • Figure 2: Boxplots for all results of Experiment 1 -- Baseline. For each endpoint, we show the distribution of relative changes in request latency throughout the experiment for all configurations. We also show the distribution outside the noise generation (no noise), during noise generation (only noise), and all result data together (all data).
  • Figure 3: Boxplots for all results of Experiment 2 -- cgroups and CPU pinning. For each endpoint, we show the distribution of relative changes in request latency throughout the experiment for all configurations. We also show the distribution outside the noise generation (no noise), during noise generation (only noise), and all result data together (all data).
  • Figure 4: Boxplots for all results of Experiment 3 -- Docker containers. For each endpoint, we show the distribution of relative changes in request latency throughout the experiment for all configurations. We also show the distribution outside the noise generation (no noise), during noise generation (only noise), and all result data together (all data).
  • Figure 5: Boxplots for all results of Experiment 4 -- Firecracker MicroVMs. For each endpoint, we show the distribution of relative changes in request latency throughout the experiment for all configurations. We also show the distribution outside the noise generation (no noise), during noise generation (only noise), and all result data together (all data).