ElastiBench: Scalable Continuous Benchmarking on Cloud FaaS Platforms

Trever Schirmer; Tobias Pfandzelter; David Bermbach

TL;DR

ElastiBench addresses the inefficiency of continuous microbenchmarking in CI/CD by exploiting the elastic parallelism of cloud Function-as-a-Service (FaaS) platforms. It deploys two versions of a Software Under Test inside the same FaaS function, runs thousands of microbenchmarks in parallel, and uses bootstrapped medians to detect relative performance changes while mitigating cloud noise. The authors implement a Go-based proof of concept on AWS Lambda and show that ElastiBench can reduce execution time to about 15 minutes and cut costs compared with VM-based benchmarking, while reliably detecting around 95% of performance changes. This approach enables more frequent performance regression checks with faster feedback and is extensible to additional languages and benchmarks, with ongoing work to optimize resource configuration and benchmarking strategies.

Abstract

Running microbenchmark suites often and early in the development process enables developers to identify performance issues in their application. Microbenchmark suites of complex applications can comprise hundreds of individual benchmarks and take multiple hours to evaluate meaningfully, making running those benchmarks as part of CI/CD pipelines infeasible. In this paper, we reduce the total execution time of microbenchmark suites by leveraging the massive scalability and elasticity of FaaS (Function-as-a-Service) platforms. While using FaaS enables users to quickly scale up to thousands of parallel function instances to speed up microbenchmarking, the performance variation and low control over the underlying computing resources complicate reliable benchmarking. We demonstrate an architecture for executing microbenchmark suites on cloud FaaS platforms and evaluate it on code changes from an open-source time series database. Our evaluation shows that our prototype can produce reliable results (~95% of performance changes accurately detected) in a quarter of the time (≤15 min vs. ~4 h) and at lower cost ($0.49 vs. ~$1.18) compared to cloud-based virtual machines.

Paper Structure (26 sections, 7 figures)

This paper contains 26 sections, 7 figures.

Introduction
Microbenchmarking in the Cloud
Challenges for Microbenchmarking on FaaS
Performance
Restricted Environment
Representative Environment
Executing Microbenchmarks in FaaS
Proof-of-Concept Implementation
Evaluation
Experiment Design
Experiment Results
A/A Experiment
Baseline Experiment
Replication Experiment
Lower Memory Experiment
...and 11 more sections

Figures (7)

Figure 1: The traditional approach of microbenchmarking (top) relies on executing tasks (different circles) in random order multiple times on different virtual machines (gray boxes) to get reliable results. Using FaaS, these tasks can be executed on multiple function instances in parallel (bottom). With instance parallelism (three in this example), the duration of the suite run can be drastically reduced while also reducing inter-microbenchmark influences.
Figure 2: The process used to collect benchmarking data from FaaS functions. First, the function image (cf. the Proof-of-Concept Implementation section) is built and deployed to the FaaS platform. In step two, the function is called repeatedly with a configurable number of repeats per microbenchmark and a configurable instance parallelism. The results of all calls are then analyzed on the calling machine. The calling system can be the workstation of a developer or an automated CI/CD pipeline. While this figure shows multiple microbenchmarks being executed in one function call, Figure 1 shows the extreme example of just one microbenchmark per function call.
Figure 3: When comparing the difference in performance between two versions, the confidence interval (CI) shows the range in which the real difference likely lies. If the CI overlaps zero (black curve), we detect no performance change. If it does not overlap zero (yellow curve), we detect a performance change.
Figure 4: CDF showing the performance differences identified in the A/A experiment. While some microbenchmarks have a high difference, all of them are correctly categorized as no performance change.
Figure 5: CDF showing the performance differences identified in the baseline experiment. Performance changes have a generally higher difference than non-changes, with the median performance change being 3.08%.
...and 2 more figures
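
The decision rule described in the Figure 3 caption, combined with the bootstrapped medians mentioned in the TL;DR, can be illustrated with a small sketch. This is not the authors' implementation (their prototype is written in Go); it is a minimal, language-agnostic illustration in Python. The function names, the number of resamples, the 99% confidence level, and the percentile method for the interval are all assumptions made for this example.

```python
import random
import statistics


def bootstrap_median_change(old_ns, new_ns, resamples=2000, confidence=0.99, seed=0):
    """Return (low, high) bounds of a bootstrap confidence interval for the
    relative change in median runtime between two versions, in percent.

    old_ns / new_ns are lists of positive runtime measurements (for example
    nanoseconds per operation). All parameter defaults are assumptions.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    diffs = []
    for _ in range(resamples):
        # Resample each version with replacement and compare the medians.
        old_median = statistics.median(rng.choices(old_ns, k=len(old_ns)))
        new_median = statistics.median(rng.choices(new_ns, k=len(new_ns)))
        diffs.append((new_median - old_median) / old_median * 100.0)
    diffs.sort()
    # Percentile method: cut off equal tails on both sides.
    tail = (1.0 - confidence) / 2.0
    low = diffs[int(tail * (resamples - 1))]
    high = diffs[int((1.0 - tail) * (resamples - 1))]
    return low, high


def detects_change(low, high):
    """A change is reported only if the interval does not overlap zero."""
    return low > 0.0 or high < 0.0


if __name__ == "__main__":
    # Synthetic measurements, invented purely for illustration.
    version_a = [100 + (i % 5) for i in range(50)]
    version_b = [110 + (i % 5) for i in range(50)]
    low, high = bootstrap_median_change(version_a, version_b)
    print(f"CI: [{low:.2f}%, {high:.2f}%], change detected: {detects_change(low, high)}")
```

Comparing a version against itself (as in an A/A experiment) yields an interval that includes zero, so no change is reported, while a consistent slowdown pushes the whole interval above zero.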
