Rethinking Performance Analysis for Configurable Software Systems: A Case Study from a Fitness Landscape Perspective

Mingyu Huang, Peili Mao, Ke Li

TL;DR

Configurable software systems pose challenges for understanding how configurations map to performance due to black-box, high-dimensional spaces. The authors propose a fitness landscape perspective and GraphFLA, a graph-based framework to model configuration spaces as landscapes and enable scalable analysis. They conduct a large-scale case study on LLVM, Apache, and SQLite across 32 workloads, collecting over 86 million configurations to reveal six key findings about fitness distributions, prominent regions, ruggedness, optima distributions, and option interactions, with implications for tuning and performance modeling. They also show how surrogate models and optimization procedures behave in rugged landscapes and provide open data and a flexible toolkit for researchers to analyze configurable systems.

Abstract

Modern software systems are often highly configurable to accommodate varied requirements from diverse stakeholders. Understanding the mapping between configurations and the desired performance attributes plays a fundamental role in advancing the controllability and tuning of the underlying system, yet it has long remained poorly understood due to its black-box nature. While there have been previous efforts in performance analysis for these systems, they analyze configurations as isolated data points without considering their inherent spatial relationships. This renders them incapable of interrogating many important aspects of the configuration space, such as local optima. In this work, we advocate a novel perspective to rethink performance analysis: modeling the configuration space as a structured "landscape". To support this proposition, we designed GraphFLA, an open-source, graph data mining empowered fitness landscape analysis (FLA) framework. By applying this framework to $86$M benchmarked configurations from $32$ running workloads of $3$ real-world systems, we arrived at $6$ main findings, which together constitute a holistic picture of the landscape topography, with thorough discussions about their implications on both configuration tuning and performance modeling.
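To make the "landscape" idea concrete, the following is a minimal sketch and not GraphFLA's actual API. It assumes binary options, a neighborhood of configurations that differ in exactly one option, and a made-up runtime function (`toy_runtime`, lower is better); all of these names and values are hypothetical. It shows how neighborhood relationships expose local optima, which an analysis of isolated performance values cannot see.

```python
from itertools import product


def neighbors(config):
    """Yield every configuration differing from `config` in exactly one option."""
    for i in range(len(config)):
        flipped = list(config)
        flipped[i] = 1 - flipped[i]
        yield tuple(flipped)


def local_optima(fitness):
    """Return configurations strictly better (lower) than all of their neighbors."""
    return sorted(
        config
        for config, value in fitness.items()
        if all(value < fitness[n] for n in neighbors(config))
    )


def toy_runtime(config):
    """Hypothetical runtime: options a and b each help alone but clash together."""
    a, b, c = config
    return 10 - 3 * a - 4 * b + 8 * a * b - c


# Enumerate the full (tiny) configuration space and its measured fitness.
fitness = {cfg: toy_runtime(cfg) for cfg in product((0, 1), repeat=3)}

optima = local_optima(fitness)
best = min(fitness, key=fitness.get)
print("local optima:", optima)
print("global optimum:", best)
```

In this toy space the interaction between the first two options creates two separate basins, so a greedy one-option-at-a-time tuner can stall at a local optimum that is not the global one.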

Paper Structure

This paper contains 26 sections, 2 equations, 8 figures, 1 table.

Figures (8)

  • Figure 1: In software configuration space (panel B), configurations are spatially related to each other, and so are their associated performance values. Traditional performance analysis (panel C) only considers the distribution of performance values as isolated data points. Our approach (panel A), instead, additionally incorporates the neighborhood relationships between configurations, which are used to construct a configuration landscape that reveals the spatial distribution of performance values across the configuration space.
  • Figure 2: Schematic overview of GraphFLA.
  • Figure 3: General fitness distributions and prominent regions. (A) Normalized fitness distributions of three workloads of LLVM. (B) Spearman's $\rho$ of fitness between all pairs of workloads in each system. (C) Distributions of pairwise distance between top-$1\%$ configurations for an LLVM landscape and randomly sampled configurations. (D) 2D projection of the distribution of top-$1\%$ configurations in an LLVM landscape as compared to total configurations. (E) Overlaps in top-$1\%$ configurations between all pairs of workloads in each system. (F) Average rank shifts (percentile) of fitness for top-$1\%$ configurations between workloads.
  • Figure 4: Local optima, global optimum, and landscape ruggedness. (A) Number of local optima and total configurations in landscapes; error bars indicate standard deviation across workloads. (B) Autocorrelations of landscapes of each system under different lag distances, aggregated across workloads. (C) Exemplar distributions of pairwise distance between all pairs of local optima in a landscape and the same number of randomly sampled configurations. (D) Exemplar distribution of the distance between each local optimum and the global optimum for a landscape. (E) EMD for local optima between all pairs of workloads in each system. (F) 2D projection of the distribution of the global optimum in each of the $12$ LLVM landscapes as compared to total configurations (grey).
  • Figure 5: Individual and interactive fitness effects of options. (A) The distribution of fitness effects (normalized) of altering a single option of LLVM in all possible configuration backgrounds. (B) Aggregated fitness effects (i.e., feature importance) of each option across all backgrounds. (C) The Spearman correlation between options' overall fitness effects (a.k.a. feature importance) across all workloads of each system. (D) Fraction of non-zero interaction coefficients as determined by LASSO regression. (E) The Spearman correlation between options' pairwise interaction effects (a.k.a. feature interactions) across all workloads of each system. (F) $R^2$ score of a random forest regressor in fitting the landscape when using different degrees of interactions (max_depth). (G) The same as (A), but plots all $20$ options in LLVM.
  • ...and 3 more figures
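
The autocorrelation in Figure 4(B) measures ruggedness as a function of lag distance. The exact estimator used in the paper is not given in this summary, so the sketch below takes one common reading as an assumption: the Pearson correlation of fitness over all pairs of configurations at a given Hamming distance. The function names and the toy additive landscape are hypothetical. On a smooth landscape the correlation decays slowly with distance; on a rugged one it drops quickly.

```python
from itertools import product


def hamming(a, b):
    """Number of options on which two configurations differ."""
    return sum(x != y for x, y in zip(a, b))


def autocorrelation(fitness, lag):
    """Pearson correlation of fitness over all ordered pairs at distance `lag`.

    Assumes `fitness` covers the full configuration space, so the global
    mean and variance are valid for both members of each pair.
    """
    values = list(fitness.values())
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    pairs = [
        (a, b) for a in fitness for b in fitness if hamming(a, b) == lag
    ]
    cov = sum(
        (fitness[a] - mean) * (fitness[b] - mean) for a, b in pairs
    ) / len(pairs)
    return cov / var


# Toy additive landscape over 4 binary options: no interactions, hence smooth.
additive = {cfg: sum(cfg) for cfg in product((0, 1), repeat=4)}

for lag in (1, 2, 4):
    print(lag, round(autocorrelation(additive, lag), 3))
```

For an additive landscape over $n$ binary options, the correlation at lag $d$ is $1 - 2d/n$, which gives a known baseline for comparing measured landscapes.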