Omnibenchmark: transparent, reproducible, extensible and standardized orchestration of solo and collaborative benchmarks

Izaskun Mallona, Almut Luetge, Ben Carrillo, Daniel Incicau, Reto Gerber, Aidan Meara, Anthony Sonrel, Charlotte Soneson, Mark D. Robinson

TL;DR

Omnibenchmark is a novel benchmarking system that facilitates benchmark formalization and execution in both solo and community efforts. It provides unprecedented flexibility: existing benchmark designs can be forked and extended, and run separately or collaboratively, giving versioned and standardized result outputs and therefore much-needed transparency to the analysis and interpretation of benchmark results.

Abstract

Benchmarking involves designing, running and disseminating rigorous performance assessments of methods, most often for data analysis and software tools, but the process can also be applied to experimental systems. Ideally, a benchmarking system facilitates this process by providing a structured entrypoint to design, coordinate, execute, and store standardized benchmarks. We describe a novel benchmarking system, Omnibenchmark, that facilitates benchmark formalization and execution in both solo and community efforts. Omnibenchmark provides a flexible benchmark plan syntax (i.e., a configuration YAML file), dynamic workflow generation based on Snakemake, S3-compatible storage handling, and reproducible software environments using environment modules, Apptainer or Conda. Such a setup provides unprecedented flexibility: existing benchmark designs can be forked and extended, and run separately or collaboratively, giving versioned and standardized result outputs and therefore much-needed transparency to the analysis and interpretation of benchmark results. Tutorials and installation instructions are available from https://omnibenchmark.org.

Paper Structure (22 sections, 6 figures, 3 tables)


Figures (6)

  • Figure 1: The life cycle of an Omnibenchmark. The four main stages are depicted linearly for simplicity. A, the same person authors the plan and modules; results are typically kept locally. B, a group collaborates on a plan via merge requests, publishes different modules, and schedules runs on reference hardware. Results are best published to remote storage. Note that the workflow can switch between solo and collaborative at any step. C, commands useful at each stage (full list in Table \ref{tab:commands}).
  • Figure 2: Implementation of the omni-clustering-benchmarks framework and porting the OpenProblems Spatially Variable Genes benchmark. A, simplified benchmark plan for omni-clustering-benchmarks; note that each module at each stage implements multiple datasets, methods, or metrics via parameters (in total here, 62 datasets, 48 methods and 3 metrics are included in omni-clustering-benchmarks; this benchmark plan was run 18 times, for 3 seeds, 2 runs, and 3 software backends). B, computed clustering performance (Adjusted Rand Index) with Apptainer software versus Conda software. C, computed clustering performance (Adjusted Rand Index) with EasyBuild software versus Conda software. D, relative CPU time of Apptainer software compared to Conda software versus Conda CPU time. E, relative CPU time of EasyBuild software compared to Conda software versus Conda CPU time. F, simplified benchmark plan for omni-OpenProblems-svg; here, the load_spatial_data module parameterizes 16 datasets, each methods module implements a single method; the correlation module implements Kendall's $\tau$ against the ground truth. G, omni-OpenProblems-svg reproduces to a high degree the Kendall's $\tau$ values computed in the original OpenProblems benchmark.
  • Figure S3: Relative CPU time of Apptainer (left) and EasyBuild (right) software compared to Conda software versus Conda CPU time. All run times and ratios are presented here (filtering is applied in Figure \ref{fig:example_benchmarks}).
  • Figure S4: Relative maximum Resident Set Size (RSS) of Apptainer (left) and EasyBuild (right) software compared to Conda software versus Conda maximum RSS (in megabytes).
  • Figure S5: The effect on algorithmic performance (here, change in Adjusted Rand Index relative to no cluster number misspecification) of misspecifying the number of clusters. Results are split by their module and faceted by the number of true clusters.
  • ...and 1 more figure
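
Two quantities in the Figure 2 caption can be made concrete with a short sketch: the 18 executions of the clustering benchmark plan (3 seeds, 2 runs, 3 software backends) and the Adjusted Rand Index used to score clustering performance. The code below is an illustration written for this summary, not part of Omnibenchmark. The function names `run_grid` and `adjusted_rand_index`, the seed values, and the backend labels are assumptions, and the index is computed with the standard pair-counting formula rather than with whatever implementation the benchmark's metric modules use.

```python
from collections import Counter
from itertools import product
from math import comb


def run_grid(seeds, repeats, backends):
    """Enumerate every (seed, repeat, backend) combination (hypothetical helper)."""
    return list(product(seeds, repeats, backends))


def adjusted_rand_index(truth, predicted):
    """Adjusted Rand Index of two labelings, via the pair-counting formula."""
    if len(truth) != len(predicted):
        raise ValueError("label sequences must have the same length")

    # Pairs of items that share a cluster in both labelings, and in each one.
    pairs_joint = sum(comb(c, 2) for c in Counter(zip(truth, predicted)).values())
    pairs_truth = sum(comb(c, 2) for c in Counter(truth).values())
    pairs_pred = sum(comb(c, 2) for c in Counter(predicted).values())
    total_pairs = comb(len(truth), 2)

    if total_pairs == 0:
        return 1.0  # fewer than two items: treat the labelings as identical

    expected = pairs_truth * pairs_pred / total_pairs  # agreement expected by chance
    maximum = (pairs_truth + pairs_pred) / 2           # agreement of a perfect match
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings use a single cluster
    return (pairs_joint - expected) / (maximum - expected)


if __name__ == "__main__":
    # Seed values and backend labels are placeholders chosen for illustration.
    runs = run_grid(seeds=(1, 2, 3), repeats=(1, 2),
                    backends=("conda", "apptainer", "easybuild"))
    print(len(runs))
    # Same partition with the cluster names swapped.
    print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

The index depends only on which items are grouped together, not on the cluster labels themselves, so a labeling that recovers the true partition under different names still scores 1.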