Automated Calibration of Parallel and Distributed Computing Simulators: A Case Study

Jesse McDonald; Maximilian Horzela; Frédéric Suter; Henri Casanova

TL;DR

This paper tackles the problem that calibration of parallel and distributed computing (PDC) simulators is often undocumented or labor-intensive, which undermines the reliability of simulation-based conclusions. It proposes automated calibration using simple search-based algorithms over log-space parameter ranges and evaluates them under a fixed time budget $T$ on real-world WLCG/HEP workloads, comparing against calibration performed by a domain scientist. In a CMS WLCG-inspired case study, automated methods generally achieve a lower Mean Relative Error (MRE) than manual calibration, with substantial gains when bottleneck resources are correctly identified, and they enable favorable speed-accuracy trade-offs. The work shows that, even with limited ground-truth data, diverse calibration inputs can yield robust calibration, and that calibration iterations can be made faster by adjusting simulation block sizes and time budgets. Overall, the study motivates broader adoption of automated calibration for PDC simulators and points toward advanced optimization techniques, such as Bayesian optimization, to handle larger parameter spaces in future work.
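To make the idea of search-based calibration under a time budget concrete, here is a minimal sketch, not the authors' implementation. It assumes random search as the "simple" algorithm, base-2 logarithmic parameter ranges, and the usual definition of MRE as the mean of |simulated - observed| / observed. The toy simulator, its two parameters (`bandwidth`, `core_speed`), and all function names are hypothetical and chosen only for illustration.

```python
import math
import random
import time


def mean_relative_error(simulated, observed):
    """Mean of |simulated - observed| / observed over paired measurements."""
    return sum(abs(s - o) / o for s, o in zip(simulated, observed)) / len(observed)


def random_search_calibration(simulate, observed, log2_ranges, budget_seconds, seed=0):
    """Sample parameter values uniformly in log2-space until the budget expires.

    Assumption: `simulate(params)` returns one simulated metric per observation.
    At least one candidate is always evaluated, even with a zero budget.
    """
    rng = random.Random(seed)
    deadline = time.monotonic() + budget_seconds
    best_params, best_error = None, math.inf
    while best_params is None or time.monotonic() < deadline:
        # Draw each exponent uniformly, then map back to the linear scale.
        params = {name: 2.0 ** rng.uniform(lo, hi)
                  for name, (lo, hi) in log2_ranges.items()}
        error = mean_relative_error(simulate(params), observed)
        if error < best_error:
            best_params, best_error = params, error
    return best_params, best_error


def toy_simulator(params):
    """Hypothetical stand-in for a real simulator: transfer time plus compute time."""
    jobs = [(1e9, 2e10), (4e9, 1e10), (2e9, 5e10)]  # (bytes read, operations)
    return [b / params["bandwidth"] + f / params["core_speed"] for b, f in jobs]


# Synthetic "ground truth" generated from known parameter values.
observed = toy_simulator({"bandwidth": 1e8, "core_speed": 1e9})
ranges = {"bandwidth": (20, 34), "core_speed": (20, 34)}  # exponents of 2
best, err = random_search_calibration(toy_simulator, observed, ranges, budget_seconds=0.2)
print(best, err)
```

Sampling exponents rather than raw values spreads candidates evenly across orders of magnitude, which matters when a plausible bandwidth or compute speed can span many decades. Raising `budget_seconds` trades calibration time for a lower expected error.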

Abstract

Many parallel and distributed computing research results are obtained in simulation, using simulators that mimic real-world executions on some target system. Each such simulator is configured by picking values for parameters that define the behavior of the underlying simulation models it implements. The main concern for a simulator is accuracy: simulated behaviors should be as close as possible to those observed in the real-world target system. This requires that values for each of the simulator's parameters be carefully picked, or "calibrated," based on ground-truth real-world executions. Examining the current state of the art shows that simulator calibration, at least in the field of parallel and distributed computing, is often undocumented (and thus perhaps often not performed) and, when documented, is described as a labor-intensive, manual process. In this work we evaluate the benefit of automating simulation calibration using simple algorithms. Specifically, we use a real-world case study from the field of High Energy Physics and compare automated calibration to calibration performed by a domain scientist. Our main finding is that automated calibration is on par with or significantly outperforms the calibration performed by the domain scientist. Furthermore, automated calibration makes it straightforward to operate desirable trade-offs between simulation accuracy and simulation speed.
