Surrogate Modeling for Scalable Evaluation of Distributed Computing Systems for HEP Applications
Larissa Schmid, Maximilian Horzela, Valerii Zhyla, Manuel Giffels, Günter Quast, Anne Koziolek
TL;DR
This work addresses the scalable evaluation of WLCG-style distributed systems by training ML surrogate models on DCSim outputs to predict per-job observables from platform, dataset, and workload configurations. It compares BiGRU, BiLSTM, and Transformer encoders trained for sequence-to-sequence prediction with an MSE loss. The surrogates produce predictions orders of magnitude faster than the simulator while preserving key dynamics, especially for compute_time in heterogeneous workloads. The main limitation is insufficient platform context in the inputs: transfer-time predictions are less accurate, and some features of the distributions are not fully captured. Future work includes adding platform descriptors and real-world data to improve generalization. Overall, the approach enables rapid exploration of infrastructure designs at scales that are infeasible with direct simulation, with potential impact on planning and optimization in the WLCG ecosystem.
Abstract
The Worldwide LHC Computing Grid (WLCG) provides the robust computing infrastructure essential for the LHC experiments by integrating global computing resources into a cohesive entity. Simulating different compute models is a feasible way to evaluate adaptations of this infrastructure that can cope with increased future demands. However, running these simulations involves a trade-off between accuracy and scalability. For example, while the simulator DCSim can provide accurate results, it does not scale well with the size of the simulated platform. Using Generative Machine Learning as a surrogate is a candidate for overcoming this challenge. In this work, we evaluate three different Machine Learning models for the simulation of distributed computing systems and assess their ability to generalize to unseen situations. We show that these models can approximately predict central observables derived from the execution traces of compute jobs, with execution times that are orders of magnitude shorter than those of the simulator. Furthermore, we identify opportunities to improve the accuracy and generalizability of the predictions.
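The sequence-to-sequence MSE objective named in the TL;DR can be illustrated with a minimal, dependency-free sketch. It assumes that each job in a workload sequence maps to one vector of predicted observables and that the loss is averaged over all jobs and observables. The function name `seq2seq_mse`, the pairing of compute_time with a transfer-time value, and the numbers are hypothetical and are not taken from the paper.

```python
from typing import Sequence

Vector = Sequence[float]


def seq2seq_mse(predicted: Sequence[Vector], simulated: Sequence[Vector]) -> float:
    """Mean squared error averaged over every job and every observable."""
    if len(predicted) != len(simulated):
        raise ValueError("both sequences must contain the same number of jobs")

    total = 0.0
    count = 0
    for pred_job, sim_job in zip(predicted, simulated):
        if len(pred_job) != len(sim_job):
            raise ValueError("each job must have the same number of observables")
        for p, s in zip(pred_job, sim_job):
            total += (p - s) ** 2
            count += 1

    if count == 0:
        raise ValueError("cannot compute a loss over empty sequences")
    return total / count


# Hypothetical example: two jobs, each described by
# (compute_time, transfer_time) in seconds.
simulated = [[100.0, 20.0], [250.0, 40.0]]   # reference values from a simulator
predicted = [[110.0, 20.0], [250.0, 30.0]]   # values from a surrogate model

loss = seq2seq_mse(predicted, simulated)
print(loss)
```

In an actual training setup this quantity would typically be computed by the ML framework on batched tensors; the sketch only makes explicit what is being averaged.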
