Surrogate Modeling for Scalable Evaluation of Distributed Computing Systems for HEP Applications

Larissa Schmid, Maximilian Horzela, Valerii Zhyla, Manuel Giffels, Günter Quast, Anne Koziolek

TL;DR

This work addresses scalable evaluation of WLCG-style distributed systems by training ML surrogate models on DCSim outputs to predict per-job observables from platform, dataset, and workload configurations. It compares BiGRU, BiLSTM, and Transformer encoders using sequence-to-sequence prediction with MSE loss, achieving orders-of-magnitude faster predictions while preserving key dynamics, especially for compute_time in heterogeneous workloads. The main limitation is insufficient platform context in the inputs: transfer-time predictions struggle and some distribution features are not fully captured. Future work includes adding platform descriptors and real-world data to improve generalization. Overall, the approach enables rapid exploration of infrastructure designs at scales infeasible with direct simulation, with potential impact on planning and optimization in the WLCG ecosystem.
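To make the training objective concrete, the following is a minimal sketch of a mean squared error computed over a sequence of per-job observables. It assumes the error is averaged uniformly over all jobs and all observables, which this summary does not confirm. The function name `sequence_mse` and the numeric values are hypothetical; only the observable names `compute_time` and `input_files_transfer_time` come from the paper.

```python
from statistics import fmean


def sequence_mse(predicted, target):
    """Mean squared error over every job and every observable of one sequence.

    Both arguments are lists of jobs; each job is a list of observable values.
    Assumption: uniform averaging over jobs and observables.
    """
    if len(predicted) != len(target):
        raise ValueError("sequences must contain the same number of jobs")

    squared_errors = []
    for pred_job, true_job in zip(predicted, target):
        if len(pred_job) != len(true_job):
            raise ValueError("jobs must contain the same number of observables")
        squared_errors.extend((p - t) ** 2 for p, t in zip(pred_job, true_job))
    return fmean(squared_errors)


# Hypothetical values: each job is [compute_time, input_files_transfer_time].
simulated = [[10.0, 2.0], [12.0, 3.0]]  # reference values, e.g. from a simulator
surrogate = [[11.0, 2.0], [12.0, 5.0]]  # values predicted by a surrogate model

loss = sequence_mse(surrogate, simulated)
print(loss)
```

In this toy case the squared errors are 1, 0, 0, and 4, so the loss is their mean. A real training setup would typically normalize each observable first so that one quantity with a larger numeric range does not dominate the average.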

Abstract

The Worldwide LHC Computing Grid (WLCG) provides the robust computing infrastructure essential for the LHC experiments by integrating global computing resources into a cohesive entity. Simulating different compute models is a feasible way to evaluate future adaptations that can cope with increased demands. However, running these simulations incurs a trade-off between accuracy and scalability. For example, while the simulator DCSim can provide accurate results, it does not scale well with the size of the simulated platform. Using generative machine learning as a surrogate is a candidate for overcoming this challenge. In this work, we evaluate three different machine learning models for the simulation of distributed computing systems and assess their ability to generalize to unseen situations. We show that these models can predict central observables derived from execution traces of compute jobs with approximate accuracy while executing orders of magnitude faster. Furthermore, we identify opportunities for improving the accuracy and generalizability of the predictions.

Paper Structure

This paper contains 10 sections, 4 figures, 1 table.

Figures (4)

  • Figure 1: Approach overview.
  • Figure 2: Layout of training dataset.
  • Figure 3: Predictions for extrapolation of trained models in the homogeneous jobs setup. The input_files_transfer_time is similar for the other models.
  • Figure 4: Predictions for extrapolation of trained models in the heterogeneous jobs setup.