AI Surrogate Model for Distributed Computing Workloads

David K. Park, Yihui Ren, Ozgur O. Kilic, Tatiana Korchuganova, Sairam Sri Vatsavai, Joseph Boudreau, Tasnuva Chowdhury, Shengyu Feng, Raees Khan, Jaehyung Kim, Scott Klasky, Tadashi Maeno, Paul Nilsson, Verena Ingrid Martinez Outschoorn, Norbert Podhorszki, Frederic Suter, Wei Yang, Yiming Yang, Shinjae Yoo, Alexei Klimentov, Adolfy Hoisie

TL;DR

This work collects and processes real-world job submission records, applies four generative models for tabular data (TVAE, CTABGAN+, SMOTE, and TabDDPM) to these datasets, evaluates their performance, and concludes that the probabilistic-diffusion-model-based TabDDPM is the most suitable generative model for job record data.

Abstract

Large-scale international scientific collaborations, such as ATLAS, Belle II, CMS, and DUNE, generate vast volumes of data. These experiments require substantial computational power for varied tasks, including structured data processing, Monte Carlo simulations, and end-user analysis. Centralized workflow and data management systems are employed to handle these demands, but current decision-making processes for data placement and payload allocation are often heuristic and disjointed. This optimization challenge could potentially be addressed using contemporary machine learning methods, such as reinforcement learning, which in turn require access to extensive data and an interactive environment. Instead, we propose a generative surrogate modeling approach to address the lack of training data and concerns about privacy preservation. We have collected and processed real-world job submission records, totaling more than two million jobs over 150 days, and applied four generative models for tabular data (TVAE, CTABGAN+, SMOTE, and TabDDPM) to these datasets, thoroughly evaluating their performance. In addition to measuring the discrepancy between feature-wise distributions separately, we also evaluate pair-wise feature correlations, distance to closest record, and responses to pre-trained models. Our experiments indicate that SMOTE and TabDDPM can generate tabular data that is almost indistinguishable from the ground truth. Yet, as a non-learning method, SMOTE ranks the lowest in privacy preservation. As a result, we conclude that the probabilistic-diffusion-model-based TabDDPM is the most suitable generative model for job record data.
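
As a rough illustration of the "distance to closest record" metric named in the abstract, the sketch below computes, for each synthetic row, the Euclidean distance to its nearest real row. The abstract does not specify the distance function or preprocessing, so the Euclidean choice, the toy Gaussian data, and the function name `distance_to_closest_record` are all assumptions made for illustration only.

```python
import numpy as np

def distance_to_closest_record(synthetic, real):
    """For each synthetic row, return the Euclidean distance to its nearest real row.

    Assumption: features are numerical and already on comparable scales.
    """
    synthetic = np.asarray(synthetic, dtype=float)
    real = np.asarray(real, dtype=float)
    # Pairwise differences via broadcasting: (n_synthetic, n_real, n_features)
    diffs = synthetic[:, None, :] - real[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    return dists.min(axis=1)

rng = np.random.default_rng(0)
real = rng.normal(size=(200, 4))  # toy stand-in for normalized job features

# Hypothetical generator that nearly copies real rows (poor privacy)
copied = real[:50] + 1e-3 * rng.normal(size=(50, 4))
# Hypothetical generator that draws fresh samples from the same distribution
fresh = rng.normal(size=(50, 4))

dcr_copied = distance_to_closest_record(copied, real)
dcr_fresh = distance_to_closest_record(fresh, real)

print("median DCR, near-copies:", np.median(dcr_copied))
print("median DCR, fresh draws:", np.median(dcr_fresh))
```

Under these assumptions, very small distances suggest that synthetic rows sit almost on top of training records. That is the kind of behavior one might expect from an interpolation-based method such as SMOTE, and it is consistent with its low privacy ranking reported above.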

Paper Structure

This paper contains 19 sections, 5 figures, 1 table.

Figures (5)

  • Figure 1: The ATLAS experiment's growing data volume is distributed among computing sites globally.
  • Figure 2: Optimization of data placement and job allocation for distributed computing sites poses challenges to computational resilience and efficiency.
  • Figure 3: Dataset profile and filtering diagram. (a) The feature types (N: numerical; C: categorical) and the number of unique entries (# unique) reflect the merged training and test data. creationtime is when the job was created, and computingsite is where the job is executed. There are five dataset-related features: the project name (project), production step (prodstep), dataset type (datatype), number of input files (ninputdatafiles, i.e., nfiles), and total input size (inputfilebytes, i.e., size). The first six features are known prior to running the job, but the latter two, jobstatus and workload, are unknown until the job has finished executing. workload is defined as the product of the number of cores, Gflop per core, and CPU time used. (b) The diagram shows the total number of PanDA records collected, followed by the filtering operations that reduce them to the training and test sets for the generative models.
  • Figure 4: Comparison of generative performance based on the distributional similarity of individual features. (a) Each column shows one of the four numerical features used as training inputs, and each row corresponds to a model. Black lines show the ground truth (GT), and dotted colored lines show the synthetic data. (b) The plots compare the distributions of the most frequent unique entries across four categorical features.
  • Figure 5: Correlations between features in tabular data. (a) Correlation strengths in the ground truth training data. (b) Synthetic data correlations compared across the implemented tabular generative models. The bottom row shows the difference from the ground truth.
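
Two quantities from the captions above lend themselves to a short sketch: the workload definition in Figure 3 (number of cores times Gflop per core times CPU time) and the correlation difference against the ground truth shown in Figure 5. The captions do not state the units or the correlation measure used, so the example below assumes Pearson correlation on numerical columns only. The function names and the toy data are hypothetical.

```python
import numpy as np

def workload(n_cores, gflop_per_core, cpu_time):
    """Workload as the product described in the Figure 3 caption (units not specified there)."""
    return n_cores * gflop_per_core * cpu_time

def correlation_gap(real, synthetic):
    """Absolute difference between Pearson correlation matrices (columns are features).

    Assumption: Pearson correlation on numerical features; the measure used in
    the paper for mixed numerical/categorical features may differ.
    """
    corr_real = np.corrcoef(real, rowvar=False)
    corr_syn = np.corrcoef(synthetic, rowvar=False)
    return np.abs(corr_real - corr_syn)

rng = np.random.default_rng(1)
x = rng.normal(size=1000)
# Toy "real" table: columns 0 and 1 are strongly correlated, column 2 is independent
real = np.column_stack([x, 2 * x + 0.1 * rng.normal(size=1000), rng.normal(size=1000)])
# Hypothetical synthetic table: each column shuffled independently, which keeps
# every marginal distribution intact but destroys the cross-feature correlations
synthetic = np.column_stack([rng.permutation(real[:, j]) for j in range(real.shape[1])])

gap = correlation_gap(real, synthetic)
print(np.round(gap, 2))
print(workload(n_cores=8, gflop_per_core=10.0, cpu_time=3600.0))
```

The shuffled table illustrates why feature-wise distribution checks alone are not sufficient: its marginals match the real data exactly, yet the correlation gap between the first two columns is large.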