Data Management System Analysis for Distributed Computing Workloads
Kuan-Chieh Hsu, Sairam Sri Vatsavai, Ozgur O. Kilic, Tatiana Korchuganova, Paul Nilsson, Sankha Dutta, Yihui Ren, David K. Park, Joseph Boudreau, Tasnuva Chowdhury, Shengyu Feng, Raees Khan, Jaehyung Kim, Scott Klasky, Tadashi Maeno, Verena Ingrid Martinez Outschoorn, Norbert Podhorszki, Frédéric Suter, Wei Yang, Yiming Yang, Shinjae Yoo, Alexei Klimentov, Adolfy Hoisie
TL;DR
This work addresses the end-to-end efficiency of ATLAS’s PanDA workload management and Rucio data management when deployed together across a global grid. It proposes a file-level metadata-matching framework to link PanDA jobs with Rucio file transfers, enabling a joint view of data movement and workflow execution. By deriving exact and relaxed mappings, the study reveals inefficiencies such as redundant transfers, staging delays, and site imbalances, and provides case studies demonstrating practical resilience risks. The findings advocate tighter PanDA–Rucio co-design, real-time performance awareness, and adaptive data-placement strategies to improve resource utilization and system resilience in distributed computing environments. It additionally explores synthetic data generation as a path to train optimization algorithms for end-to-end scheduling and data placement.
Abstract
Large-scale international collaborations such as ATLAS rely on globally distributed workflows and data management to process, move, and store vast volumes of data. ATLAS's Production and Distributed Analysis (PanDA) workflow system and the Rucio data management system are each highly optimized for their respective design goals. However, operating them together at global scale exposes systemic inefficiencies, including underutilized resources, redundant or unnecessary transfers, and altered error distributions. Moreover, PanDA and Rucio currently lack shared performance awareness and coordinated, adaptive strategies. This work charts a path toward co-optimizing the two systems by diagnosing data-management pitfalls and prioritizing end-to-end improvements. Motivated by observed spatial and temporal imbalances in transfer activity, we develop a metadata-matching algorithm that links PanDA jobs and Rucio datasets at the file level, yielding a complete, fine-grained view of data access and movement. Using this linkage, we identify anomalous transfer patterns that violate PanDA's data-centric job-allocation principle. We then outline mitigation strategies for these patterns and highlight opportunities for tighter PanDA-Rucio coordination to improve resource utilization, reduce unnecessary data movement, and enhance overall system resilience.
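As a rough illustration of what file-level linkage with exact and relaxed mappings could look like, the Python sketch below joins job input-file records to transfer records. It is not the algorithm used in this work: the record layout, the field names (`lfn`, `scope`, `size`, `site`, `name`, `bytes`, `dst_site`), the choice of matching keys, and the sample values are all hypothetical and do not reflect actual PanDA or Rucio schemas. Here an "exact" link is assumed to require agreement on scope, file name, size, and destination site, while a "relaxed" link drops the scope and site constraints.

```python
from collections import defaultdict


def link_jobs_to_transfers(job_files, transfers):
    """Link job input files to transfer records (illustrative only).

    Returns a list of (job_id, transfer_id, kind) tuples, where kind is
    "exact" or "relaxed". All field names are assumptions.
    """
    # Index transfers twice: by the full key and by a looser key.
    exact_index = defaultdict(list)
    relaxed_index = defaultdict(list)
    for t in transfers:
        exact_index[(t["scope"], t["name"], t["bytes"], t["dst_site"])].append(t)
        relaxed_index[(t["name"], t["bytes"])].append(t)

    links = []
    for f in job_files:
        exact_key = (f["scope"], f["lfn"], f["size"], f["site"])
        hits = exact_index.get(exact_key)
        if hits:
            links.extend((f["job_id"], t["transfer_id"], "exact") for t in hits)
            continue
        # Fall back to the relaxed key only when no exact match exists.
        for t in relaxed_index.get((f["lfn"], f["size"]), []):
            links.append((f["job_id"], t["transfer_id"], "relaxed"))
    return links


# Hypothetical sample records; site and scope names are placeholders.
job_files = [
    {"job_id": 1, "scope": "demo", "lfn": "a.root", "size": 100, "site": "SITE_A"},
    {"job_id": 2, "scope": "demo", "lfn": "b.root", "size": 200, "site": "SITE_B"},
    {"job_id": 3, "scope": "demo", "lfn": "c.root", "size": 300, "site": "SITE_A"},
]
transfers = [
    {"transfer_id": "t1", "scope": "demo", "name": "a.root", "bytes": 100, "dst_site": "SITE_A"},
    {"transfer_id": "t2", "scope": "demo", "name": "b.root", "bytes": 200, "dst_site": "SITE_C"},
]

links = link_jobs_to_transfers(job_files, transfers)
for link in links:
    print(link)
```

In this toy input, job 1 links exactly to its transfer, job 2 links only under the relaxed key because the file was delivered to a different site than the one where the job ran, and job 3 has no matching transfer at all. Under these assumptions, relaxed-only links and unmatched files are the kind of records one might inspect first when looking for transfers that do not line up with where jobs actually executed.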
