Automated, Reliable, and Efficient Continental-Scale Replication of 7.3 Petabytes of Climate Simulation Data: A Case Study
Lukasz Lacinski, Lee Liming, Steven Turoscy, Cameron Harr, Kyle Chard, Eli Dart, Paul Durack, Sasha Ames, Forrest M. Hoffman, Ian T. Foster
TL;DR
The paper tackles the challenge of duplicating petabytes of CMIP/ESGF climate data across major U.S. sites to improve reliability and accessibility. It describes a largely automated replication workflow built on Globus Transfer and a custom controller that coordinates 2×2291 path transfers (2291 paths to each of two destinations, LLNL→ALCF and LLNL→OLCF), with failover routing during maintenance. Results show ~7.3 PB moved to each destination over ~77 days, at average rates near 1.5 GB/s per link and peaks above 7.5 GB/s, despite LLNL GPFS bottlenecks. The work yields actionable lessons for large-scale data distribution and underlines the value of an ESGF-wide data replication fabric for CMIP7 and beyond.
Abstract
We report on our experiences replicating 7.3 petabytes (PB) of Earth System Grid Federation (ESGF) climate simulation data from Lawrence Livermore National Laboratory (LLNL) in California to Argonne National Laboratory (ANL) in Illinois and Oak Ridge National Laboratory (ORNL) in Tennessee. This movement of some 29 million files, performed twice in order to establish new ESGF nodes at ANL and ORNL, was carried out largely automatically by a simple replication tool: a script that invoked Globus to transfer large bundles of files while tracking progress in a database. Under the covers, Globus organized transfers to make efficient use of the high-speed Energy Sciences Network (ESnet) and the data transfer nodes deployed at participating sites, and also addressed security, integrity checking, and recovery from a variety of transient failures. This success demonstrates the considerable benefits that can accrue from the adoption of performant data replication infrastructure.
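The replication tool described above, a script that submits bundles of files for transfer and records progress in a database, can be illustrated with a minimal sketch. This is not the authors' implementation. The table layout, the function names (`init_db`, `register_paths`, `run_once`), the `max_active` limit, and the `submit` and `is_done` callbacks are all assumptions made for illustration. In a real deployment the callbacks would wrap calls to the Globus Transfer service; here they are left abstract so the bookkeeping logic can be read and run on its own.

```python
import sqlite3
from typing import Callable, Iterable


def init_db(conn: sqlite3.Connection) -> None:
    """Create a (hypothetical) progress table: one row per path and destination."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS transfers (
               path TEXT NOT NULL,
               destination TEXT NOT NULL,
               status TEXT NOT NULL DEFAULT 'PENDING',
               task_id TEXT,
               PRIMARY KEY (path, destination))"""
    )
    conn.commit()


def register_paths(conn: sqlite3.Connection, paths: Iterable[str],
                   destinations: Iterable[str]) -> None:
    """Record every path once per destination; rows already present are kept."""
    dests = list(destinations)
    for path in paths:
        for dest in dests:
            conn.execute(
                "INSERT OR IGNORE INTO transfers (path, destination) VALUES (?, ?)",
                (path, dest),
            )
    conn.commit()


def run_once(conn: sqlite3.Connection,
             submit: Callable[[str, str], str],
             is_done: Callable[[str], bool],
             max_active: int = 2) -> None:
    """One controller pass: mark finished tasks, then submit pending paths."""
    # Step 1: mark active transfers whose task has completed.
    active = conn.execute(
        "SELECT path, destination, task_id FROM transfers WHERE status = 'ACTIVE'"
    ).fetchall()
    for path, dest, task_id in active:
        if is_done(task_id):
            conn.execute(
                "UPDATE transfers SET status = 'SUCCEEDED' "
                "WHERE path = ? AND destination = ?",
                (path, dest),
            )

    # Step 2: fill free slots per destination, up to max_active concurrent tasks.
    dests = [row[0] for row in conn.execute(
        "SELECT DISTINCT destination FROM transfers ORDER BY destination")]
    for dest in dests:
        (n_active,) = conn.execute(
            "SELECT COUNT(*) FROM transfers "
            "WHERE destination = ? AND status = 'ACTIVE'",
            (dest,),
        ).fetchone()
        slots = max(0, max_active - n_active)
        pending = conn.execute(
            "SELECT path FROM transfers "
            "WHERE destination = ? AND status = 'PENDING' ORDER BY path LIMIT ?",
            (dest, slots),
        ).fetchall()
        for (path,) in pending:
            task_id = submit(path, dest)  # hypothetical hook, e.g. a Globus submission
            conn.execute(
                "UPDATE transfers SET status = 'ACTIVE', task_id = ? "
                "WHERE path = ? AND destination = ?",
                (task_id, path, dest),
            )
    conn.commit()
```

Because all state lives in the database, a pass such as `run_once` can be invoked repeatedly (for example from a scheduler) and resumes where the previous pass left off after an interruption. Failure handling, retries, and rerouting during maintenance are omitted from this sketch.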
