JANUS: Resilient and Adaptive Data Transmission for Enabling Timely and Efficient Cross-Facility Scientific Workflows
Vladislav Esaulov, Jieyang Chen, Norbert Podhorszki, Fred Suter, Scott Klasky, Anu G Bourgeois, Lipeng Wan
TL;DR
The paper tackles the challenge of transferring massive scientific datasets across wide-area networks for cross-facility workflows, where TCP-based transfers incur high latency and inefficiency. It introduces JANUS, a UDP-based transfer pipeline that combines erasure coding with error-bounded lossy compression and uses multilevel data refactoring (pMGARD) to enable progressive reconstruction and adjustable redundancy. It formalizes two optimization models, one that minimizes transmission time under an error bound and one that minimizes error under a time constraint, and develops adaptive transfer protocols that adjust erasure coding parameters in real time. In simulations and real-network experiments with Nyx-derived data, JANUS improves transfer efficiency while maintaining data fidelity, offering a practical path toward timely, scalable data sharing in large-scale scientific collaborations.
Abstract
In modern science, the growing complexity of large-scale scientific projects has led to an increasing reliance on cross-facility scientific workflows, where resources and expertise from multiple institutions and geographic locations are leveraged to accelerate scientific discovery. These workflows often require transmitting huge amounts of scientific data through wide-area networks. Although high-speed networks like ESnet and transfer services such as Globus have improved data mobility, several challenges remain. The sheer volume of data can overwhelm network bandwidth, widely used transport protocols such as TCP suffer from inefficiencies due to retransmissions triggered by packet loss, and existing fault-tolerance mechanisms like erasure coding introduce substantial overhead. In this paper, we propose JANUS, a resilient and adaptable data transmission approach designed for cross-facility scientific workflows. Unlike traditional TCP-based methods, JANUS leverages UDP, integrates erasure coding for fault tolerance, and combines it with error-bounded lossy compression to reduce overhead. This novel design allows users to balance data transmission time and accuracy, optimizing transfer performance based on specific scientific requirements. Additionally, JANUS dynamically adjusts erasure coding parameters in response to real-time network conditions, ensuring efficient data transfers even in fluctuating environments. We develop optimization models for determining ideal configurations and implement adaptive data transfer protocols to enhance reliability. Through extensive simulations and real-network experiments, we demonstrate that JANUS significantly improves transfer efficiency while maintaining data fidelity.
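To make the idea of adjusting erasure coding parameters to network conditions concrete, the following is a minimal illustrative sketch, not the optimization model or protocol used by JANUS. It assumes an MDS-style code in which any k of the k + m transmitted fragments suffice to rebuild a block, and it assumes packet losses are independent with a known loss rate. The function names `recovery_probability` and `min_parity_fragments`, the `target` threshold, and the `max_parity` cap are all hypothetical.

```python
from math import comb


def recovery_probability(k: int, m: int, loss_rate: float) -> float:
    """Probability that at least k of the k + m fragments arrive.

    Assumes each fragment is lost independently with probability
    loss_rate (a simplifying assumption, not a claim about real WANs).
    """
    n = k + m
    arrive = 1.0 - loss_rate
    return sum(
        comb(n, i) * arrive**i * loss_rate ** (n - i)
        for i in range(k, n + 1)
    )


def min_parity_fragments(
    k: int, loss_rate: float, target: float = 0.999, max_parity: int = 256
) -> int:
    """Smallest parity count m whose recovery probability meets target."""
    for m in range(max_parity + 1):
        if recovery_probability(k, m, loss_rate) >= target:
            return m
    raise ValueError("target not reachable within max_parity fragments")


if __name__ == "__main__":
    # Re-evaluate the parity count as the measured loss rate changes.
    for measured_loss in (0.0, 0.01, 0.05):
        m = min_parity_fragments(k=32, loss_rate=measured_loss)
        print(f"loss={measured_loss:.2f} -> parity fragments m={m}")
```

Under these assumptions, a sender could recompute m whenever its loss estimate changes, adding redundancy only when the network degrades. A real protocol would also have to account for bursty loss, encoding cost, and the error-bound and time constraints described in the abstract.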
