Modular Architecture for High-Performance and Low Overhead Data Transfers
Rasman Mubtasim Swargo, Engin Arslan, Md Arifuzzaman
TL;DR
The paper addresses the challenge of rapidly moving massive datasets over high-bandwidth, geographically distributed networks whose conditions change dynamically. It introduces AutoMDT, a modular data transfer architecture that jointly optimizes three concurrency dimensions (read, network, and write) using a deep reinforcement learning (DRL) agent trained offline with Proximal Policy Optimization (PPO) in a dedicated memory-buffer dynamics simulator. The approach yields up to 8x faster convergence and up to 68% shorter transfer times than state-of-the-art baselines on production-grade testbeds (CloudLab and Fabric). Because training runs offline in the simulator, it completes in about 45 minutes, giving a practical path to stable, high-performance data transfers in real-world HPC environments without modifying kernel or transport-layer configurations.
Abstract
High-performance applications necessitate rapid and dependable transfer of massive datasets across geographically dispersed locations. Traditional file transfer tools often suffer from resource underutilization and instability because of fixed configurations or monolithic optimization methods. We propose AutoMDT, a novel modular data transfer architecture that employs a deep reinforcement learning based agent to simultaneously optimize concurrency levels for read, network, and write operations. Our solution incorporates a lightweight network-system simulator, enabling offline training of a Proximal Policy Optimization (PPO) agent in approximately 45 minutes on average, thereby overcoming the impracticality of lengthy online training in production networks. AutoMDT's modular design decouples I/O and network tasks, allowing the agent to capture complex buffer dynamics precisely and to adapt quickly to changing system and network conditions. Evaluations on production-grade testbeds show that AutoMDT achieves up to 8x faster convergence and a 68% reduction in transfer completion times compared with state-of-the-art solutions.
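To make the buffer dynamics concrete, the following is a minimal sketch of how read, network, and write concurrency might interact through sender-side and receiver-side memory buffers. It is not the authors' simulator: the class name `BufferSim`, the per-thread rates, the aggregate caps, and the buffer capacity are all hypothetical values chosen for illustration, and the linear per-thread scaling is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass
class BufferSim:
    """Toy three-stage pipeline: read -> sender buffer -> network -> receiver buffer -> write.

    All numbers are hypothetical (MB and MB/s); they are not taken from the paper.
    """
    read_per_thread: float = 100.0   # assumed throughput of one read thread
    net_per_thread: float = 50.0     # assumed throughput of one network stream
    write_per_thread: float = 80.0   # assumed throughput of one write thread
    stage_cap: float = 1000.0        # assumed aggregate ceiling for each stage
    buf_capacity: float = 2000.0     # assumed capacity of each memory buffer
    sender_buf: float = 0.0
    receiver_buf: float = 0.0

    def step(self, n_read: int, n_net: int, n_write: int, dt: float = 1.0):
        """Advance the pipeline by dt seconds and return achieved (read, net, write) rates."""
        # Reads fill the sender buffer, limited by its free space.
        read = min(n_read * self.read_per_thread, self.stage_cap) * dt
        read = min(read, self.buf_capacity - self.sender_buf)
        self.sender_buf += read

        # The network drains the sender buffer into the receiver buffer,
        # limited by available data and by free space on the receiver.
        net = min(n_net * self.net_per_thread, self.stage_cap) * dt
        net = min(net, self.sender_buf, self.buf_capacity - self.receiver_buf)
        self.sender_buf -= net
        self.receiver_buf += net

        # Writes drain the receiver buffer.
        write = min(n_write * self.write_per_thread, self.stage_cap) * dt
        write = min(write, self.receiver_buf)
        self.receiver_buf -= write

        return read / dt, net / dt, write / dt


if __name__ == "__main__":
    sim = BufferSim()
    # A single write thread is the bottleneck in this configuration.
    for _ in range(100):
        rates = sim.step(n_read=2, n_net=4, n_write=1)
    print(rates, sim.sender_buf, sim.receiver_buf)
```

In this toy model, an under-provisioned write stage first fills the receiver buffer and then the sender buffer, so the read and network stages are eventually throttled to the write rate. That backpressure is the kind of coupling between the three concurrency levels that motivates optimizing them jointly rather than one at a time.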
