Elastic Data Transfer Optimization with Hybrid Reinforcement Learning

Rasman Mubtasim Swargo, Md Arifuzzaman

TL;DR

This work tackles the challenge of maximizing data-transfer throughput over high-speed networks by moving beyond concurrency-only optimization. It introduces LDM, an adaptive framework that jointly optimizes pipelining, concurrency, and parallelism by combining infinite pipelining, heuristic chunk-based parallelism, and a PPO-based concurrency policy trained offline in a lightweight simulator. Offline training in the simulator is about 2750x faster than online training, and the resulting policy delivers up to 9.5x higher throughput than state-of-the-art baselines across diverse datasets and network conditions. LDM also demonstrates fairness and robustness in multi-tenant settings and generalizes across TCP congestion control algorithms, making it practical for real-world, heterogeneous transfer workloads.

Abstract

Modern scientific data acquisition generates petabytes of data that must be transferred to geographically distant computing clusters. Conventional tools either rely on preconfigured sessions, which are difficult to tune for users without domain expertise, or they adaptively optimize only concurrency while ignoring other important parameters. We present LDM, an adaptive data transfer method that jointly considers multiple parameters. Our solution incorporates heuristic-based parallelism, infinite pipelining, and a concurrency optimizer based on deep reinforcement learning. To make agent training practical, we introduce a lightweight network simulator that reduces training time to less than four minutes and provides a 2750x speedup compared to online training. Experimental evaluation shows that LDM consistently outperforms existing methods across diverse datasets, achieving up to 9.5x higher throughput compared to state-of-the-art solutions.
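To make the three parameters concrete, the following is a minimal sketch of how they can interact in a transfer loop. It is not LDM's implementation: the function names, the fixed `chunk_size`, the fixed `concurrency` value (which LDM instead chooses with a learned policy), and the `send` callback are all hypothetical and exist only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from queue import Queue, Empty


def split_into_chunks(file_size, chunk_size):
    """Parallelism: split one file into (offset, length) byte ranges."""
    return [(off, min(chunk_size, file_size - off))
            for off in range(0, file_size, chunk_size)]


def build_work_queue(files, chunk_size):
    """Pipelining: enqueue every chunk up front so a worker can start the
    next chunk immediately instead of waiting for a new request."""
    work = Queue()
    for name, size in files.items():
        for off, length in split_into_chunks(size, chunk_size):
            work.put((name, off, length))
    return work


def worker(work, send):
    """Drain the shared queue and return the number of bytes handed to send."""
    sent = 0
    while True:
        try:
            name, off, length = work.get_nowait()
        except Empty:
            return sent
        send(name, off, length)  # hypothetical callback that moves the bytes
        sent += length


def transfer(files, chunk_size, concurrency, send):
    """Concurrency: run `concurrency` workers against the same queue."""
    work = build_work_queue(files, chunk_size)
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(worker, work, send) for _ in range(concurrency)]
        return sum(f.result() for f in futures)
```

In this sketch, `files` maps a file name to its size in bytes, and `transfer` returns the total number of bytes dispatched. An adaptive optimizer would adjust `concurrency` during the transfer rather than fixing it once.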

Paper Structure

This paper contains 33 sections, 10 equations, 13 figures, 5 tables, 1 algorithm.

Figures (13)

  • Figure 1: Visual comparison of data transfer behaviors under different optimization strategies. Pipelining eliminates control-channel idle gaps, concurrency transfers multiple files simultaneously, and parallelism splits large files. LDM combines all three to maximize throughput.
  • Figure 2: Growth of the TCP congestion window over time for 1 MB and 32 MB files, with and without pipelining, on a 35 Gbps, 67 ms RTT testbed.
  • Figure 3: Impact of increasing concurrency on aggregate throughput. Throughput scales with the number of concurrent transfers until system and network bottlenecks begin to limit additional gains.
  • Figure 4: Throughput measured while increasing concurrency for a dataset containing three 2 GB files without parallelism.
  • Figure 5: LDM optimizes concurrency, parallelism, and pipelining without being tuned by the end user.
  • ...and 8 more figures
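
For a sense of scale in Figure 2, the bandwidth-delay product (BDP) of the testbed can be computed from the two numbers in the caption. This is a back-of-the-envelope sketch, not a calculation taken from the paper: it assumes that 35 Gbps is the path bandwidth and uses decimal units, and the function name is hypothetical.

```python
def bdp_bytes(bandwidth_gbps, rtt_ms):
    """Bandwidth-delay product in bytes: bandwidth (bits/s) * RTT (s) / 8.

    Integer arithmetic keeps the result exact for whole-number inputs.
    """
    return bandwidth_gbps * 10**9 * rtt_ms // (1000 * 8)


# Testbed values from the Figure 2 caption (assumed to be path bandwidth and RTT).
testbed_bdp = bdp_bytes(35, 67)  # 293_125_000 bytes, roughly 293 MB
```

Both file sizes in the figure (1 MB and 32 MB) are well below this value, so a single file of either size cannot by itself keep such a path full.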