Reexamining Paradigms of End-to-End Data Movement

Chin Fang, Timothy Stitt, Michael J. McManus, Toshio Moriya

TL;DR

The paper argues that end-to-end data movement is limited by the entire data path, not just network bandwidth, and advocates a holistic, co-designed approach using burst buffers and the ZX data mover. It reexamines six prevalent paradigms (latency sensitivity, packet loss and TCP congestion controls, private testing lines, bandwidth, CPU power, and cloud virtualization) and provides empirical evidence from latency-emulation testbeds and real 100 Gbps links showing that storage I/O, software efficiency, and architectural co-design often dominate performance. A key contribution is the demonstration that integrated data-movement appliances, including DPU-enabled paths, can achieve near-line-rate transfers even in resource-constrained environments, while cloud paths introduce measurable penalties that can be mitigated by co-designed data paths. The work also presents a reproducible methodology and publicly available testbeds, supporting broader adoption of high-performance data movement across edge, core, and cloud platforms.

Abstract

The pursuit of high-performance data transfer often focuses on raw network bandwidth, and international links of 100 Gbps or higher are frequently considered the primary enabler. While necessary, this network-centric view is incomplete because it equates provisioned link speeds with practical, sustainable data movement capabilities across the entire edge-to-core spectrum. This paper investigates six common paradigms, from the often-cited constraints of network latency and TCP congestion control algorithms to host-side factors such as CPU performance and virtualization, that critically impact data movement workflows. We validated our findings using a latency-emulation-capable testbed for high-speed WAN performance prediction and through extensive production measurements, ranging from resource-constrained edge environments to a 100 Gbps operational link connecting Switzerland and California, U.S. These results show that the principal bottlenecks often reside outside the network core, and that a holistic hardware-software co-design ensures consistent performance, whether moving data at 1 Gbps or 100 Gbps and faster. This approach effectively closes the fidelity gap between benchmark results and diverse, complex production environments.

Paper Structure

This paper contains 28 sections, 13 figures, and 10 tables.

Figures (13)

  • Figure 1: iperf3 latency sweep results obtained using two HPE DL380 Gen 11 server-based appliances (Fig. 2) on a latency simulation-capable testbed (max network speed 100 Gbps) established in the former Intel Swindon Lab, Swindon, U.K., in 2024. While default kernel networking settings (OOTB) show severe performance degradation under high latency, proper kernel tuning substantially reduces this penalty.
  • Figure 2: Bill of Materials (BOM) and component details for the Core (HPE DL380 Gen 11) and Mini (Minisforum MS A2) appliances, demonstrating the vendor- and form-factor-agnostic unified data movement appliance design which achieves consistent performance across implementations.
  • Figure 3: The experience of moving data for most practitioners is typically limited to the source of the river, which corresponds to end-user activities, such as transferring photos and videos from mobile phones and moving spreadsheets from one folder to another. At such low data rates, operations appear simple. This may explain why efficient data transfer has historically received limited attention.
  • Figure 4: A bulk transfer sweep leveraging kTLS (kernel TLS) offload in RHEL 9.6 r36 to evaluate the three default congestion control algorithms (CCAs). BBRv1 did not demonstrate a clear performance benefit over CUBIC or Reno. Incidentally, kTLS did not yield any transfer rate improvement in this case.
  • Figure 5: Subsequent bulk transfer sweeps were performed without kTLS due to the prior degradation. Using the default RHEL 9.6 CCAs, BBRv1 and CUBIC exhibited identical throughput across 4 MiB to 1 TiB. Reno showed degradation consistent with older congestion control paradigms.
  • ...and 8 more figures
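
The Figure 1 caption notes that default kernel settings degrade sharply under high latency while tuned settings do not. One common way to reason about this effect is the bandwidth-delay product (BDP): a single TCP flow cannot exceed its window size divided by the round-trip time. The sketch below is only an illustration of that general relationship, not the authors' methodology. The function names, the RTT values, and the 6 MiB window are hypothetical choices made for the example and are not measurements from the paper.

```python
# Illustrative sketch only: RTTs and the window size are assumed values,
# not figures taken from the paper.

def bdp_bytes(rate_gbps: float, rtt_ms: float) -> float:
    """Bytes that must be in flight to keep a link of rate_gbps full at rtt_ms."""
    return rate_gbps * 1e9 / 8 * rtt_ms / 1e3


def window_limited_gbps(window_bytes: float, rtt_ms: float) -> float:
    """Upper bound on single-flow throughput (Gbps) for a fixed window and RTT."""
    return window_bytes * 8 / (rtt_ms / 1e3) / 1e9


if __name__ == "__main__":
    link_gbps = 100.0                 # link speed mentioned in the caption
    assumed_window = 6 * 1024 * 1024  # hypothetical 6 MiB socket buffer cap
    for rtt_ms in (1, 10, 50, 150):   # hypothetical emulated RTTs
        need_mib = bdp_bytes(link_gbps, rtt_ms) / 2**20
        cap = min(link_gbps, window_limited_gbps(assumed_window, rtt_ms))
        print(f"RTT {rtt_ms:>3} ms: BDP {need_mib:8.1f} MiB, "
              f"bound with 6 MiB window {cap:6.2f} Gbps")
```

Under these assumptions the required in-flight data grows linearly with RTT, so a buffer limit that is harmless at short RTTs becomes the bottleneck on long paths. This is consistent with the caption's observation that kernel tuning substantially reduces the latency penalty.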