To Stream or Not to Stream: Towards A Quantitative Model for Remote HPC Processing Decisions

Flavio Castro, Weijian Zheng, Joaquin Chung, Ian Foster, Rajkumar Kettimuthu

TL;DR

This paper tackles the challenge of deciding when real-time remote HPC streaming outperforms local or file-based processing for data-intensive scientific workflows. It introduces a quantitative model centered on total completion time $T_{pct}$ and a Streaming Speed Score $SSS$ to capture worst-case transfer latency under congestion, incorporating data generation rate, transfer efficiency, I/O overhead, and remote compute speed. The approach is parameterized by $S_{unit}$, $C$, $R_{local}$, $R_{remote}$, $Bw$, $R_{transfer}$, $\theta$, and $r$, and is validated through controlled experiments and case studies drawn from facilities like APS, FRIB, LCLS-II, and the LHC, including a hypothetical LCLS-II-inspired scenario. Key findings show that streaming can reduce end-to-end completion times by up to 97% under high data rates, but tail latency and congestion can dramatically inflate transfer times, underscoring the need for tail-latency-aware design and measurement. The work provides a practical decision-support framework for facility operators to assess streaming feasibility and optimize data workflows in time-sensitive experiments.

Abstract

Modern scientific instruments generate data at rates that increasingly exceed local compute capabilities and, when paired with the staging and I/O overheads of file-based transfers, also render file-based use of remote HPC resources impractical for time-sensitive analysis and experimental steering. Real-time streaming frameworks promise to reduce latency and improve system efficiency, but there is no principled way to assess their feasibility. In this work, we introduce a quantitative framework and an accompanying Streaming Speed Score to evaluate whether remote high-performance computing (HPC) resources can provide timely data processing compared to local alternatives. Our model incorporates key parameters including data generation rate, transfer efficiency, remote processing power, and file input/output overhead to compute total processing completion time and identify operational regimes where streaming is beneficial. We motivate our methodology with use cases from facilities such as APS, FRIB, LCLS-II, and the LHC, and validate our approach through an illustrative case study based on LCLS-II data. Our measurements show that streaming can achieve up to 97% lower end-to-end completion time than file-based methods under high data rates, while worst-case congestion can increase transfer times by over an order of magnitude, underscoring the importance of tail latency in streaming feasibility decisions.
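The kind of comparison the abstract describes can be illustrated with a rough sketch. The code below is not the paper's model: the formulas are a deliberately simplified assumption (transfer and compute happen one after the other, and file-based processing pays one extra fixed I/O overhead), and every name, unit, and number in it is hypothetical. It only shows how data volume, transfer efficiency, remote compute speed, and file I/O overhead might combine into a completion-time estimate for each option.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    """Hypothetical inputs; names and units are illustrative, not the paper's symbols."""
    data_gb: float                # total data to process (GB)
    local_rate_gb_per_s: float    # local processing throughput (GB/s)
    remote_rate_gb_per_s: float   # remote HPC processing throughput (GB/s)
    link_gb_per_s: float          # nominal network bandwidth (GB/s)
    transfer_efficiency: float    # fraction of nominal bandwidth achieved, in (0, 1]
    file_io_overhead_s: float     # assumed extra staging / file I/O time (s)


def local_time(s: Scenario) -> float:
    """Completion time when all data is processed locally."""
    return s.data_gb / s.local_rate_gb_per_s


def streaming_time(s: Scenario) -> float:
    """Assumed streaming cost: network transfer followed by remote compute."""
    transfer = s.data_gb / (s.link_gb_per_s * s.transfer_efficiency)
    compute = s.data_gb / s.remote_rate_gb_per_s
    return transfer + compute


def file_based_time(s: Scenario) -> float:
    """Assumed file-based cost: the streaming cost plus a fixed file I/O overhead."""
    return streaming_time(s) + s.file_io_overhead_s


def compare(s: Scenario) -> tuple[str, dict[str, float]]:
    """Return the fastest option and the estimated time of every option."""
    times = {
        "local": local_time(s),
        "streaming": streaming_time(s),
        "file-based": file_based_time(s),
    }
    return min(times, key=times.get), times


# Made-up numbers, chosen only to exercise the functions.
example = Scenario(
    data_gb=100.0,
    local_rate_gb_per_s=1.0,
    remote_rate_gb_per_s=20.0,
    link_gb_per_s=12.5,
    transfer_efficiency=0.8,
    file_io_overhead_s=30.0,
)
best, times = compare(example)
print(best, times)
```

Lowering `transfer_efficiency` in this sketch is a crude stand-in for congestion: as it shrinks, the transfer term grows and the local option eventually wins, which is the kind of regime boundary a feasibility model has to locate.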

Paper Structure

This paper contains 16 sections, 9 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Data movement approaches for remote data analysis, adapted from scistream-hpdc22.
  • Figure 2: Maximum transfer time vs. load for 0.5 GB transfers with P = 2, 4, and 8 parallel TCP flows: (a) simultaneous batches show non-linear growth of transfer time above 90% utilization due to congestion; (b) scheduled batches maintain steady transfer times.
  • Figure 3: Cumulative probability distribution of total transfer time, including each file transfer.
  • Figure 4: Comparison of streaming and file-based data transfer performance between the APS Voyager GPFS file system and the ALCF Eagle Lustre file system.