To Stream or Not to Stream: Towards A Quantitative Model for Remote HPC Processing Decisions

Flavio Castro; Weijian Zheng; Joaquin Chung; Ian Foster; Rajkumar Kettimuthu

TL;DR

This paper tackles the challenge of deciding when real-time remote HPC streaming outperforms local or file-based processing for data-intensive scientific workflows. It introduces a quantitative model centered on total processing completion time $T_{pct}$ and a Streaming Speed Score ($SSS$) that captures worst-case transfer latency under congestion, incorporating data generation rate, transfer efficiency, I/O overhead, and remote compute speed. The approach is parameterized by $S_{unit}$, $C$, $R_{local}$, $R_{remote}$, $Bw$, $R_{transfer}$, $\theta$, and $r$, and is validated through controlled experiments and case studies drawn from facilities such as APS, FRIB, LCLS-II, and the LHC, including a hypothetical LCLS-II-inspired scenario. Key findings show that streaming can reduce end-to-end completion time by up to 97% relative to file-based methods under high data rates, but tail latency and congestion can inflate transfer times by over an order of magnitude, underscoring the need for tail-latency-aware design and measurement. The work provides a practical decision-support framework for facility operators to assess streaming feasibility and optimize data workflows in time-sensitive experiments.
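To make the three-way comparison concrete, the following is a minimal Python sketch of completion times for local, streaming, and file-based processing. It is not the paper's model: the formulas for $T_{pct}$ and $SSS$ are not given in this summary, so the sketch assumes a simple pipeline in which each data unit is generated, transferred, and computed, and the slowest stage paces the stream. The names (Workload, Resources, pipeline_time, completion_times) and the parameter interpretations are hypothetical, only loosely mirroring the unit size, data generation rate, bandwidth, transfer efficiency, I/O overhead, and local and remote compute rates listed above.

```python
from dataclasses import dataclass

# Hypothetical, simplified pipeline model; NOT the paper's T_pct formula.

@dataclass
class Workload:
    n_units: int          # number of data units the instrument produces
    unit_bytes: float     # size of one data unit (bytes)
    gen_rate: float       # data units generated per second
    work_per_unit: float  # compute cost per unit (abstract operations)

@dataclass
class Resources:
    local_rate: float     # local compute throughput (operations/s)
    remote_rate: float    # remote HPC compute throughput (operations/s)
    bandwidth: float      # link bandwidth (bytes/s)
    efficiency: float     # achieved fraction of bandwidth, in (0, 1]
    io_per_unit: float    # file write + read overhead per unit (s), file-based path only

def pipeline_time(n, stage_times):
    """Completion time of an n-unit pipeline: the first unit passes every stage,
    then the slowest stage paces the remaining n - 1 units."""
    return sum(stage_times) + (n - 1) * max(stage_times)

def completion_times(w, r):
    t_gen = 1.0 / w.gen_rate
    t_xfer = w.unit_bytes / (r.bandwidth * r.efficiency)
    t_local = w.work_per_unit / r.local_rate
    t_remote = w.work_per_unit / r.remote_rate
    return {
        # Local: generation overlapped with local compute
        "local": pipeline_time(w.n_units, [t_gen, t_local]),
        # Streaming: generation, transfer, and remote compute all overlap
        "stream": pipeline_time(w.n_units, [t_gen, t_xfer, t_remote]),
        # File-based: acquire all data, stage files (I/O + transfer), then compute; no overlap
        "file": w.n_units * (t_gen + r.io_per_unit + t_xfer + t_remote),
    }

if __name__ == "__main__":
    # Made-up values: 1,000 units of 1 GB at 10 units/s, 100 Gb/s link at 80% efficiency
    w = Workload(n_units=1000, unit_bytes=1e9, gen_rate=10.0, work_per_unit=1e12)
    r = Resources(local_rate=1e12, remote_rate=1e14, bandwidth=12.5e9,
                  efficiency=0.8, io_per_unit=0.2)
    for mode, t in completion_times(w, r).items():
        print(f"{mode:>6}: {t:8.1f} s")
```

With these made-up values (remote compute 100 times faster than local), the sketch gives roughly 1000 s locally, 410 s file-based, and 100 s streaming. The numbers only illustrate how overlapping stages lets streaming win once the data rate outpaces local compute; they are not results from the paper.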

Abstract

Modern scientific instruments generate data at rates that increasingly exceed local compute capabilities; combined with the staging and I/O overheads of file-based transfers, these rates also make file-based use of remote high-performance computing (HPC) resources impractical for time-sensitive analysis and experimental steering. Real-time streaming frameworks promise to reduce latency and improve system efficiency, but lack a principled way to assess their feasibility. In this work, we introduce a quantitative framework and an accompanying Streaming Speed Score to evaluate whether remote HPC resources can provide timely data processing compared to local alternatives. Our model incorporates key parameters, including data generation rate, transfer efficiency, remote processing power, and file input/output overhead, to compute total processing completion time and identify operational regimes where streaming is beneficial. We motivate our methodology with use cases from facilities such as APS, FRIB, LCLS-II, and the LHC, and validate our approach through an illustrative case study based on LCLS-II data. Our measurements show that streaming can achieve up to 97% lower end-to-end completion time than file-based methods under high data rates, while worst-case congestion can increase transfer times by over an order of magnitude, underscoring the importance of tail latency in streaming feasibility decisions.
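The abstract's point about tail latency can be illustrated with a second hypothetical sketch: judge streaming feasibility with a high percentile of measured per-unit transfer times rather than the median. The helpers (percentile, stream_time, streaming_feasible), the nearest-rank percentile, the deadline, and the pessimistic assumption that every unit sees the tail transfer time are all illustrative choices. This is not the Streaming Speed Score, whose definition is not given in this excerpt.

```python
import math

def percentile(samples, q):
    """Nearest-rank percentile of a non-empty list, with q in (0, 100]."""
    if not samples:
        raise ValueError("need at least one transfer-time sample")
    ordered = sorted(samples)
    rank = math.ceil(q * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

def stream_time(n_units, t_gen, t_xfer, t_compute):
    """Pipeline completion time: the first unit crosses all stages,
    then the slowest stage paces the remaining units."""
    stages = [t_gen, t_xfer, t_compute]
    return sum(stages) + (n_units - 1) * max(stages)

def streaming_feasible(xfer_samples, n_units, t_gen, t_compute, deadline, q=99):
    """Hypothetical check: charge every unit the q-th percentile transfer time
    (a pessimistic assumption) and compare the completion time to a deadline."""
    t_tail = percentile(xfer_samples, q)
    return stream_time(n_units, t_gen, t_tail, t_compute) <= deadline

if __name__ == "__main__":
    # Synthetic transfer times (s): mostly fast, with a congested 5% tail
    samples = [0.1] * 95 + [1.5] * 5
    for q in (50, 99):
        t = stream_time(1000, 0.1, percentile(samples, q), 0.01)
        ok = streaming_feasible(samples, 1000, 0.1, 0.01, deadline=200.0, q=q)
        print(f"p{q}: {t:.2f} s, feasible={ok}")
```

With the synthetic samples (95% of transfers at 0.1 s, 5% at 1.5 s), the median-based estimate of about 100 s meets a 200 s deadline, while the 99th-percentile estimate of about 1500 s does not. Charging every unit the tail time is deliberately conservative; a practical decision would also weigh how often congestion actually occurs.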
