In-Situ Techniques on GPU-Accelerated Data-Intensive Applications

Yi Ju, Mingshuai Li, Adalberto Perez, Laura Bellentani, Niclas Jansson, Stefano Markidis, Philipp Schlatter, Erwin Laure

TL;DR

This work investigates in-situ techniques on GPU-accelerated data-intensive HPC applications to mitigate IO bottlenecks and improve resource utilization. It formalizes synchronous, asynchronous, and hybrid in-situ workflows using adaptor functions and ADIOS2 for cross-language data exchange, under an MPMD (multiple program, multiple data) resource framework where $p_t = p_o + p_i$, and evaluates them on the Raven HPC system. Through CFD (NEKO) and MD (QE) case studies, it demonstrates that asynchronous in-situ tasks often reduce total runtime and IO traffic by leveraging idle CPU cores on GPU nodes, with the hybrid approach offering advantages when compression is involved. The findings provide practical guidance for deploying in-situ techniques on heterogeneous GPU-accelerated workloads and point to future work on extending these methods to AI pipelines and dynamic resource management, potentially enabling more frequent checkpoints and richer data analysis without IO penalties.

Abstract

The computational power of High-Performance Computing (HPC) systems is constantly increasing; however, their input/output (IO) performance grows relatively slowly, and their storage capacity is also limited. This imbalance presents significant challenges for applications such as Molecular Dynamics (MD) and Computational Fluid Dynamics (CFD), which generate massive amounts of data for further visualization or analysis. At the same time, checkpointing is crucial for long runs on HPC clusters, due to limited walltimes and/or failures of system components, and typically requires the storage of large amounts of data. Thus, restricted IO performance and storage capacity can lead to bottlenecks for the performance of full application workflows (as compared to computational kernels without IO). In-situ techniques, where data is further processed while still in memory rather than being written out over the IO subsystem, can help to tackle these problems. In contrast to traditional post-processing methods, in-situ techniques can reduce or avoid the need to write or read data via the IO subsystem. They offer a promising approach for applications aiming to leverage the full power of large-scale HPC systems. In-situ techniques can also be applied to hybrid computational nodes on HPC systems consisting of graphics processing units (GPUs) and central processing units (CPUs). On such a node, the GPUs typically have significant performance advantages over the CPUs. Therefore, current approaches for GPU-accelerated applications often focus on maximizing GPU usage, leaving CPUs underutilized. In-situ tasks that use CPUs to perform data analysis or preprocess data concurrently with the running simulation offer a way to reduce this underutilization.
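
The idea of an in-situ task running concurrently with the simulation can be illustrated with a minimal Python sketch. It is not the paper's implementation: the paper couples separate programs through ADIOS2, whereas this sketch assumes a single process with a worker thread and an in-memory queue as a stand-in. All names (`run_simulation`, `in_situ_task`, `mean`) and the toy update rule are hypothetical.

```python
import queue
import threading


def run_simulation(n_steps, output_every, in_situ_task):
    """Toy simulation loop that hands in-memory snapshots to a worker thread."""
    snapshots = queue.Queue()
    results = []

    def worker():
        # Consume snapshots until the sentinel None arrives.
        while True:
            item = snapshots.get()
            if item is None:
                break
            step, data = item
            # Process the data while it is still in memory; nothing is written to disk.
            results.append((step, in_situ_task(data)))

    consumer = threading.Thread(target=worker)
    consumer.start()

    state = [0.0] * 8  # stand-in for the simulation field
    for step in range(1, n_steps + 1):
        state = [x + 1.0 for x in state]  # stand-in for one compute step
        if step % output_every == 0:
            # Copy so the next step can overwrite `state` while the worker runs.
            snapshots.put((step, list(state)))

    snapshots.put(None)  # signal that no more data will arrive
    consumer.join()
    return results


def mean(values):
    """Hypothetical in-situ reduction: collapse a snapshot to a single number."""
    return sum(values) / len(values)


summary = run_simulation(n_steps=100, output_every=50, in_situ_task=mean)
print(summary)  # [(50, 50.0), (100, 100.0)]
```

In this sketch the simulation loop only pays for copying the snapshot into the queue, while the reduction runs on the worker. This loosely mirrors the asynchronous case described above, where otherwise idle CPU cores process data alongside the GPU-bound simulation.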

Paper Structure

This paper contains 10 sections, 1 equation, 12 figures, and 2 tables.

Figures (12)

  • Figure 1: Illustration of workflow applications with synchronous, asynchronous and hybrid in-situ tasks.
  • Figure 2: Execution time of CPU-based NEKO with synchronous and asynchronous image generation on various numbers of fully used Raven CPU node(s).
  • Figure 3: Execution time of GPU-accelerated NEKO with synchronous image generation on two Raven GPU nodes with full usage of eight GPUs and various numbers of CPU cores.
  • Figure 4: Execution time of GPU-accelerated NEKO with asynchronous image generation every 50 simulation steps on two Raven GPU nodes with various CPU cores for NEKO and 16 CPU cores for image generation (left), 16 CPU cores for NEKO and various CPU cores for image generation (middle) and the same number of CPU cores for NEKO and image generation (right). In all cases, all eight GPUs on the GPU nodes are used for NEKO.
  • Figure 5: Execution time of GPU-accelerated NEKO with asynchronous image generation every ten simulation steps on two Raven GPU nodes with 16 CPU cores for NEKO and various CPU cores for image generation. All eight GPUs on the GPU nodes are used for NEKO.
  • ...and 7 more figures