CkIO: Parallel File Input for Over-Decomposed Task-Based Systems

Mathew Jacob, Maya Taylor, Laxmikant Kale

TL;DR

CkIO addresses a critical bottleneck in over-decomposed, task-based HPC systems by decoupling file input from application tasks through a buffer-chare intermediary and a read-session prefetching strategy. It employs a two-phase, asynchronous, callback-centric API built on Charm++ to overlap I/O with computation and to keep data-consumer tasks migratable. The solution demonstrates I/O throughput competitive with MPI-IO, along with significant overlap and mobility benefits, and yields practical performance improvements (e.g., about 2x speedup in ChaNGa) when integrated with real applications. The work lays groundwork for topology-aware placement, splintered I/O, and broader applicability to graph analytics and data-intensive workloads, highlighting the practical impact of efficient input handling in large-scale asynchronous many-task (AMT) systems.

Abstract

Parallel input performance is often neglected in large-scale parallel applications in Computational Science and Engineering. Traditionally, input performance has received less attention because either input sizes are small (as in biomolecular simulations) or the time spent on input is insignificant compared with a simulation that runs for many timesteps. But newer applications, such as graph algorithms, place a premium on file input performance. Additionally, over-decomposed systems, such as Charm++/AMPI, present new challenges in this context compared with MPI applications. In the over-decomposition model, naive parallel I/O in which every task makes its own I/O request is impractical. Furthermore, load balancing supported by models such as Charm++/AMPI precludes any assumption of data contiguity on individual nodes. We develop a new I/O abstraction to address these issues by separating the decomposition of consumers of input data from that of file-reader tasks that interact with the file system. This enables applications to scale the number of data consumers without affecting I/O behavior or performance. These ideas are implemented in a new input library, CkIO, built on Charm++, a well-known task-based, over-decomposed system. CkIO is configurable via multiple parameters (such as the number of file readers and/or their placement) that can be tuned to characteristics of the application, such as file size and number of application objects. Additionally, CkIO input enables capabilities such as effective overlap of input with application-level computation, as well as load balancing and migration. We describe the relevant challenges in understanding file system behavior and architecture, the design alternatives being explored, and preliminary performance data.
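The separation of consumer decomposition from reader decomposition can be illustrated with a small sketch. It assumes, purely for illustration, that the file is split into equal contiguous blocks, one per reader task, and that a consumer's byte-range request is served by whichever readers own the overlapping blocks. The names `Piece`, `reader_block`, and `pieces_for_request` are hypothetical and are not part of the CkIO API; the actual partitioning used by CkIO may differ.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Piece:
    reader: int  # index of the hypothetical reader task holding these bytes
    offset: int  # absolute file offset where this piece starts
    length: int  # number of bytes in this piece


def block_size(file_size: int, num_readers: int) -> int:
    """Bytes per reader block (ceiling division), an assumed equal split."""
    if num_readers <= 0:
        raise ValueError("num_readers must be positive")
    return -(-file_size // num_readers)


def reader_block(file_size: int, num_readers: int, reader: int) -> tuple[int, int]:
    """Return the half-open byte range [start, end) owned by `reader`."""
    block = block_size(file_size, num_readers)
    start = min(reader * block, file_size)
    end = min(start + block, file_size)
    return start, end


def pieces_for_request(
    file_size: int, num_readers: int, offset: int, length: int
) -> list[Piece]:
    """Split a consumer request into per-reader pieces, in file order."""
    if offset < 0 or length < 0 or offset + length > file_size:
        raise ValueError("request lies outside the file")
    block = block_size(file_size, num_readers)
    pieces = []
    pos, end = offset, offset + length
    while pos < end:
        reader = pos // block
        _, owned_end = reader_block(file_size, num_readers, reader)
        take = min(end, owned_end) - pos
        pieces.append(Piece(reader, pos, take))
        pos += take
    return pieces
```

In this sketch the reader partition depends only on the file size and the number of readers, so the number of consumers can change without changing which byte ranges are read from the file system. For example, `pieces_for_request(100, 4, 20, 40)` splits one consumer's request across readers 0, 1, and 2.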

Paper Structure

This paper contains 33 sections, 12 figures.

Figures (12)

  • Figure 1: Naive overdecomposed input in Charm++. Results were produced on Bridges2 using 16 nodes and 512 PEs, where each data point denotes an average over 10 runs.
  • Figure 2: This graph compares the time to read data from the file system with the time to send that data across nodes. The x-axis denotes the size of the file being read and sent over the network, while the y-axis denotes the time.
  • Figure 3: Schematics of (a) naive parallel input vs (b) input with CkIO. In the naive implementation, application chares interact directly with the file system. With CkIO, a layer of buffer chares is used to abstract the file system interaction away.
  • Figure 4: Performance of naive parallel input (where each client directly makes file-system calls) vs input with CkIO reading from a single 4GB file on Bridges2 (16 nodes, 32 tasks per node). As the number of clients varies, CkIO provides consistent performance comparable to the optimal input performance. The vertical bars indicate variability due to file system and compute node contention.
  • Figure 5: Diagram of the CkIO system architecture. Note that the Buffer Chares begin reading on session instantiation, without waiting for client requests. Additionally, the ReadAssemblers are created on instantiation but are not yet active.
  • ...and 7 more figures