CkIO: Parallel File Input for Over-Decomposed Task-Based Systems

Mathew Jacob; Maya Taylor; Laxmikant Kale

TL;DR

CkIO addresses a critical bottleneck in over-decomposed, task-based HPC systems by decoupling file input from application tasks through a buffer-chare intermediary and a read-session prefetching strategy. It provides a two-phase, asynchronous, callback-centric API built on Charm++ that overlaps I/O with computation and allows the tasks consuming the data to migrate. CkIO achieves I/O throughput competitive with MPI-IO while adding these overlap and migration benefits, and it yields practical gains when integrated with real applications (e.g., about a 2x speedup in ChaNGa). The work lays the groundwork for topology-aware placement, splintered I/O, and broader use in graph analytics and other data-intensive workloads, and it highlights the practical impact of efficient input handling in large-scale asynchronous many-task (AMT) systems.

Abstract

Parallel input performance is often neglected in large-scale parallel applications in Computational Science and Engineering. Traditionally, input performance has received little attention because either input sizes are small (as in biomolecular simulations) or the time spent on input is insignificant compared with a simulation that runs for many timesteps. Newer applications, such as graph algorithms, place a premium on file input performance. Additionally, over-decomposed systems, such as Charm++/AMPI, present new challenges in this context compared with MPI applications. In the over-decomposition model, naive parallel I/O, in which every task makes its own I/O request, is impractical. Furthermore, the load balancing supported by models such as Charm++/AMPI precludes any assumption of data contiguity on individual nodes. We develop a new I/O abstraction that addresses these issues by separating the decomposition of the consumers of input data from that of the file-reader tasks that interact with the file system. This enables applications to scale the number of data consumers without affecting I/O behavior or performance. These ideas are implemented in a new input library, CkIO, built on Charm++, a well-known task-based, over-decomposed parallel programming system. CkIO is configurable through multiple parameters (such as the number of file readers and their placement) that can be tuned to characteristics of the application, such as file size and number of application objects. CkIO also supports effective overlap of input with application-level computation, as well as load balancing and migration. We describe the relevant challenges in understanding file system behavior and architecture, the design alternatives being explored, and preliminary performance data.
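To make the separation between data consumers and file readers concrete, the following is a minimal, illustrative sketch in plain C++. It is not the CkIO API: the names `Piece`, `block_size`, and `split_request` are hypothetical, and the assumption that each reader owns one contiguous, equally sized block of the file is ours, not something the abstract states. The sketch only shows how a consumer's byte-range request could be resolved against a fixed set of readers, so that the number of consumers can change without changing how the file is read.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One piece of a consumer request that is served by a single reader.
// (Hypothetical type for illustration; not part of CkIO.)
struct Piece {
    std::size_t reader;  // index of the reader holding these bytes
    std::size_t offset;  // absolute file offset where the piece starts
    std::size_t length;  // number of bytes in the piece
};

// Size of each reader's contiguous block when the file is split evenly
// (ceiling division; the last reader may own fewer bytes).
std::size_t block_size(std::size_t file_bytes, std::size_t num_readers) {
    assert(num_readers > 0);
    return (file_bytes + num_readers - 1) / num_readers;
}

// Split the consumer request [offset, offset + length) into per-reader pieces.
// The result depends only on the file size and the reader count, never on
// how many consumers exist or where they currently live.
std::vector<Piece> split_request(std::size_t file_bytes, std::size_t num_readers,
                                 std::size_t offset, std::size_t length) {
    assert(offset + length <= file_bytes);
    std::vector<Piece> pieces;
    if (length == 0) return pieces;  // nothing to read

    const std::size_t block = block_size(file_bytes, num_readers);
    const std::size_t end = offset + length;
    std::size_t pos = offset;
    while (pos < end) {
        const std::size_t reader = pos / block;
        // End of this reader's block, clipped to the end of the file.
        const std::size_t block_end = std::min((reader + 1) * block, file_bytes);
        const std::size_t piece_end = std::min(end, block_end);
        pieces.push_back({reader, pos, piece_end - pos});
        pos = piece_end;
    }
    return pieces;
}
```

For example, with a 100-byte file and 4 readers, a consumer asking for 40 bytes starting at offset 20 would be served by three readers. In a real over-decomposed run, each piece would arrive asynchronously and the consumer could have migrated in the meantime, which this sketch does not model.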
