Design and Implementation of an Analysis Pipeline for Heterogeneous Data

Arup Kumar Sarker, Aymen Alsaadi, Niranda Perera, Mills Staylor, Gregor von Laszewski, Matteo Turilli, Ozgur Ozan Kilic, Mikhail Titov, Andre Merzky, Shantenu Jha, Geoffrey Fox

TL;DR

The paper tackles the challenge of integrating data engineering and deep learning workflows on HPC platforms by introducing Radical-Cylon (RP-Cylon), a heterogeneous runtime that executes Cylon tasks as RADICAL-Pilot jobs, each with its own private MPI communicator. The loosely coupled RP-Cylon design enables unified resource management across CPUs and GPUs on clouds and supercomputers, and its performance is comparable to or better than that of bare-metal Cylon (BM-Cylon), especially at higher parallelism and with multi-pipeline workloads. Extensive experiments on UVA Rivanna and ORNL Summit show that RP-Cylon scales with small overheads and runs heterogeneous pipelines 4–15% faster than batch BM-Cylon, with solid performance on datasets ranging from tens of millions to billions of rows. This open-source approach advances reproducible, scalable ML and data analytics on HPC, providing a flexible framework for integrating data engineering with DL workloads across diverse platforms.
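One way to picture "each task gets its own private MPI communicator" is the rank bookkeeping that MPI_Comm_split performs: every process in a pilot-wide allocation is given a color (which task it belongs to) and a key (its order inside that task). The pure-Python sketch below only simulates that bookkeeping. The function `split_ranks`, the contiguous packing policy, and the `NO_COMM` marker are hypothetical illustrations, not RADICAL-Pilot's actual placement logic.

```python
from typing import Dict, List, Tuple

NO_COMM = -1  # hypothetical stand-in for MPI_UNDEFINED: rank joins no task communicator


def split_ranks(world_size: int, task_sizes: List[int]) -> Dict[int, Tuple[int, int]]:
    """Map each pilot-wide rank to (task_color, local_rank).

    Hypothetical placement policy: ranks are packed contiguously, so task 0
    gets world ranks [0, task_sizes[0]), task 1 the next task_sizes[1] ranks,
    and so on. Ranks not requested by any task get (NO_COMM, NO_COMM).
    """
    if sum(task_sizes) > world_size:
        raise ValueError("tasks request more ranks than the pilot holds")
    mapping: Dict[int, Tuple[int, int]] = {}
    rank = 0
    for color, size in enumerate(task_sizes):
        # Local ranks follow world-rank order, as when key = world rank
        # is passed to MPI_Comm_split.
        for local in range(size):
            mapping[rank] = (color, local)
            rank += 1
    # Leftover ranks stay outside every task communicator.
    for leftover in range(rank, world_size):
        mapping[leftover] = (NO_COMM, NO_COMM)
    return mapping
```

For example, `split_ranks(8, [3, 4])` places world ranks 0 to 2 in task 0, ranks 3 to 6 in task 1, and leaves rank 7 unassigned. With mpi4py, each process would then pass its color (or `MPI.UNDEFINED`) and key to `MPI.COMM_WORLD.Split(color, key)`; how RADICAL-Pilot actually launches and isolates Cylon tasks is described in the paper, not in this sketch.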

Abstract

Managing and preparing complex data for deep learning, a prevalent approach in large-scale data science, can be challenging. Data transfer for model training also presents difficulties, affecting scientific fields such as genomics, climate modeling, and astronomy. A large-scale solution such as Google Pathways, which provides a distributed execution environment for deep learning models, exists but is proprietary. Integrating existing open-source, scalable runtime tools and data frameworks on high-performance computing (HPC) platforms is therefore crucial to address these challenges. Our objective is to establish a smooth, unified method of combining data engineering and deep learning frameworks with diverse execution capabilities that can be deployed on various HPC platforms, including clouds and supercomputers. We aim to support heterogeneous systems with accelerators, where Cylon and other data engineering and deep learning frameworks can utilize heterogeneous execution. To achieve this, we propose Radical-Cylon, a heterogeneous runtime system that executes Cylon, a parallel and distributed data framework, as a task of Radical Pilot. We explain in detail Radical-Cylon's design and development, and how Cylon tasks are executed using Radical Pilot. This approach enables the use of heterogeneous MPI communicators across multiple nodes. Radical-Cylon achieves better performance than Bare-Metal Cylon with minimal and constant overhead. Radical-Cylon achieves 4–15% faster execution than batch execution when performing similar join and sort operations on 35 million and 3.5 billion rows with the same resources. The approach targets both scientific and engineering research HPC systems while also demonstrating robust performance on cloud infrastructure. This dual capability fosters collaboration and innovation within the open-source scientific research community.
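The join benchmarks operate on tables partitioned across many ranks. A common strategy for joining such tables is to hash-shuffle both sides on the join key, so that equal keys meet on the same rank, and then join locally on each rank. We assume, without the text here confirming it, that Cylon's distributed join follows a similar shuffle-then-local-join pattern. The sketch below simulates ranks as Python lists; the names `shuffle`, `local_hash_join`, and `distributed_join` are hypothetical, and a real implementation would replace `shuffle` with an MPI all-to-all exchange.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple

Row = Tuple[Hashable, object]             # (join key, payload)
Joined = Tuple[Hashable, object, object]  # (key, left payload, right payload)


def shuffle(partitions: List[List[Row]], n_ranks: int) -> List[List[Row]]:
    """Simulated all-to-all exchange: route every row to rank hash(key) % n_ranks."""
    out: List[List[Row]] = [[] for _ in range(n_ranks)]
    for part in partitions:
        for key, payload in part:
            out[hash(key) % n_ranks].append((key, payload))
    return out


def local_hash_join(left: List[Row], right: List[Row]) -> List[Joined]:
    """Inner join on one rank: build a hash table on the left, probe with the right."""
    table: Dict[Hashable, List[object]] = defaultdict(list)
    for key, payload in left:
        table[key].append(payload)
    return [(key, lp, rp) for key, rp in right for lp in table.get(key, [])]


def distributed_join(left_parts: List[List[Row]],
                     right_parts: List[List[Row]]) -> List[List[Joined]]:
    """Shuffle both tables by key, then join locally; one output list per rank."""
    n_ranks = len(left_parts)
    left_shuffled = shuffle(left_parts, n_ranks)
    right_shuffled = shuffle(right_parts, n_ranks)
    # After the shuffle, equal keys from both sides sit on the same rank,
    # so the per-rank joins need no further communication.
    return [local_hash_join(left_rows, right_rows)
            for left_rows, right_rows in zip(left_shuffled, right_shuffled)]
```

The shuffle is where the communication cost lives, which is why benchmarks at 35 million and 3.5 billion rows stress both the data framework and the communicator it runs on.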

Paper Structure

This paper contains 13 sections, 11 figures, and 2 tables.

Figures (11)

  • Figure 1: Cylon Layered Architecture. Viewed bottom-up, the hardware layer is compatible with vendor-specific or open-source transport layers perera2023depth
  • Figure 2: Cylon Communicator Model. It provides cross-platform support for Open-MPI, GLOO, and UCX perera2023depth
  • Figure 3: Radical-Cylon Architecture. A modular design in which dependent components are segregated into independent modules, with a top-down flow from cross-platform layers to hardware resources.
  • Figure 4: Heterogeneous Execution with Control and Data Flow. The execution pipeline uses a separate SPMD framework for underlying tasks.
  • Figure 5: Comparison of strong scaling (left) and weak scaling (right) performance of Bare-Metal Cylon and Radical-Cylon with the join operation on Rivanna. Execution time (s) is measured by running the task for 10 iterations. Parallelism is the number of nodes multiplied by 37 cores per node.
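Figure 5's caption defines parallelism as the number of nodes multiplied by 37 cores per node. The sketch below shows how such strong- and weak-scaling plots are commonly read. The efficiency formulas are standard textbook definitions, not taken from the paper, and the function names are hypothetical.

```python
CORES_PER_NODE = 37  # per-node core count quoted in the Figure 5 caption


def parallelism(nodes: int, cores_per_node: int = CORES_PER_NODE) -> int:
    """Parallelism as defined in Figure 5: nodes multiplied by cores per node."""
    return nodes * cores_per_node


def strong_scaling_efficiency(t_base: float, p_base: int, t: float, p: int) -> float:
    """Strong scaling (fixed total problem size): ideal time is t_base * p_base / p.

    Returns 1.0 for perfect scaling and less than 1.0 when overheads grow.
    """
    return (t_base * p_base) / (t * p)


def weak_scaling_efficiency(t_base: float, t: float) -> float:
    """Weak scaling (fixed work per worker): ideal time stays equal to t_base."""
    return t_base / t
```

With hypothetical numbers (not measurements from the paper), `parallelism(4)` gives 148 workers, and a run taking 100 s at 37 workers and 25 s at 148 workers has a strong-scaling efficiency of 1.0. In a weak-scaling plot, a roughly flat execution-time curve corresponds to an efficiency near 1.0.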