Deep RC: A Scalable Data Engineering and Deep Learning Pipeline

Arup Kumar Sarker, Aymen Alsaadi, Alexander James Halpern, Prabhath Tangella, Mikhail Titov, Niranda Perera, Mills Staylor, Gregor von Laszewski, Shantenu Jha, Geoffrey Fox

TL;DR

Deep RC tackles the bottleneck of preparing massive heterogeneous data for deep learning by fusing data engineering, deep learning, and workflow management across HPC and cloud platforms. The approach introduces a modular Deep RC Bridge that connects Cylon-based ETL with PyTorch/TensorFlow DL tasks, using RADICAL-Pilot to orchestrate MPI/GLOO/NCCL communicators across CPU/GPU resources and enabling a zero-copy distributed data loader. Empirical results on hydrology (TensorFlow LSTM) and neural forecasting (NeuralForecast models) show near-parity with bare-metal runs while achieving end-to-end improvements, including reductions of $3.28$ seconds in preprocessing and $75.9$ seconds in training/inference under identical resource conditions. These findings demonstrate a scalable, flexible framework for end-to-end data preprocessing, model training, and postprocessing in diverse scientific domains.

Abstract

Scientific domains such as genetics, climate modeling, and astronomy face significant obstacles in managing, preprocessing, and training on complex data for deep learning. Although several large-scale solutions offer distributed execution environments, open-source alternatives that integrate scalable runtime tools, deep learning, and data frameworks on high-performance computing platforms remain crucial for accessibility and flexibility. In this paper, we introduce Deep Radical-Cylon (RC), a heterogeneous runtime system that combines data engineering, deep learning frameworks, and workflow engines across several HPC environments, including cloud and supercomputing infrastructures. Deep RC supports heterogeneous systems with accelerators, allows the use of communication libraries such as MPI, GLOO, and NCCL across multi-node setups, and facilitates parallel and distributed deep learning pipelines by using Radical Pilot as a task execution framework. For an end-to-end pipeline comprising preprocessing, model training, and postprocessing with 11 neural forecasting models (PyTorch) and hydrology models (TensorFlow) under identical resource conditions, the system reduces execution time by 3.28 and 75.9 seconds, respectively. The design of Deep RC ensures smooth integration of scalable data frameworks, such as Cylon, with deep learning processes, and exhibits strong performance on cloud platforms and scientific HPC systems. By offering a flexible, high-performance solution for resource-intensive applications, this method closes the gap between data preprocessing, model training, and postprocessing.
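The three-stage pipeline described in the abstract can be pictured with a small, purely illustrative sketch. It is not the Deep RC, Cylon, or RADICAL-Pilot API: the `Task` class, `run_pipeline`, and the three stage functions are hypothetical names, and the rule that CPU tasks use MPI or GLOO while GPU tasks use NCCL is a simplifying assumption made only for this example. The sketch runs the stages in order in a single process and records which device and communicator each stage was tagged with.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

# Simplifying assumption for this sketch only: which communicator a task may
# request on each device type. Real deployments can be more flexible.
ALLOWED_BACKENDS = {"cpu": ("mpi", "gloo"), "gpu": ("nccl",)}


@dataclass
class Task:
    """Hypothetical description of one pipeline stage (not a Deep RC class)."""
    name: str
    device: str                  # "cpu" or "gpu"
    backend: str                 # communicator label attached to the task
    run: Callable[[Any], Any]    # the work done by this stage

    def __post_init__(self) -> None:
        if self.device not in ALLOWED_BACKENDS:
            raise ValueError(f"unknown device: {self.device}")
        if self.backend not in ALLOWED_BACKENDS[self.device]:
            raise ValueError(f"{self.backend} is not allowed on {self.device}")


def run_pipeline(tasks: List[Task], data: Any) -> Tuple[Any, List[str]]:
    """Run stages in order, feeding each stage's output to the next."""
    log = []
    for task in tasks:
        data = task.run(data)
        log.append(f"{task.name}:{task.device}/{task.backend}")
    return data, log


def preprocess(rows):
    # Data engineering stand-in: drop missing values, scale by the maximum.
    clean = [r for r in rows if r is not None]
    peak = max(clean)
    return [r / peak for r in clean]


def train(features):
    # Stand-in "model": the mean of the features.
    return sum(features) / len(features)


def postprocess(model):
    # Stand-in postprocessing: round the result for reporting.
    return round(model, 3)


pipeline = [
    Task("preprocess", "cpu", "mpi", preprocess),
    Task("train", "gpu", "nccl", train),
    Task("postprocess", "cpu", "gloo", postprocess),
]
result, log = run_pipeline(pipeline, [2.0, None, 4.0, 8.0])
print(result, log)
```

In the system the paper describes, a workflow engine would place such stages on CPU and GPU resources across nodes; the sketch only shows the stage ordering and the per-stage device and communicator labels.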

Paper Structure

This paper contains 14 sections, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Deep RC design incorporating the Radical-Cylon [sarker2024radical] architecture with GLOO and NCCL as communication frameworks. A modular design with dependent components. Segregated independent modules with a top-down flow from cross-platform to hardware resources. The application pipeline is integrated with Cylon APIs, PyTorch, and TensorFlow through the proposed Deep RC Bridge (Figure 2).
  • Figure 2: Deep RC Bridge. From distributed data pre-processing to deep learning model execution. Cylon distributed Tasks are scheduled and executed on CPUs with MPI/GLOO/UCX communication frameworks.
  • Figure 3: Deep RC Workflow Architecture. Viewed from the bottom up, the compute node is the hardware layer, which is compatible with vendor-based CPUs and GPUs. Deep Radical-Cylon (RC) tasks consist of data engineering and deep learning jobs executed on multiple execution pipelines.
  • Figure 4: Heterogeneous executions with sort and join strong and weak scaling (4) operations on Rivanna. Strong scaling with a parallelism of 640 takes slightly longer because too few rows are available for each worker, leaving some workers idle.
  • Figure 5: Training (left) and prediction (right) accuracy for precipitation with the Camels-US datasets [addor2017camels] in the LSTM hydrology model.
  • ...and 4 more figures