Dual-pronged deep learning preprocessing on heterogeneous platforms with CPU, Accelerator and CSD

Jia Wei, Xingjun Zhang, Witold Pedrycz, Longxiang Wang, Jie Zhao

TL;DR

DDLP tackles the data preprocessing bottleneck in image-focused deep learning by distributing preprocessing across CPUs and Computable Storage Devices (CSDs) and overlapping this work with accelerator training. It introduces two adaptive strategies, Moving Towards Each Other (MTE) and Weighted Round Robin (WRR), that coordinate data flow from both ends of the dataset, and it uses GDS for direct SSD-to-accelerator transfers. The authors provide theoretical and empirical validation on ImageNet and CIFAR-10 with GPUs and DSAs, showing a speedup of up to roughly 23% and substantial reductions in energy and CPU/DRAM usage, with complementary gains when combined with NVIDIA DALI. The work demonstrates that co-designing preprocessing on heterogeneous platforms is practical and offers guidance for reducing preprocessing overhead in large-scale DL training.

Abstract

For image-related deep learning tasks, the first step often involves reading data from external storage and preprocessing it on the CPU. As accelerators become faster and the number of accelerators per compute node grows, the gap in computing and data-transfer capability between accelerators and CPUs widens, and data reading and preprocessing increasingly become the bottleneck of these tasks. Our work, DDLP, addresses the computing and data-transfer bottleneck of deep learning preprocessing using Computable Storage Devices (CSDs). DDLP lets the CPU and the CSD preprocess in parallel, starting from opposite ends of the dataset. To this end, we propose two adaptive dynamic selection strategies that let DDLP direct the accelerator to read data automatically from the different sources. The two strategies trade off consistency against efficiency. DDLP overlaps CSD preprocessing with CPU preprocessing, accelerator computation, and accelerator data reading. In addition, DDLP leverages direct storage technology to enable efficient SSD-to-accelerator data transfer. DDLP also replaces part of the expensive CPU and DRAM usage with more energy-efficient CSDs, alleviating preprocessing bottlenecks while significantly reducing power consumption. Extensive experimental results show that DDLP improves learning speed by up to 23.5% on the ImageNet dataset while reducing energy consumption by 19.7% and CPU and DRAM usage by 37.6%. DDLP also improves learning speed by up to 27.6% on the CIFAR-10 dataset.
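The idea of two workers preprocessing one dataset from opposite ends can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' implementation: the class `TwoEndedSchedule`, the function `simulate`, and the per-batch timings are hypothetical, and the accelerator training that DDLP overlaps with preprocessing is not modeled.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class TwoEndedSchedule:
    """Hypothetical bookkeeping: two workers claim batch indices from opposite ends."""
    num_batches: int
    head: int = 0                   # next index claimed from the front (CPU side)
    tail: int = field(init=False)   # next index claimed from the back (CSD side)

    def __post_init__(self) -> None:
        self.tail = self.num_batches - 1

    def done(self) -> bool:
        # The two cursors have crossed, so every batch has been claimed once.
        return self.head > self.tail

    def next_for_cpu(self) -> Optional[int]:
        if self.done():
            return None
        idx = self.head
        self.head += 1
        return idx

    def next_for_csd(self) -> Optional[int]:
        if self.done():
            return None
        idx = self.tail
        self.tail -= 1
        return idx


def simulate(num_batches: int, cpu_time: float, csd_time: float) -> Tuple[List[int], List[int]]:
    """Whichever worker is free first claims its next batch (ties go to the CPU).

    cpu_time and csd_time are assumed, constant per-batch preprocessing costs.
    """
    sched = TwoEndedSchedule(num_batches)
    cpu_clock = csd_clock = 0.0
    cpu_batches: List[int] = []
    csd_batches: List[int] = []
    while not sched.done():
        if cpu_clock <= csd_clock:
            cpu_batches.append(sched.next_for_cpu())
            cpu_clock += cpu_time
        else:
            csd_batches.append(sched.next_for_csd())
            csd_clock += csd_time
    return cpu_batches, csd_batches


if __name__ == "__main__":
    cpu, csd = simulate(num_batches=10, cpu_time=1.0, csd_time=4.0)
    print("CPU batches:", cpu)
    print("CSD batches:", csd)
```

Because each worker only advances its own cursor, the meeting point adapts to the relative speeds: a slower CSD simply claims fewer batches from the back, and no batch is processed twice.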

Paper Structure (29 sections, 4 equations, 8 figures, 9 tables, 2 algorithms)

Figures (8)

  • Figure 1: Ratio of Data Preprocessing Time to GPU Training Time vs. Number of Processes. Experiments on 19 Torchvision models with ImageNet show a maximum overhead of 60.67× (mean 20.18×) under single‑process reading. While subprocess parallelism reduces this ratio, it remains above 1 in all configurations, confirming a persistent preprocessing bottleneck. See Section \ref{env} for experimental details.
  • Figure 2: Schematic diagram of an example environment with a CSD.
  • Figure 3: Deep Learning Process. The classical deep learning process (top) and the DDLP deep learning process (bottom).
  • Figure 4: Schematic diagram of the DDLP architecture.
  • Figure 5: DDLP training process with (a) MTE and (b) WRR.
  • ...and 3 more figures