MinatoLoader: Accelerating Machine Learning Training Through Efficient Data Preprocessing

Rahma Nouaji, Stella Bitchebe, Ricardo Macedo, Oana Balmau

TL;DR

MinatoLoader tackles the data preprocessing bottleneck in ML training by addressing head-of-line blocking caused by per-sample variability in preprocessing times. It introduces a sample-aware load balancer with per-sample timeouts and a multi-queue architecture that decouples fast and slow samples from batch construction, supplemented by a dynamic worker scheduler to keep GPUs busy. The approach yields up to $7.5\times$ faster training and GPU utilization around $90.5\%$, while preserving accuracy across diverse workloads (image segmentation, object detection, and speech recognition) on single-node multi-GPU setups. This work demonstrates practical impact by delivering a general, drop-in replacement for PyTorch DataLoader that scales across configurations without dataset-specific tuning, offering substantial gains for compute-bound preprocessing pipelines in real-world training workflows.
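The head-of-line blocking effect described above can be illustrated with a small simulation. This is a minimal sketch, not MinatoLoader or PyTorch code: the function name `batch_latencies` and the per-sample costs are hypothetical, and it assumes every sample in a batch is preprocessed in parallel by its own worker, so a batch is ready only when its slowest sample finishes.

```python
def batch_latencies(costs_s, batch_size):
    """Latency of each in-order batch when its samples are preprocessed in parallel.

    Assumption: one worker per sample, so the batch latency is the
    maximum per-sample preprocessing cost within that batch.
    """
    return [
        max(costs_s[start:start + batch_size])
        for start in range(0, len(costs_s), batch_size)
    ]


# Hypothetical per-sample preprocessing costs in seconds, with one slow outlier.
costs_s = [0.1, 0.1, 2.0, 0.1, 0.1, 0.1, 0.1, 0.1]
latencies = batch_latencies(costs_s, batch_size=4)
print(latencies)  # [2.0, 0.1]
```

In this toy setting, the first batch takes 20 times longer than the second even though three of its four samples are fast, which is the stall that a sample-aware loader aims to avoid.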

Abstract

Data loaders are used by Machine Learning (ML) frameworks like PyTorch and TensorFlow to apply transformations to data before feeding it into the accelerator. This operation is called data preprocessing. Data preprocessing plays an important role in the ML training workflow because, if it is inefficiently pipelined with training, it can cause high GPU idleness and significant training delays. Unfortunately, existing data loaders waste GPU resources; with the PyTorch data loader, for example, GPU idleness reaches $76\%$. One key source of inefficiency is the variability in preprocessing time across samples within the same dataset. Existing data loaders are oblivious to this variability and construct batches without distinguishing slow samples from fast ones. As a result, a single slow sample can delay the entire batch, stalling the training pipeline and causing head-of-line blocking. To address these inefficiencies, we present MinatoLoader, a general-purpose data loader for PyTorch that accelerates training and improves GPU utilization. MinatoLoader is designed for a single-server setup with multiple GPUs. It continuously prepares data in the background and actively constructs batches by prioritizing fast-to-preprocess samples, while slower samples are processed in parallel. We evaluate MinatoLoader on servers with V100 and A100 GPUs. On a machine with four A100 GPUs, MinatoLoader improves the training time of a wide range of workloads by up to $7.5\times$ ($3.6\times$ on average) over PyTorch DataLoader and Pecan, and up to $3\times$ ($2.2\times$ on average) over DALI. It also increases average GPU utilization from $46.4\%$ with PyTorch to $90.45\%$, while preserving model accuracy and enabling faster convergence.
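The idea of building batches from fast samples first, while slow samples are handled separately, can be sketched as follows. This is a simplified, single-threaded illustration under stated assumptions, not the authors' implementation: `Sample`, `route_samples`, `build_batch`, and the `timeout_s` threshold are hypothetical names, the per-sample cost is assumed to be known up front (a real loader would only discover it when a timeout fires), and samples in the slow queue are assumed to have been finished by background workers before they are dequeued.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Sample:
    sample_id: int
    cost_s: float  # hypothetical preprocessing cost in seconds


def route_samples(samples, timeout_s):
    """Split samples into a fast queue and a slow queue using a per-sample timeout."""
    fast_queue, slow_queue = deque(), deque()
    for sample in samples:
        if sample.cost_s <= timeout_s:
            fast_queue.append(sample)
        else:
            # Assumption: samples exceeding the timeout are completed elsewhere
            # (for example by background workers) and parked in the slow queue.
            slow_queue.append(sample)
    return fast_queue, slow_queue


def build_batch(fast_queue, slow_queue, batch_size):
    """Fill a batch from the fast queue first, falling back to the slow queue."""
    batch = []
    while len(batch) < batch_size and (fast_queue or slow_queue):
        source = fast_queue if fast_queue else slow_queue
        batch.append(source.popleft())
    return batch


costs = [0.1, 2.0, 0.2, 0.1, 3.0, 0.3]
samples = [Sample(i, c) for i, c in enumerate(costs)]
fast_q, slow_q = route_samples(samples, timeout_s=0.5)
first = [s.sample_id for s in build_batch(fast_q, slow_q, batch_size=4)]
second = [s.sample_id for s in build_batch(fast_q, slow_q, batch_size=4)]
print(first, second)  # [0, 2, 3, 5] [1, 4]
```

Here the first batch contains only fast samples, so it does not wait on samples 1 and 4. Those appear in a later batch once the fast queue is drained. How a real system chooses the timeout and mixes slow samples back into batches is outside the scope of this sketch.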

Paper Structure

This paper contains 61 sections, 2 equations, 12 figures, 3 tables, 1 algorithm.

Figures (12)

  • Figure 1: Inefficient PyTorch DataLoader pipeline. Slow data samples delay the batch construction process, resulting in GPU under-utilization and poor training performance.
  • Figure 2: Variability in per-sample preprocessing time for image segmentation and object detection workloads. The red dashed lines depict the average preprocessing time across all samples: 0.5 s in (a) and 35 ms in (b).
  • Figure 3: CPU and GPU usage of the Object Detection workload when using two heuristics: (a) image size and (b) transformation reordering.
  • Figure 4: Impact of the prefetch parameter on training time. Increasing the number of prefetched batches does not improve training time in either (a) PyTorch or (b) DALI.
  • Figure 5: MinatoLoader high-level design. It continuously enqueues preprocessed samples into a specified queue based on a load balancer decision. Concurrently, the GPU dequeues preprocessed data for training. We show MinatoLoader for one GPU, but it generalizes to multi-GPU settings.
  • ...and 7 more figures