Hybrid Dual-Batch and Cyclic Progressive Learning for Efficient Distributed Training

Kuan-Wei Lu, Ding-Yong Hong, Pangfeng Liu, Jan-Jan Wu

TL;DR

The paper tackles the efficiency-generalization gap in distributed deep learning with two techniques. Dual-batch learning trains with two batch sizes, $B_L$ and $B_S$, at the same time on a parameter server framework. Cyclic progressive learning repeatedly schedules training from low to high image resolution. The hybrid scheme combines both, balancing throughput and gradient diversity while adapting batch sizes, resolutions, and learning rates across training stages. A model-update factor based on how data is distributed between large- and small-batch workers, together with a memory-based method that automatically determines $B_{max}$, makes the training scalable and hardware-aware. Experiments on CIFAR-100 and ImageNet using ResNet-18 show up to a 34.8% reduction in training time with comparable or improved accuracy, highlighting practical gains for large-scale CNNs.

Abstract

Distributed machine learning is critical for training deep learning models on large datasets with numerous parameters. Current research primarily focuses on leveraging additional hardware resources and powerful computing units to accelerate the training process. As a result, larger batch sizes are often employed to speed up training. However, training with large batch sizes can lead to lower accuracy due to poor generalization. To address this issue, we propose the dual-batch learning scheme, a distributed training method built on the parameter server framework. This approach maximizes training efficiency by utilizing the largest batch size that the hardware can support while incorporating a smaller batch size to enhance model generalization. By using two different batch sizes simultaneously, this method improves accuracy with minimal additional training time. Additionally, to mitigate the time overhead caused by dual-batch learning, we propose the cyclic progressive learning scheme. This technique repeatedly and gradually increases image resolution from low to high during training, thereby reducing training time. By combining cyclic progressive learning with dual-batch learning, our hybrid approach improves both model generalization and training efficiency. Experimental results with ResNet-18 demonstrate that, compared to conventional training methods, our approach improves accuracy by 3.3% while reducing training time by 10.1% on CIFAR-100, and further achieves a 34.8% reduction in training time on ImageNet.
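The abstract describes cyclic progressive learning only at a high level: resolution rises from low to high, and that ramp is repeated during training. The following is a minimal sketch of one way such a schedule could be laid out per epoch. The function name `cyclic_resolution_schedule`, the equal split of epochs across cycles and stages, and the example resolutions are all assumptions made for illustration; they are not taken from the paper.

```python
def cyclic_resolution_schedule(total_epochs, resolutions, num_cycles):
    """Return one image resolution per epoch, ramping low -> high in each cycle.

    Hypothetical helper: assumes epochs are split evenly across cycles and,
    within a cycle, evenly across the resolution stages.
    """
    if total_epochs <= 0 or num_cycles <= 0 or not resolutions:
        raise ValueError("total_epochs, num_cycles and resolutions must be non-empty/positive")

    ordered = sorted(resolutions)           # low resolution first
    cycle_len = total_epochs // num_cycles  # epochs per full cycle
    if cycle_len < len(ordered):
        raise ValueError("each cycle needs at least one epoch per resolution")

    schedule = []
    for epoch in range(total_epochs):
        pos = epoch % cycle_len                    # position inside the current cycle
        stage = pos * len(ordered) // cycle_len    # stage index in 0..len(ordered)-1
        schedule.append(ordered[stage])
    # If total_epochs is not a multiple of num_cycles, the leftover epochs
    # simply begin another (partial) ramp from the lowest resolution.
    return schedule


example = cyclic_resolution_schedule(total_epochs=12, resolutions=[16, 24, 32], num_cycles=2)
print(example)
```

With these illustrative inputs, each of the two cycles spends two epochs at each resolution before restarting from the lowest one. In a real training loop the per-epoch resolution would drive the input resizing step, and the abstract indicates that batch size and learning rate are adapted alongside it.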

Paper Structure

This paper contains 28 sections, 7 equations, 13 figures, 10 tables.

Figures (13)

  • Figure 1: Illustration of sharp and flat minima in the loss landscape. At a sharp minimum, a small shift between the training and testing loss produces a large increase in testing loss and therefore poor generalization; at a flat minimum, the same shift produces only a small increase in loss, which improves generalization (Keskar et al., 2017).
  • Figure 2: Architecture of the parameter server framework for distributed deep learning. The server maintains the global parameters, while multiple workers perform local training and send back the updated parameters.
  • Figure 3: Training time per batch for different batch sizes in PyTorch and TensorFlow. The results highlight the linear relationship between batch size and training time.
  • Figure 4: Training time per epoch for different batch sizes. Larger batch sizes improve GPU utilization and reduce training time, as validated by experimental results.
  • Figure 5: Comparison of testing loss under three conditions: ${d_S}/{d_L}$, $\sqrt{{d_S}/{d_L}}$, and without a model-update factor.
  • ...and 8 more figures
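The Figure 5 caption compares three choices for the model-update factor: ${d_S}/{d_L}$, $\sqrt{{d_S}/{d_L}}$, and no factor. As a rough sketch, the three candidates can be computed side by side. This assumes $d_S$ and $d_L$ denote the amounts of data handled by the small-batch and large-batch workers, which this page does not state explicitly. The function name and the sample counts are hypothetical, and how the factor enters the parameter update is not shown here.

```python
import math


def update_factor_candidates(d_small, d_large):
    """Compute the three factor options named in the Figure 5 caption.

    Assumption: d_small and d_large are the data amounts assigned to the
    small-batch and large-batch workers, respectively.
    """
    if d_small <= 0 or d_large <= 0:
        raise ValueError("data amounts must be positive")
    ratio = d_small / d_large
    return {
        "ratio": ratio,                  # d_S / d_L
        "sqrt_ratio": math.sqrt(ratio),  # sqrt(d_S / d_L)
        "none": 1.0,                     # no model-update factor
    }


# Illustrative split only: 10,000 samples on small-batch workers,
# 40,000 samples on large-batch workers.
factors = update_factor_candidates(10_000, 40_000)
print(factors)
```

The square-root variant always lies between the plain ratio and 1 whenever $d_S < d_L$, so it damps the update less aggressively than the plain ratio does.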