TiMePReSt: Time and Memory Efficient Pipeline Parallel DNN Training with Removed Staleness

Ankita Dutta, Nabendu Chaki, Rajat K. De

TL;DR

The paper presents a mathematical relationship between the number of micro-batches and the number of worker machines that highlights how the version difference varies, together with a mathematical expression for calculating version differences for various combinations of the two without drawing a diagram for each combination.

Abstract

DNN training is time-consuming and requires efficient multi-accelerator parallelization, where a single training iteration is split over the available accelerators. Current approaches often parallelize training using intra-batch parallelization. Combining inter-batch and intra-batch pipeline parallelism is common to further improve training throughput. In this article, we develop a system, called TiMePReSt, that combines them in a novel way that better overlaps computation and communication and limits the amount of communication. Traditional pipeline-parallel training of DNNs follows a similar working principle to sequential (conventional) training by maintaining consistent weight versions in the forward and backward passes of a mini-batch. Thus, it suffers from a high GPU memory footprint during training. In this paper, an experimental study demonstrates that compromising weight consistency does not decrease the prediction capability of a DNN trained in parallel. Moreover, TiMePReSt overcomes the GPU memory overhead and achieves zero weight staleness. State-of-the-art techniques often become costly in terms of training time. To address this issue, TiMePReSt introduces a variant of intra-batch parallelism that parallelizes the forward pass of each mini-batch by decomposing it into smaller micro-batches. A novel synchronization method between forward and backward passes reduces training time in TiMePReSt. The occurrence of the multiple sequence problem and its relation to version difference have been observed in TiMePReSt. This paper presents a mathematical relationship between the number of micro-batches and the number of worker machines, highlighting the variation in version difference. A mathematical expression has been developed to calculate version differences for various combinations of these two without creating diagrams for all combinations.

Paper Structure

This paper contains 20 sections, 25 equations, 16 figures.

Table of Contents

  1. Introduction
  2. Background
  3. Data Parallelism
  4. Model Parallelism
  5. Related Works
  6. Methodology
  7. Model Architecture
  8. Solution: To overcome the above-mentioned bottleneck of state-of-the-art pipeline-parallel methods for training DNNs, we introduce a pipeline-based methodology in which each mini-batch backpropagates the gradients of the prediction error with respect to the latest updated version of the weights rather than the version used during forward propagation (prediction). As discussed earlier, an obvious scenario arises in pipeline parallelism: during the forward propagation of one mini-batch, other mini-batches may complete backpropagation and update the weights. This scenario is beneficial if and only if the updated weights can be used in the upcoming forward and backward passes. Existing pipeline-parallel DNN training strategies cannot fully exploit this benefit, but TiMePReSt can, using the proposed strategy. More precisely, TiMePReSt does not allow gradients to be computed on stale weights \cite{narayanan2019pipedream}.
  9. Solution: To balance mini-batch size against training time, TiMePReSt divides a larger mini-batch into smaller micro-batches and performs their forward passes in a pipelined manner. However, unlike conventional training and asynchronous pipeline parallelism techniques, their backward passes do not start immediately. Once all the micro-batches of a mini-batch have completed their forward passes, the backward pass starts for the mini-batch, using the average prediction error (loss) over all of its micro-batches. For example, in Figure \ref{fig:N = 2}, mini-batch 1 is divided into two micro-batches, 1A and 1B. The backward pass of mini-batch 1 starts once the forward passes of both 1A and 1B are complete. The same holds for the other mini-batches. The proposed strategy obtains the effect of a full mini-batch without processing the entire mini-batch at once.
  10. Work Scheduling
  11. Checkpointing
  12. Multiple Sequence Problem
  13. Evaluation
  14. Time needed to achieve target accuracy: We compare the time-to-accuracy of TiMePReSt and PipeDream for VGG-16 on the CIFAR-100 and Tiny-ImageNet-200 image classification datasets, using a cluster of two machines with a single GPU each. One is an NVIDIA Quadro RTX 6000 with 24 GB of GPU memory; the other is an NVIDIA GeForce RTX 2080 with 12 GB of GPU memory. Figures \ref{fig:Top-1 accuracy to time_cifar100} and \ref{fig:Top-5 accuracy to time_cifar100} show that, on CIFAR-100, TiMePReSt reaches the target top-1 and top-5 accuracy, respectively, much faster than PipeDream. Figures \ref{fig:Top-1 accuracy to time_tiny_imagenet} and \ref{fig:Top-5 accuracy to time_tiny_imagenet} show similar results for Tiny-ImageNet-200. The main factors behind this time efficiency over PipeDream are an extra level of intra-batch parallelism in the forward pass and a limited frequency of backward passes, both achieved without compromising the mini-batch size.
  15. Accuracy reached after equal execution time: TiMePReSt achieves higher top-1 and top-5 accuracies than PipeDream after equal training time for VGG-16 on both CIFAR-100 and Tiny-ImageNet-200. Figures \ref{fig:Top-1 accuracy to time_cifar100} and \ref{fig:Top-5 accuracy to time_cifar100} (CIFAR-100) and Figures \ref{fig:Top-1 accuracy to time_tiny_imagenet} and \ref{fig:Top-5 accuracy to time_tiny_imagenet} (Tiny-ImageNet-200) show this comparison as the VGG-16 network is trained over time.
  16. ...and 5 more sections
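
The micro-batching idea summarized in item 9 can be illustrated with a minimal sketch. It is not the authors' implementation: the scalar linear model, the squared-error loss, the equal-size split, and all function names are assumptions chosen for illustration. It only shows that averaging the per-micro-batch losses (1A and 1B) reproduces the loss of the whole mini-batch, which is why a single backward pass per mini-batch can be started after all micro-batch forward passes finish.

```python
def split_into_micro_batches(mini_batch, num_micro_batches):
    """Split a list of samples into equally sized, contiguous micro-batches."""
    if len(mini_batch) % num_micro_batches != 0:
        raise ValueError("mini-batch size must be divisible by the number of micro-batches")
    size = len(mini_batch) // num_micro_batches
    return [mini_batch[i:i + size] for i in range(0, len(mini_batch), size)]


def mean_squared_error(weight, samples):
    """Forward pass of a toy model y_hat = weight * x, followed by the mean squared error."""
    return sum((weight * x - y) ** 2 for x, y in samples) / len(samples)


def loss_over_micro_batches(weight, mini_batch, num_micro_batches):
    """Run the forward pass per micro-batch, then average the micro-batch losses.

    In a pipeline, the per-micro-batch forward passes would overlap across workers;
    here they run sequentially because only the arithmetic is being illustrated.
    """
    micro_batches = split_into_micro_batches(mini_batch, num_micro_batches)
    micro_losses = [mean_squared_error(weight, mb) for mb in micro_batches]
    return sum(micro_losses) / len(micro_losses)


# Hypothetical mini-batch 1 with four (input, target) samples, split into 1A and 1B.
mini_batch_1 = [(1.0, 2.0), (2.0, 4.0), (3.0, 5.0), (4.0, 9.0)]
weight = 1.5  # assumed current weight of the toy model

loss_from_micro_batches = loss_over_micro_batches(weight, mini_batch_1, 2)
loss_full_mini_batch = mean_squared_error(weight, mini_batch_1)
print(loss_from_micro_batches, loss_full_mini_batch)
```

The two values agree here only because the micro-batches have equal size; with unequal sizes, a size-weighted average would be needed to match the full mini-batch loss.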

Figures (16)

  • Figure 1: Different Model Synchronization Strategies in Data Parallelism
  • Figure 2: Scenarios where different parallelism techniques are needed. Model parallelism is preferred for training a complex DNN with a large number of parameters and a complex architecture. Data parallelism is used to handle very large amounts of data.
  • Figure 3: An example of tensor parallelism with three worker nodes. A DNN with four layers is distributed across the nodes based on intra-layer partitions.
  • Figure 4: An example of pipeline parallelism with three worker nodes. A DNN with four layers is distributed layer-wise across the nodes. Each mini-batch of input data is passed through all the consecutive stages.
  • Figure 5: An example of PipeDream parallel training with four workers. Each mini-batch maintains the same weight versions on both forward and backward passes.
  • ...and 11 more figures