Cyclic Data Parallelism for Efficient Parallelism of Deep Neural Networks

Louis Fournier, Edouard Oyallon

TL;DR

Cyclic Data Parallelism shifts the execution of micro-batches from simultaneous to sequential, with a uniform delay between workers. At the cost of a slight gradient delay, it reduces the number of GPUs needed by sharing GPUs across micro-batches.

Abstract

Training large deep learning models requires parallelization techniques to scale. In existing methods such as Data Parallelism or ZeRO-DP, micro-batches of data are processed in parallel, which creates two drawbacks: the total memory required to store the model's activations peaks at the end of the forward pass, and gradients must be simultaneously averaged at the end of the backpropagation step. We propose Cyclic Data Parallelism, a novel paradigm shifting the execution of the micro-batches from simultaneous to sequential, with a uniform delay. At the cost of a slight gradient delay, the total memory taken by activations is constant, and the gradient communications are balanced during the training step. With Model Parallelism, our technique reduces the number of GPUs needed, by sharing GPUs across micro-batches. Within the ZeRO-DP framework, our technique allows communication of the model states with point-to-point operations rather than a collective broadcast operation. We illustrate the strength of our approach on the CIFAR-10 and ImageNet datasets.
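To give a feel for what "a slight gradient delay" means, here is a minimal sketch of gradient descent with a stale gradient on a toy quadratic. It is not the paper's update rule (the equations are not reproduced on this page); the function name `sgd_with_delay`, the objective, and the step size are assumptions chosen purely for illustration.

```python
def sgd_with_delay(delay, steps=200, lr=0.1, theta0=1.0):
    """Minimise f(theta) = 0.5 * theta**2 with gradients that are `delay` steps old.

    Toy illustration only: the objective, learning rate and delay model are
    assumptions, not the update rules of the paper.
    """
    history = [theta0]  # history[t] is the parameter value at step t
    for t in range(steps):
        # Parameters the gradient was computed on (clamped at the initial value).
        stale = history[max(t - delay, 0)]
        grad = stale  # gradient of 0.5 * theta**2 is theta
        history.append(history[-1] - lr * grad)
    return history


no_delay = sgd_with_delay(delay=0)
one_step = sgd_with_delay(delay=1)
print(abs(no_delay[-1]), abs(one_step[-1]))
```

On this toy problem both runs approach the minimum at zero, which only shows that a one-step delay is harmless for a small step size on a quadratic; it says nothing by itself about deep networks.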

Paper Structure (34 sections, 4 equations, 4 figures, 2 tables)

This paper contains 34 sections, 4 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Timeline of executions for Data Parallelism (DP) and the two versions of Cyclic Data Parallelism (CDP), for $N{=}3$ workers. (a) DP. The 3 workers begin their forward pass simultaneously and stay synchronized throughout the entire forward-backward pass. (b) CDP-v1. In CDP, the 3 workers start with an equal delay between them (here $2$ time steps). For CDP-v1 (see Eq. \ref{eq:ur_tm1}), the parameters of the model are updated with a constant delay equal to one training step. (c) CDP-v2. This delay is limited in CDP-v2 (see Eq. \ref{eq:ur_int}) by allowing the stages to update and send gradients independently. The communication scheme, balanced across the training step, is indicated. Note that the total complexity of a training step (a forward-backward pass) does not change, but activation memory does not peak in CDP as it does in DP.
  • Figure 2: Comparison between parallelism frameworks with and without CDP, for $N{=}3$. A device (e.g., a GPU) is represented by a rectangle, and the different micro-batches being computed by the $3$ colors. A model stage requires memory for the parameters used for computation and for the activations retained awaiting the backward pass, indicated with a colored disk and a black circle, respectively. Communications are intra- or inter-device (thin or thick arrow) and collective or point-to-point (double-headed or single-headed arrow). (a) Single-GPU DP. This setting corresponds to a high-connectivity device with limited memory. Memory is reduced by half. (b) Multi-GPU DP. Communications can be drastically reduced when using multiple GPUs with CDP. (c) DP+MP. Both the number of required GPUs and the communications are reduced compared to a standard implementation of MP with DP. Only $N$ GPUs are needed in Pipeline Parallelism (PP), but they require more activation memory, shown with a thicker circle. (d) ZeRO-DP. The model states need to be sent or received by only one worker at each time step, instead of the standard broadcast operation of ZeRO-DP.
  • Figure 3: Training loss of a ResNet-50 trained on ImageNet following the learning rules \ref{eq:ur_t}, \ref{eq:ur_tm1} and \ref{eq:ur_int}. Values are averaged over a window of 7 epochs for readability. The loss of CDP-v1 is significantly higher at the beginning of training, which is not the case for CDP-v2. As parameters converge, the effect of the delay disappears and the three losses show a similar training curve, with a small advantage for both CDP-v1 and CDP-v2. This confirms that the delay in our update rules does not affect convergence, even in realistic settings.
  • Figure 4: Activation memory per worker when training with $N$ workers on ImageNet with an efficient implementation of DP (solid) and CDP (dashed), on a ResNet-50 and a ViT-B/16. An optimal halving of the parameters is represented in black ('Optimal'). The memory required by a forward-backward pass for one worker is first tracked, and parameter memory is removed. The figure is then extrapolated by mimicking the total memory used by $N$ workers training with DP (i.e. simultaneously) or CDP (i.e. cyclically), and dividing by $N$. As $N$ increases, the memory required by CDP flattens to a value lower than that of DP. For the ViT-B/16, the reduction reaches $42\%$, close to the theorized halving. The heterogeneity of the layers of the ResNet reduces the effectiveness of CDP, which only reaches a $30\%$ reduction.
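
The activation-memory argument behind Figures 1 and 4 can be sketched with a toy timeline. The model below is an assumption made for illustration, not the paper's measurement procedure: every stage takes one time step forward and one backward, each stage stores one unit of activations, the system is in steady state, and CDP workers are offset by an equal share of the forward-backward period. The function names are hypothetical.

```python
def worker_activations(t, num_stages):
    """Units of activation memory held by one worker at step t of its own pass."""
    t %= 2 * num_stages
    # Forward: one more stage's activations are stored per step.
    # Backward: one stage's activations are freed per step.
    return t + 1 if t < num_stages else 2 * num_stages - t


def total_activations(num_workers, num_stages, cyclic):
    """Total activation memory over all workers at each step of one training step."""
    period = 2 * num_stages
    if cyclic and period % num_workers != 0:
        raise ValueError("this sketch needs num_workers to divide 2 * num_stages")
    # DP: all workers are synchronized. CDP: workers start with a uniform delay.
    delay = period // num_workers if cyclic else 0
    return [
        sum(worker_activations(t - i * delay, num_stages) for i in range(num_workers))
        for t in range(period)
    ]


dp = total_activations(num_workers=3, num_stages=3, cyclic=False)
cdp = total_activations(num_workers=3, num_stages=3, cyclic=True)
print("DP :", dp)   # peaks when all workers finish their forward pass
print("CDP:", cdp)  # constant across the training step

for n in (3, 8, 32):
    peak_dp = max(total_activations(n, n, cyclic=False))
    peak_cdp = max(total_activations(n, n, cyclic=True))
    print(n, peak_cdp / peak_dp)
```

With 3 workers and 3 stages, this toy model gives `[3, 6, 9, 9, 6, 3]` for DP and a flat `[6, 6, 6, 6, 6, 6]` for CDP. When the number of stages equals the number of workers $N$, the peak ratio is $(N+1)/(2N)$, which tends to one half as $N$ grows; real networks with heterogeneous layers will deviate from this idealized figure.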