On the Performance and Memory Footprint of Distributed Training: An Empirical Study on Transformers

Zhengxian Lu, Fangyu Wang, Zhiwei Xu, Fei Yang, Tao Li

TL;DR

This work addresses the heavy time and memory demands of distributed Transformer training with an analytical framework that decouples training time into $T_{cm}$, $T_{cp}$, $T_{ol}$, and $T_{sd}$ and memory into weights and activations, parameterized by operator and tensor layouts. It systematically analyzes data, tensor, and pipeline parallelism, together with memory optimization techniques such as ZeRO and re-computation, and pairs theoretical predictions with empirical validation on Bert-base and ResNet workloads. The key contributions are four architectural observations, a comprehensive methodology for performance and memory modeling, and experimental insights: pipeline parallelism often yields advantages for Transformers but requires careful management of communication and scheduling, and memory optimizations substantially reduce the memory footprint. The results can guide the choice of distributed strategies and optimization techniques to improve the scalability and efficiency of Transformer training on large clusters.

Abstract

Transformer models have emerged as potent solutions to a wide array of multidisciplinary challenges. Their deployment is significantly hindered by extensive computational and memory requirements, which necessitate advanced, efficient distributed training methods. Prior research has examined the performance bottlenecks of distributed training, aiming to explain them and suggest optimization directions. However, such analyses often overlook three aspects unique to Transformer models: the specialized architecture, the dependency on various distributed strategies, and the requirement to balance computational and memory overhead. This paper aims to bridge this gap by offering a comprehensive examination of the performance bottlenecks inherent in distributed training of Transformer models, combining theoretical analysis with empirical investigation. We propose an analytical framework tailored to these unique aspects of Transformers, facilitating a holistic evaluation of model architectures, distributed strategies, and resource consumption. Based on this framework, we conduct a comparative analysis of theoretical performance and systematically explore how various distributed training strategies fare in real-world scenarios. Most of the experimental results are well explained by the predictions derived from the framework. Notably, our findings suggest an advantage of pipeline parallelism over data parallelism for Transformer models. Moreover, we shed light on some unexpected outcomes, such as the potential for increased total memory overhead due to suboptimal model partitioning within pipeline parallelism. Additionally, we underscore the significance of communication block size and waiting time for further enhancing performance.

Paper Structure

This paper contains 25 sections, 19 equations, 18 figures, 3 tables.

Figures (18)

  • Figure 1: An illustration of distributed parallel strategies, using the training of a 3-layer Multi-layer Perceptron (MLP) as an example. "w1", "w2", and "w3" are the weight tensors of the matmul operators. Backpropagation in pipeline parallelism is shown as "B" to depict the scheduling.
  • Figure 2: An illustration of ZeRO and re-computation.
  • Figure 3: An illustration of how our framework analyzes the performance and memory consumption of distributed training.
  • Figure 4: The bandwidth measured by the NCCL test (scatter plots) and fitted with sigmoidal curves (lines). "ResNet#1-#3" refers to the three sizes of feature maps in ResNet.
  • Figure 5: The prediction of communication time based on activation sizes and the fitted bandwidth. The percentage error is shown above each prediction.
  • ...and 13 more figures
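
The captions of Figures 4 and 5 describe fitting measured bandwidth with a sigmoidal curve and then predicting communication time from activation size and the fitted bandwidth. The following is a minimal sketch of that idea, not the authors' implementation. The function names, the choice of fitting over log2 of the message size, and all parameter values (`bw_max`, `midpoint_log2`, `steepness`) are illustrative assumptions; in practice the parameters would be fitted to measured bandwidth data.

```python
import math


def fitted_bandwidth(size_bytes, bw_max=10e9, midpoint_log2=20.0, steepness=0.5):
    """Hypothetical sigmoidal bandwidth curve (bytes/s) over log2(message size).

    bw_max, midpoint_log2, and steepness are placeholder values, not measured ones.
    """
    x = math.log2(size_bytes)
    return bw_max / (1.0 + math.exp(-steepness * (x - midpoint_log2)))


def predicted_comm_time(size_bytes, **fit_params):
    """Estimated transfer time in seconds: message size divided by fitted bandwidth."""
    return size_bytes / fitted_bandwidth(size_bytes, **fit_params)


def percentage_error(predicted, measured):
    """Absolute percentage error of a prediction relative to a measurement."""
    return abs(predicted - measured) / measured * 100.0


if __name__ == "__main__":
    # Small messages see low effective bandwidth; large ones approach bw_max.
    for size in (2**10, 2**20, 2**30):
        print(size, fitted_bandwidth(size), predicted_comm_time(size))
```

Under this assumed curve, small messages achieve only a fraction of the peak bandwidth, which is one way to see why the communication block size mentioned in the abstract can matter for performance.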