Model Parallelism on Distributed Infrastructure: A Literature Review from Theory to LLM Case-Studies

Felix Brakel, Uraz Odyurt, Ana-Lucia Varbanescu

TL;DR

This survey answers three research questions about model parallelism: which types exist, what challenges they pose, and how they are applied in modern multi-billion parameter transformer networks. To do so, it expresses neural networks as operator graphs and explores the dimensions along which they can be parallelised.

Abstract

Neural networks have become a cornerstone of machine learning. As these networks continue to grow in complexity, so does the complexity of the underlying hardware and software infrastructure for training and deployment. In this survey we answer three research questions: "What types of model parallelism exist?", "What are the challenges of model parallelism?", and "What is a modern use-case of model parallelism?" We answer the first question by looking at how neural networks can be parallelised, expressing them as operator graphs, and exploring the available dimensions. The dimensions along which neural networks can be parallelised are intra-operator and inter-operator. We answer the second question by collecting and listing both the implementation challenges for each type of parallelism and the problem of optimally partitioning the operator graph. We answer the last question by collecting and listing how parallelism is applied in modern multi-billion parameter transformer networks, to the extent that this is possible given the limited information shared about these networks.
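
The two dimensions named in the abstract can be illustrated with a small, hedged sketch. It is not taken from the survey: plain Python lists stand in for tensors, "devices" are simulated by ordinary function calls, and every function name (`split_columns`, `intra_operator_linear`, `inter_operator_mlp`) is hypothetical. The sketch assumes a column-wise split of a weight matrix as one possible form of intra-operator parallelism, and a per-layer placement as one possible form of inter-operator parallelism.

```python
# Illustrative sketch only: lists stand in for tensors and each
# "device" is simulated by a separate computation in the same process.

def matmul(x, w):
    """Multiply an (n x k) matrix by a (k x m) matrix."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)]
            for row in x]

def relu(x):
    """Element-wise ReLU on a matrix."""
    return [[max(0.0, v) for v in row] for row in x]

def split_columns(w, parts):
    """Split a matrix column-wise into `parts` equally sized shards."""
    cols = len(w[0])
    assert cols % parts == 0, "column count must be divisible by parts"
    step = cols // parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]

def intra_operator_linear(x, w, parts):
    """Intra-operator: a single matmul operator is split across devices.

    Each hypothetical device holds one column shard of `w` and computes
    the matching slice of the output; the slices are then concatenated.
    """
    partials = [matmul(x, shard) for shard in split_columns(w, parts)]
    return [sum((p[r] for p in partials), []) for r in range(len(x))]

def inter_operator_mlp(x, w1, w2):
    """Inter-operator: whole operators are assigned to different devices.

    The activation `h` is the tensor that would be sent from the first
    hypothetical device to the second.
    """
    h = relu(matmul(x, w1))   # operators placed on hypothetical device 0
    return matmul(h, w2)      # operator placed on hypothetical device 1

x = [[1.0, 2.0]]
w1 = [[1.0, 0.0, -1.0, 2.0],
      [0.5, 1.0, 1.0, -3.0]]
w2 = [[1.0], [1.0], [1.0], [1.0]]

sharded = intra_operator_linear(x, w1, parts=2)
unsharded = matmul(x, w1)
mlp_out = inter_operator_mlp(x, w1, w2)
```

In this sketch the sharded and unsharded matmul give the same result, which is the property an intra-operator split has to preserve; in a real system the concatenation and the hand-over of `h` would be communication steps between devices.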

Paper Structure (27 sections, 13 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: Megatron-NLG compared to other large language models (source: Smith:2022:UDSM).
  • Figure 2: Overview of scaling up NNs within the NN compute infrastructure.
  • Figure 3: Taxonomy of model parallelism for neural networks. In this survey, we distinguish data parallelism from model parallelism, which both fall under parallelism in neural networks.
  • Figure 4: Three representations of a fully connected layer. The schematic representation highlights the connections between the neurons, the tensor representation shows the mathematical operation implementing the layer, and the operator graph shows the data-flow through the network.
  • Figure 5: Example of a possible inter-operator parallelisation strategy for a Transformer layer and the way an activation tensor flows through it.
  • ...and 1 more figure