PaSE: Parallelization Strategies for Efficient DNN Training

Venmugil Elango

TL;DR

PaSE addresses the challenge of automatically discovering efficient parallelization strategies for DNN training by modeling the network as a computation graph and minimizing a cost that balances computation and communication across devices. It introduces a dynamic-programming approach augmented by a vertex-sequencing technique (GenerateSeq) to keep dependent sets small, enabling efficient computation of near-optimal strategies that mix data and parameter parallelism. Through experiments on AlexNet, InceptionV3, RNNLM, and Transformer, PaSE achieves up to 1.85× speedup on 1080Ti GPUs and up to 4× on 2080Ti GPUs over data-parallel baselines, outperforming expert-designed strategies and FlexFlow. The method provides practical automatic strategy discovery without semantic model changes, with a modular cost model and a scalable search that leverages sparse DNN graphs in common architectures.

Abstract

Training a deep neural network (DNN) demands substantial computational and memory resources. It is common to use multiple devices to train a DNN to reduce the overall training time. There are several ways to parallelize each layer in a DNN. Exhaustively searching over these choices to find an optimal parallelization strategy is prohibitively time-consuming and impractical. The standard practice is to use data parallelism because of its simplicity. However, data parallelism is often sub-optimal, and suffers from poor performance and high memory requirements. Expert-designed strategies have been proposed on a case-by-case basis using domain-specific knowledge. These expert-designed strategies do not generalize well to DNNs other than the ones for which they were designed, and are not necessarily the best choice. In this paper, we propose an approach to automatically find efficient parallelization strategies for DNNs from their computation graphs. We present an efficient algorithm to compute these strategies within a reasonable time in practice. We evaluate the effectiveness of our approach on various DNNs. We also compare the performance of the strategies identified by our approach against data parallelism, expert-designed strategies, and state-of-the-art approaches. Our results show that the strategies found using our approach outperform the baseline data parallelism strategy in all cases. In addition, our strategies achieve better performance than the expert-designed strategies and the state-of-the-art approaches.

Paper Structure (18 sections, 4 theorems, 8 equations, 6 figures, 2 tables)

Key Result

Theorem 1

Let $G=(V, E)$ be a computation graph for a DNN that is executed on $p$ devices with average FLOP-to-bytes ratio $r$. Let $\mathcal{V}$ be a sequence for $V$, and $\Phi$ be the set of all possible strategies for $G$. Then,

Figures (6)

  • Figure 1: Iteration space of a GEMM computation parallelized using the configuration $(1, 4, 2)$. $j$ and $k$ dimensions are split $4$-ways and $2$-ways, respectively, while the $i$ dimension is not parallelized.
  • Figure 2: A toy computation graph $G$, and an ordering $\mathcal{V}$ of its vertices. For the vertex $\mathpzc{v}^{(5)}$ (marked in green), its connected set $X(5)=\{\mathpzc{v}^{(1)}, \mathpzc{v}^{(2)}, \mathpzc{v}^{(3)}, \mathpzc{v}^{(5)}\}$, and its dependent set $D(5)=\{\mathpzc{v}^{(8)}\}$ (marked in red). Its connected subsets $S(5)=\{\{\mathpzc{v}^{(1)},\mathpzc{v}^{(2)}\}, \{\mathpzc{v}^{(3)}\}\}$ are represented by blue boxes in the figure. A similar, but more elaborate, structure appears in InceptionV3 (see Fig. 5) and Transformer models.
  • Figure 3: Algorithm to generate a sequence $\mathcal{V}$ such that sizes of dependent sets are small.
  • Figure 4: Dynamic programming based algorithm to compute an efficient strategy for a computation graph $G$.
  • Figure 5: Computation subgraph corresponding to the InceptionE module of InceptionV3. A similar structure repeats throughout the graph. Nodes $171$ and $193$ have high degree, while the remaining nodes have low degree.
  • ...and 1 more figure
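
The configuration $(1, 4, 2)$ described in the Figure 1 caption can be made concrete with a small sketch. The code below is only an illustration and is not taken from the paper: the helper name `partition_iteration_space`, the GEMM extents $(64, 128, 256)$, and the requirement that each extent divide evenly by its split factor are all assumptions chosen for the example. It enumerates the tiles of the $(i, j, k)$ iteration space that each of the $1 \times 4 \times 2 = 8$ devices would own.

```python
from itertools import product


def partition_iteration_space(extents, config):
    """Split a rectangular iteration space into per-device tiles.

    extents: loop extents, e.g. (I, J, K) for a GEMM C[i, j] += A[i, k] * B[k, j]
    config:  how many ways each dimension is split, e.g. (1, 4, 2)

    Returns a dict mapping a device coordinate to one (start, stop) range
    per dimension. Hypothetical helper; assumes even divisibility.
    """
    for extent, ways in zip(extents, config):
        if extent % ways != 0:
            raise ValueError("each extent must be divisible by its split factor")

    # Size of one tile along each dimension.
    tile = [extent // ways for extent, ways in zip(extents, config)]

    tiles = {}
    # One device per point in the grid defined by the configuration.
    for coord in product(*(range(ways) for ways in config)):
        tiles[coord] = tuple((c * t, (c + 1) * t) for c, t in zip(coord, tile))
    return tiles


# Illustrative extents (assumed): I=64, J=128, K=256.
tiles = partition_iteration_space((64, 128, 256), (1, 4, 2))
print(len(tiles))        # 8 devices in total
print(tiles[(0, 3, 1)])  # ((0, 64), (96, 128), (128, 256))
```

Every tile spans the full $i$ range because that dimension is not parallelized. Since $k$ is the reduction dimension of a GEMM, the two devices that share a $j$ tile each hold only partial sums, which would have to be combined in a communication step.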

Theorems & Definitions (4)

  • Theorem 1
  • Theorem 2
  • Theorem 2
  • Theorem 2