Integrated Hardware Architecture and Device Placement Search

Irene Wang, Jakub Tarnawski, Amar Phanishayee, Divya Mahajan

TL;DR

This work is the first to co-optimize accelerator architecture and device placement strategy, using novel algorithms that improve the balance of computational resources, memory usage, and data distribution.

Abstract

Distributed execution of deep learning training involves a dynamic interplay between hardware accelerator architecture and device placement strategy. This work is the first to co-optimize the two, using novel algorithms that improve the balance of computational resources, memory usage, and data distribution. Our architecture search determines the quantity and dimensionality of tensor and vector units, as well as the on-chip and off-chip memory configurations. It also determines the microbatch size and decides whether to recompute or stash activations, balancing the memory footprint of training and storage size. For each explored architecture configuration, we use an Integer Linear Program (ILP) to find the optimal schedule for executing operators on the accelerator. The ILP results then feed into a dynamic programming solution that identifies the most effective device placement strategy, combining data, pipeline, and tensor model parallelism across multiple accelerators. Our approach achieves higher throughput on large language models than the state-of-the-art TPUv4 and the Spotlight accelerator search framework. The entire source code of PHAZE is available at https://github.com/msr-fiddle/phaze.
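To make the dynamic programming step more concrete, the following is a minimal sketch of one ingredient of device placement: splitting a chain of layers into contiguous pipeline stages so that the slowest stage is as fast as possible. It is a simplified illustration, not the PHAZE algorithm, which also combines data and tensor model parallelism. The function name `partition_layers`, the per-layer latencies, and the chain-shaped model are assumptions made for this example; in the workflow described above, per-layer latencies would come from the ILP schedule.

```python
from typing import List, Tuple


def partition_layers(latencies: List[float], num_stages: int) -> Tuple[float, List[int]]:
    """Split a chain of layers into contiguous pipeline stages.

    Returns the smallest achievable bottleneck (latency of the slowest stage)
    and the exclusive end index of each stage. Hypothetical helper for
    illustration only.
    """
    n = len(latencies)
    if not 1 <= num_stages <= n:
        raise ValueError("need between 1 and len(latencies) stages")

    # prefix[i] is the total latency of layers 0..i-1.
    prefix = [0.0]
    for t in latencies:
        prefix.append(prefix[-1] + t)

    inf = float("inf")
    # best[k][i]: minimal bottleneck when the first i layers use k stages.
    best = [[inf] * (n + 1) for _ in range(num_stages + 1)]
    # cut[k][i]: where the last of the k stages begins in that optimum.
    cut = [[0] * (n + 1) for _ in range(num_stages + 1)]
    best[0][0] = 0.0

    for k in range(1, num_stages + 1):
        for i in range(k, n + 1):
            for j in range(k - 1, i):
                # The last stage holds layers j..i-1; the rest use k-1 stages.
                candidate = max(best[k - 1][j], prefix[i] - prefix[j])
                if candidate < best[k][i]:
                    best[k][i] = candidate
                    cut[k][i] = j

    # Walk the cut table backwards to recover the stage boundaries.
    bounds = []
    i = n
    for k in range(num_stages, 0, -1):
        bounds.append(i)
        i = cut[k][i]
    bounds.reverse()
    return best[num_stages][n], bounds


if __name__ == "__main__":
    bottleneck, bounds = partition_layers([4, 2, 3, 1, 5, 2], num_stages=3)
    print(bottleneck, bounds)
```

The table has `num_stages * n` entries and each entry scans up to `n` split points, so the sketch runs in O(num_stages * n^2) time, which is cheap enough to repeat for every architecture configuration explored.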

Paper Structure (24 sections, 9 equations, 9 figures, 5 tables, 1 algorithm)

Figures (9)

  • Figure 1: (a) The template for an accelerator architecture consisting of hierarchical compute units, on-chip buffers, and off-chip HBM. A core can be of the type tensor, vector, or fused. (b) An example training operator graph that is used to optimize the accelerator and the distribution strategy. (c) The combined search space explored by Phaze.
  • Figure 2: The Phaze workflow -- the graph extractor extracts layer and corresponding operator graphs, which are annotated with memory footprint and latency estimates. The solver iteratively explores each valid architecture configuration.
  • Figure 3: ILP constraints. The optimization objective is to minimize the total latency/makespan $T$ of the layer (layer slice).
  • Figure 4: Throughput comparison of the Phaze Common and Per Model configurations against TPUv4 and Spotlight-generated architectures. DP denotes Device Placement.
  • Figure 5: The solver is executed for every architecture that is explored. It takes as input the layer graph and the corresponding operator graph. The ILP determines the optimal schedule and latency for every layer/layer slice. The dynamic programming optimization then uses this information to determine the training distribution strategy.
  • ...and 4 more figures
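
The captions for Figures 1 and 3 describe operators scheduled onto tensor and vector cores with the goal of minimizing the makespan $T$. As a rough illustration of what a makespan is, the sketch below computes it for a tiny operator graph using a greedy list scheduler. This is deliberately not an ILP: a greedy schedule is feasible, so its makespan is only an upper bound on the optimum that the ILP in the paper would find. The function name `greedy_makespan`, the operator names, and the latencies are all assumptions made for this example.

```python
import heapq
from graphlib import TopologicalSorter  # Python 3.9+
from typing import Dict, List, Tuple


def greedy_makespan(
    ops: Dict[str, Tuple[str, float]],
    deps: Dict[str, List[str]],
    cores: Dict[str, int],
) -> float:
    """Greedy list scheduling of an operator graph (hypothetical helper).

    ops   maps operator name -> (core type it runs on, latency)
    deps  maps operator name -> names of operators it must wait for
    cores maps core type -> number of cores of that type
    """
    # One min-heap per core type holding the time each core becomes free.
    free = {kind: [0.0] * count for kind, count in cores.items()}
    finish: Dict[str, float] = {}

    # Visit operators so that every predecessor is scheduled first.
    graph = {name: deps.get(name, []) for name in ops}
    for name in TopologicalSorter(graph).static_order():
        kind, latency = ops[name]
        # An operator is ready once all of its predecessors have finished.
        ready = max((finish[p] for p in deps.get(name, [])), default=0.0)
        # Run it on the core of the right type that frees up earliest.
        core_free = heapq.heappop(free[kind])
        start = max(ready, core_free)
        finish[name] = start + latency
        heapq.heappush(free[kind], finish[name])

    # The makespan is the time at which the last operator finishes.
    return max(finish.values(), default=0.0)


if __name__ == "__main__":
    ops = {
        "matmul_q": ("tensor", 4.0),
        "matmul_k": ("tensor", 4.0),
        "add": ("vector", 1.0),
        "softmax": ("vector", 2.0),
    }
    deps = {"add": ["matmul_q", "matmul_k"], "softmax": ["add"]}
    print(greedy_makespan(ops, deps, {"tensor": 1, "vector": 1}))
    print(greedy_makespan(ops, deps, {"tensor": 2, "vector": 1}))
```

Adding a second tensor core lets the two independent matrix multiplications overlap, which shortens the makespan. This is the kind of interaction between core counts and operator schedules that the architecture search has to account for.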