
ASTRA-sim2.0: Modeling Hierarchical Networks and Disaggregated Systems for Large-model Training at Scale

William Won, Taekyung Heo, Saeed Rashidi, Srinivas Sridharan, Sudarshan Srinivasan, Tushar Krishna

TL;DR

This paper presents ASTRA-sim2.0, an extended framework for rapid design-space exploration of large-scale distributed training that supports arbitrary parallelization strategies, hierarchical multi-dimensional networks, and disaggregated memory systems. It introduces a graph-based execution engine, a topology-building taxonomy, an analytical network backend, and memory models that capture in-switch collectives and remote memory pools. The analytical backend is validated against real-system measurements and runs substantially faster than cycle-accurate simulation. Case studies show that, with suitable scheduling and parallelization design, conventional scale-out networks can match wafer-scale performance, while wafer-scale configurations can offer up to about $2.51\times$ improvement in certain regimes; disaggregated-memory configurations can deliver up to $4.6\times$ speedup in exposed time for MoE workloads. Overall, ASTRA-sim2.0 enables swift, informed hardware-software co-design of future distributed training platforms at scale.

Abstract

As deep learning models and input data scale at an unprecedented rate, moving towards distributed training platforms is inevitable in order to fit the model and increase training throughput. State-of-the-art approaches and techniques, such as wafer-scale nodes, multi-dimensional network topologies, disaggregated memory systems, and parallelization strategies, have been actively adopted by emerging distributed training systems. This results in a complex SW/HW co-design stack for distributed training, necessitating a modeling/simulation infrastructure for design-space exploration. In this paper, we extend the open-source ASTRA-sim infrastructure and endow it with the capabilities to model state-of-the-art and emerging distributed training models and platforms. More specifically, (i) we enable ASTRA-sim to support arbitrary model parallelization strategies via a graph-based training-loop implementation, (ii) we implement a parameterizable multi-dimensional heterogeneous topology generation infrastructure with analytical performance estimates, enabling simulation of target systems at scale, and (iii) we enhance the memory system modeling to support accurate modeling of in-network collective communication and disaggregated memory systems. With such capabilities, we run comprehensive case studies targeting emerging distributed models and platforms. This infrastructure lets system designers swiftly traverse the complex co-design stack and gives meaningful insights when designing and deploying distributed training platforms at scale.

Paper Structure

This paper contains 21 sections, 9 equations, 13 figures, and 5 tables.

Figures (13)

  • Figure 1: Overview of Proposed Infrastructure for Modeling Next-generation Training platforms. The components extended in ASTRA-sim2.0 from the original ASTRA-sim to model emerging platforms are marked in bold.
  • Figure 2: Definition of Reduce-Scatter, All-Gather, All-Reduce, and All-to-All collective communication patterns.
  • Figure 3: (a) Hierarchical topology building blocks: Ring, FullyConnected, and Switch; (b) multi-dimensional network topologies are created by stacking up network building blocks; (c) multi-dimensional hierarchical topology examples, their shape notations, and corresponding distributed training framework.
  • Figure 4: Analytical network backend validation over real system measurements for All-Reduce collectives ranging from 64 MB to 1.5 GB.
  • Figure 5: Various memory pool architectures.
  • ...and 8 more figures
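
The four collective patterns named in the Figure 2 caption can be illustrated with a minimal, single-process Python sketch. It is not ASTRA-sim code: the function names, the list-of-lists buffer layout, and the choice of summation as the reduction are assumptions made for illustration, and it models only what data each rank ends up with, not how long the communication takes.

```python
from typing import List

# Illustrative layout (an assumption): bufs[r][i] is chunk i held by rank r,
# and every rank holds exactly one chunk per rank.

def reduce_scatter(bufs: List[List[int]]) -> List[int]:
    """Rank i ends with the sum of chunk i taken across all ranks."""
    n = len(bufs)
    return [sum(bufs[r][i] for r in range(n)) for i in range(n)]

def all_gather(chunks: List[int]) -> List[List[int]]:
    """Each rank starts with one chunk; every rank ends with all chunks."""
    return [list(chunks) for _ in chunks]

def all_reduce(bufs: List[List[int]]) -> List[List[int]]:
    """Every rank ends with the full reduced buffer.

    Expressed here as Reduce-Scatter followed by All-Gather, a common
    decomposition of All-Reduce.
    """
    return all_gather(reduce_scatter(bufs))

def all_to_all(bufs: List[List[int]]) -> List[List[int]]:
    """Rank src sends chunk dst to rank dst: a transpose of the buffers."""
    n = len(bufs)
    return [[bufs[src][dst] for src in range(n)] for dst in range(n)]

if __name__ == "__main__":
    bufs = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
    print(reduce_scatter(bufs))  # [111, 222, 333]
    print(all_reduce(bufs))      # three copies of [111, 222, 333]
    print(all_to_all(bufs))      # [[1, 10, 100], [2, 20, 200], [3, 30, 300]]
```

Treating All-Reduce as Reduce-Scatter plus All-Gather keeps the sketch short and makes the relationship between the patterns explicit; a simulator would additionally need a cost model for each step.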