MAD Max Beyond Single-Node: Enabling Large Machine Learning Model Acceleration on Distributed Systems

Samuel Hsia, Alicia Golden, Bilge Acun, Newsha Ardalani, Zachary DeVito, Gu-Yeon Wei, David Brooks, Carole-Jean Wu

TL;DR

The paper tackles the high cost and latency of training and serving trillion-parameter ML models by revealing substantial non-overlapped communication overhead in large-scale systems. It introduces MAD-Max, a trace-based performance model that predicts per-layer compute and inter-device communication to enable agile design-space exploration of hierarchical parallelism and hardware-software co-design. Validated on real-world DLRM and LLM workloads, MAD-Max demonstrates throughput gains up to $2.24\times$ for pre-training and $5.27\times$ for inference, while highlighting the need for coordinated advances across compute, memory, and interconnect. By providing actionable insights into levelized parallelism and system design and by open-sourcing the framework, the work offers a practical pathway to accelerate large ML model training and deployment.

Abstract

Training and deploying large-scale machine learning models is time-consuming, requires significant distributed computing infrastructures, and incurs high operational costs. Our analysis, grounded in real-world large model training on datacenter-scale infrastructures, reveals that 14-32% of all GPU hours are spent on communication with no overlapping computation. To minimize this outstanding communication latency and other inherent at-scale inefficiencies, we introduce an agile performance modeling framework, MAD-Max. This framework is designed to optimize parallelization strategies and facilitate hardware-software co-design opportunities. Through the application of MAD-Max to a suite of real-world large-scale ML models on state-of-the-art GPU clusters, we showcase potential throughput enhancements of up to 2.24x for pre-training and up to 5.2x for inference scenarios.
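
To make "communication with no overlapping computation" concrete, the following is a minimal sketch of how exposed communication time could be measured from two timelines of busy intervals, one for compute kernels and one for communication collectives. It is not the MAD-Max implementation: the function names (`merge`, `exposed_communication`, `exposed_fraction`), the interval representation, and the sample timings are all assumptions made for illustration.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start, end) in milliseconds; hypothetical trace format


def merge(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals into a sorted, disjoint list."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def exposed_communication(compute: List[Interval], comm: List[Interval]) -> float:
    """Total communication time during which no compute interval is active."""
    busy = merge(compute)
    exposed = 0.0
    for c_start, c_end in merge(comm):
        cursor = c_start  # earliest communication time not yet accounted for
        for k_start, k_end in busy:
            if k_end <= cursor:
                continue  # compute interval ends before the uncovered region
            if k_start >= c_end:
                break  # compute interval starts after this collective ends
            if k_start > cursor:
                exposed += k_start - cursor  # gap with no compute running
            cursor = max(cursor, k_end)
            if cursor >= c_end:
                break
        if cursor < c_end:
            exposed += c_end - cursor  # tail of the collective left uncovered
    return exposed


def exposed_fraction(compute: List[Interval], comm: List[Interval]) -> float:
    """Exposed communication as a fraction of the whole step duration."""
    everything = compute + comm
    step_start = min(start for start, _ in everything)
    step_end = max(end for _, end in everything)
    return exposed_communication(compute, comm) / (step_end - step_start)


# Made-up timings for one training step, not measurements from the paper.
compute_stream = [(0, 10), (12, 20)]
comm_stream = [(8, 14), (18, 25)]
print(exposed_communication(compute_stream, comm_stream))
print(exposed_fraction(compute_stream, comm_stream))
```

In this toy timeline the first collective is hidden except for the gap between the two compute intervals, and the second collective runs past the end of compute, so both contribute exposed time. Better overlap would shrink those uncovered segments without changing the total amount of communication.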

Paper Structure (15 sections, 20 figures, 4 tables)

Figures (20)

  • Figure 1: Our performance model -- MAD-Max -- improves upon the resource-performance Pareto frontier of large-scale ML workloads by identifying new hardware-software mappings and solutions.
  • Figure 2: For recommendation models, applying FSDP, TP, or DDP on an MLP layer requires either sharding or replicating parameters and communicating either parameters (orange) or partial sums (yellow). In this example, the embedding table's prohibitively large capacity requires it to be sharded.
  • Figure 3: For large ML models, the requirements for key system resources -- (a) capacity, (b) compute, (c) bandwidth -- vary by orders of magnitude.
  • Figure 4: (a) Compute and exposed communication make up the majority of observed at-scale training cycles. (b) The degree of communication overlapped with computation and data loading is workload dependent. A higher degree of overlap indicates better latency hiding of communication collectives. (c) The breakdown of communication collectives also varies by workload.
  • Figure 5: Our performance model works in five stages. After workload specifications and layer execution orders are established, traces for individual layer execution are generated and then combined with required communication collectives to form complete computation and communication streams.
  • ...and 15 more figures