MAD Max Beyond Single-Node: Enabling Large Machine Learning Model Acceleration on Distributed Systems
Samuel Hsia, Alicia Golden, Bilge Acun, Newsha Ardalani, Zachary DeVito, Gu-Yeon Wei, David Brooks, Carole-Jean Wu
TL;DR
The paper tackles the high cost and latency of training and serving trillion-parameter ML models by revealing substantial non-overlapped communication overhead in large-scale systems. It introduces MAD-Max, a trace-based performance model that predicts per-layer compute and inter-device communication to enable agile design-space exploration of hierarchical parallelism and hardware-software co-design. Validated on real-world DLRM and LLM workloads, MAD-Max demonstrates throughput gains up to $2.24\times$ for pre-training and $5.27\times$ for inference, while highlighting the need for coordinated advances across compute, memory, and interconnect. By providing actionable insights into levelized parallelism and system design and by open-sourcing the framework, the work offers a practical pathway to accelerate large ML model training and deployment.
Abstract
Training and deploying large-scale machine learning models is time-consuming, requires significant distributed computing infrastructures, and incurs high operational costs. Our analysis, grounded in real-world large model training on datacenter-scale infrastructures, reveals that 14~32% of all GPU hours are spent on communication with no overlapping computation. To minimize this outstanding communication latency and other inherent at-scale inefficiencies, we introduce an agile performance modeling framework, MAD-Max. This framework is designed to optimize parallelization strategies and facilitate hardware-software co-design opportunities. Through the application of MAD-Max to a suite of real-world large-scale ML models on state-of-the-art GPU clusters, we showcase potential throughput enhancements of up to 2.24x for pre-training and up to 5.2x for inference scenarios, respectively.
