COSMIC: Enabling Full-Stack Co-Design and Optimization of Distributed Machine Learning Systems

Aditi Raju, Jared Ni, William Won, Changhai Man, Srivatsan Krishnan, Srinivas Sridharan, Amir Yazdanbakhsh, Tushar Krishna, Vijay Janapa Reddi

TL;DR

Large transformer models require distributed ML systems whose design spans several interacting layers. COSMIC combines the Parameter Set Architecture (Psa), end-to-end simulation (ASTRA-sim), and agent-based exploration (ArchGym) to enable efficient full-stack design space exploration across the workload, collective, network, and compute layers. The framework reports substantial performance and cost benefits over isolated optimizations on multiple models of up to 175B parameters, and scales to 2,048 NPUs with diverse workloads. Full-stack co-design enables automatic cross-layer optimization and reveals non-obvious, high-performing configurations that reduce end-to-end runtime and resource cost.

Abstract

Large-scale machine learning models necessitate distributed systems, posing significant design challenges due to the large parameter space across distinct design stacks. Existing studies often focus on optimizing individual system aspects in isolation. This work challenges this limitation and introduces COSMIC, a full-stack distributed machine learning systems environment enabling end-to-end simulation and agent-based design space exploration. To facilitate efficient exploration and optimization across the entire stack, we introduce the Parameter Set Architecture (an abstraction concept analogous to the instruction set architecture), which abstracts away the configuration complexities of agent-based search methods. Case studies demonstrate COSMIC's ability to consolidate parameters across multiple layers of design abstraction, discovering eight non-obvious high-performance system configurations across four transformer-based models with up to 175 billion parameters. By optimizing across the stack, COSMIC delivers 1.50-48.41x higher performance than isolated single-stack optimization.
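
The contrast between single-stack and full-stack search can be illustrated with a toy example. The sketch below is not COSMIC's implementation. The knob names, the `BASELINE` configuration, and the `toy_runtime` cost model are hypothetical stand-ins for the Psa schema and the end-to-end simulator, and exhaustive enumeration stands in for an agent-based search. It only shows why a search that may vary every layer can never do worse than one restricted to a single layer under the same cost model.

```python
from itertools import product

# Hypothetical parameter set: one group of knobs per layer named in the abstract.
# Knob names and values are illustrative assumptions, not COSMIC's actual schema.
PARAMETER_SET = {
    "workload": {"dp": [1, 2, 4, 8]},
    "collective": {"algorithm": ["ring", "tree"]},
    "network": {"bandwidth_gbps": [100, 400]},
    "compute": {"tflops": [100, 300]},
}

# Hypothetical starting design; layers that are not searched keep these values.
BASELINE = {
    "workload.dp": 1,
    "collective.algorithm": "ring",
    "network.bandwidth_gbps": 100,
    "compute.tflops": 100,
}


def flatten(space):
    """Turn the layered parameter set into flat 'layer.knob' -> values entries."""
    return {
        f"{layer}.{knob}": values
        for layer, knobs in space.items()
        for knob, values in knobs.items()
    }


def toy_runtime(cfg):
    """Made-up analytical cost model standing in for an end-to-end simulator."""
    compute = 600.0 / cfg["compute.tflops"]
    dp = cfg["workload.dp"]
    volume = 80.0 / dp + 10.0 * dp  # toy communication volume
    factor = 2.0 if cfg["collective.algorithm"] == "ring" else 1.0
    comm = factor * volume * 100.0 / cfg["network.bandwidth_gbps"] / 10.0
    return compute + comm


def search(free_layers):
    """Exhaustively search the knobs of the given layers; others stay at BASELINE."""
    knobs = {
        name: values
        for name, values in flatten(PARAMETER_SET).items()
        if name.split(".")[0] in free_layers
    }
    names = list(knobs)
    best_cfg, best_cost = dict(BASELINE), toy_runtime(BASELINE)
    for values in product(*(knobs[name] for name in names)):
        cfg = {**BASELINE, **dict(zip(names, values))}
        cost = toy_runtime(cfg)
        if cost < best_cost:
            best_cfg, best_cost = cfg, cost
    return best_cfg, best_cost


workload_cfg, workload_cost = search({"workload"})
full_cfg, full_cost = search(set(PARAMETER_SET))
print("workload-only:", workload_cfg, workload_cost)
print("full-stack:   ", full_cfg, full_cost)
```

Because every workload-only candidate is also a point in the full-stack space, the full-stack result is at least as good by construction. How large the gap is depends entirely on the cost model, which here is invented for illustration.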

Paper Structure

This paper contains 27 sections, 2 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: Parallelization strategies common in distributed ML: DP, SP, PP, and TP.
  • Figure 2: Common collective communication patterns incurred in distributed ML for NPU synchronization.
  • Figure 3: Network topology building blocks we considered to construct multi-dimensional topologies.
  • Figure 4: (a) shows the latency spread for training GPT3-175B when varying only the workload parameters (i.e., workload-only search for System 2; see the experiments subsection). Notably, the parallelization that is optimal for the target cluster achieved 64.5$\times$ better performance than the worst case, highlighting the co-optimization potential. (b)-(d) show workload+network, workload+collective, and full-stack optimization results for GPT3-175B. (e) shows the latency spread for workload-only DSE for GPT3-13B, (f) workload-only DSE for ViT-Large, (g) full-stack DSE for ViT-Large, and (h) full-stack DSE for ViT-Base.
  • Figure 5: Summary of (i) Psa to capture the full-stack distributed ML design space and (ii) the ML-based Psa optimization framework (COSMIC) to design new distributed ML infrastructures.
  • ...and 5 more figures