COSMIC: Enabling Full-Stack Co-Design and Optimization of Distributed Machine Learning Systems
Aditi Raju, Jared Ni, William Won, Changhai Man, Srivatsan Krishnan, Srinivas Sridharan, Amir Yazdanbakhsh, Tushar Krishna, Vijay Janapa Reddi
TL;DR
Large transformer models require distributed ML systems with complex cross-layer design. COSMIC combines the Parameter Set Architecture (PSA), end-to-end simulation (ASTRA-sim), and agent-based exploration (ArchGym) to enable efficient full-stack design space exploration across the workload, collective, network, and compute layers. The framework demonstrates substantial performance and cost benefits over isolated optimizations on multiple models of up to 175B parameters, and scales to 2,048 NPUs with diverse workloads. This full-stack co-design approach enables automatic, cross-layer optimization and reveals non-obvious, high-performing configurations that reduce end-to-end runtime and resource cost.
Abstract
Large-scale machine learning models necessitate distributed systems, which pose significant design challenges because of the large parameter space spanning distinct design stacks. Existing studies often optimize individual aspects of the system in isolation. This work addresses that limitation by introducing COSMIC, a full-stack environment for distributed machine learning systems that enables end-to-end simulation and agent-based design space exploration. To facilitate efficient exploration and optimization across the entire stack, we introduce the Parameter Set Architecture (PSA), an abstraction analogous to the instruction set architecture that abstracts away configuration complexity for agent-based search methods. Case studies demonstrate COSMIC's ability to consolidate parameters across multiple layers of design abstraction, discovering eight non-obvious, high-performance system configurations across four transformer-based models with up to 175 billion parameters. By optimizing across the stack, COSMIC delivers 1.50x to 48.41x higher performance than isolated single-stack optimization.
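To make the idea of a consolidated parameter set concrete, the following is a minimal Python sketch, not COSMIC's actual PSA schema or API. It assumes a hypothetical flat dictionary of knobs named by stack layer, a made-up analytical cost model (`toy_runtime`) standing in for a simulator such as ASTRA-sim, and a plain random-search loop standing in for an ArchGym agent. All knob names, value ranges, and constants are illustrative assumptions.

```python
import math
import random

# Hypothetical parameter set: one flat namespace covering every stack layer.
# Knob names and legal values are illustrative, not taken from the paper.
PARAMETER_SET = {
    "workload.data_parallel_degree": [1, 2, 4, 8],
    "collective.all_reduce_algorithm": ["ring", "tree"],
    "network.link_bandwidth_gbps": [100, 200, 400],
    "compute.peak_tflops": [100, 200],
}


def toy_runtime(config):
    """Made-up cost model (arbitrary units); a stand-in for a real simulator."""
    dp = config["workload.data_parallel_degree"]
    # Compute time shrinks with more replicas and faster accelerators.
    compute_time = 1000.0 / (config["compute.peak_tflops"] * dp)
    # Communication steps depend on the collective algorithm.
    if dp == 1:
        comm_steps = 0
    elif config["collective.all_reduce_algorithm"] == "ring":
        comm_steps = 2 * (dp - 1)
    else:
        comm_steps = 2 * math.ceil(math.log2(dp))
    comm_time = comm_steps * 50.0 / config["network.link_bandwidth_gbps"]
    return compute_time + comm_time


def random_search(parameter_set, evaluate, steps, seed=0, frozen=None):
    """Sample configurations uniformly and keep the cheapest one.

    `frozen` pins selected knobs to fixed values, which mimics optimizing
    a single layer in isolation while the other layers stay at a baseline.
    """
    rng = random.Random(seed)
    best_config, best_cost = None, float("inf")
    for _ in range(steps):
        config = {name: rng.choice(values) for name, values in parameter_set.items()}
        if frozen:
            config.update(frozen)
        cost = evaluate(config)
        if cost < best_cost:
            best_config, best_cost = config, cost
    return best_config, best_cost


# Full-stack search: every knob is free.
full_config, full_cost = random_search(PARAMETER_SET, toy_runtime, steps=50)

# Isolated search: only the network knob varies; the rest stay at a baseline.
baseline = {
    "workload.data_parallel_degree": 8,
    "collective.all_reduce_algorithm": "ring",
    "compute.peak_tflops": 100,
}
network_only_config, network_only_cost = random_search(
    PARAMETER_SET, toy_runtime, steps=50, frozen=baseline
)
```

Because the search loop only sees knob names and legal values, the same agent can explore one layer or the whole stack without knowing what any knob means. That separation of concerns is what the analogy to an instruction set architecture suggests, although the real PSA may be structured quite differently.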
