STAGE: A Symbolic Tensor grAph GEnerator for distributed AI system co-design
Changhai Man, Joongun Park, Hanjiang Wu, Huan Xu, Srinivas Sridharan, Tushar Krishna
TL;DR
Real execution traces for distributed AI workloads are scarce and narrowly applicable. STAGE (Symbolic Tensor grAph GEnerator) addresses this by synthesizing high-fidelity execution graphs from a symbolic tensor representation. It supports diverse LLM architectures and parallelism strategies, including tensor- and graph-level distributions, and outputs Chakra-based execution graphs that encode compute, memory, and communication dependencies. The approach is validated against real traces and demonstrated at scales of up to 32K GPUs, while remaining adaptable to future system configurations and architectures. STAGE is publicly available and enables fast, systematic design-space exploration and system-level optimization for large-scale AI infrastructure.
Abstract
Optimizing the performance of large language models (LLMs) on large-scale AI training and inference systems requires a scalable and expressive mechanism to model distributed workload execution. Such modeling is essential for pre-deployment system-level optimizations (e.g., parallelization strategies) and design-space explorations. While recent efforts have proposed collecting execution traces from real systems, access to large-scale infrastructure remains limited to major cloud providers. Moreover, traces obtained from existing platforms cannot be easily adapted to study future, larger-scale system configurations. We introduce Symbolic Tensor grAph GEnerator (STAGE), a framework that synthesizes high-fidelity execution traces to accurately model LLM workloads. STAGE supports a comprehensive set of parallelization strategies, allowing users to systematically explore a wide spectrum of LLM architectures and system configurations. STAGE demonstrates its scalability by synthesizing high-fidelity LLM traces spanning over 32K GPUs, while preserving tensor-level accuracy in compute, memory, and communication. STAGE is publicly available to facilitate further research in distributed machine learning systems: https://github.com/astra-sim/symbolic_tensor_graph
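To make the idea of a symbolic tensor representation concrete, the following is a minimal illustrative sketch and not STAGE's actual API. The class `SymbolicTensor`, its fields, the axis names (`tp`), and the 2-byte element size are all hypothetical choices made for this example. It shows how a tensor described once with symbolic dimensions can be re-bound to different parallelism degrees without collecting a new trace.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class SymbolicTensor:
    """Hypothetical symbolic tensor: dims are symbol names, not numbers."""
    name: str
    dims: tuple              # e.g. ("hidden", "ffn")
    shard: dict              # dim name -> parallel axis name, e.g. {"ffn": "tp"}
    bytes_per_elem: int = 2  # assumption: 16-bit elements

    def local_shape(self, sizes, degrees):
        """Per-device shape once symbols are bound to concrete values."""
        shape = []
        for dim in self.dims:
            # Unsharded dims have no axis, so their degree defaults to 1.
            degree = degrees.get(self.shard.get(dim), 1)
            if sizes[dim] % degree != 0:
                raise ValueError(f"{dim}={sizes[dim]} not divisible by {degree}")
            shape.append(sizes[dim] // degree)
        return tuple(shape)

    def local_bytes(self, sizes, degrees):
        """Per-device memory footprint in bytes."""
        return prod(self.local_shape(sizes, degrees)) * self.bytes_per_elem


# One symbolic description, two system configurations.
weight = SymbolicTensor("mlp_weight", ("hidden", "ffn"), {"ffn": "tp"})
sizes = {"hidden": 4096, "ffn": 16384}
shape_tp8 = weight.local_shape(sizes, {"tp": 8})
shape_tp64 = weight.local_shape(sizes, {"tp": 64})
bytes_tp8 = weight.local_bytes(sizes, {"tp": 8})
```

Only the bindings passed to `local_shape` change between the two configurations; the symbolic description stays fixed. This is the property that lets a generator of this kind target system sizes for which no real trace exists.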
