Echo: Simulating Distributed Training At Scale

Yicheng Feng, Yuetao Chen, Kaiwen Chen, Jingzong Li, Tianyuan Wu, Peng Cheng, Chuan Wu, Wei Wang, Tsung-Yi Ho, Hong Xu

TL;DR

Echo simulates large-scale distributed training without a full-scale deployment by combining ex-situ workload tracing, white-box NCCL-based communication modeling, and a black-box predictor for the slowdown caused by overlapping communication and computation. It reports an average training-step error of around 8–9% and substantially shorter simulation times, which makes exploring 1K-GPU-scale configurations practical. The work presents a modular architecture (workload tracer, CC estimator, timeline composer, and validator) backed by a profiling database, and validates it on GPT-175B trained with Megatron-LM, among other workloads, across multi-GPU clusters. The results suggest Echo can be a practical, open-source tool for researchers and practitioners to plan, optimize, and scale distributed ML training.

Abstract

Simulation offers unique value for both enumeration and extrapolation, and it is becoming increasingly important for managing massive machine learning (ML) clusters and large-scale distributed training jobs. In this paper, we build Echo to tackle three key challenges in large-scale training simulation: (1) tracing the runtime training workload of each device in an ex-situ fashion, so that a single device suffices to obtain the actual execution graphs of 1K-GPU training, (2) accurately estimating collective communication without the high overhead of discrete-event-based network simulation, and (3) accounting for the interference-induced computation slowdown that arises when communication and computation kernels overlap on the same device. For GPT-175B on a 96-GPU H800 cluster with 3D parallelism on Megatron-LM, Echo delivers an average training-step error of 8%, roughly 3x lower than state-of-the-art simulators, in under 2 minutes.
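To make challenges (2) and (3) concrete, the following is a minimal sketch of the kind of analytical estimate a simulator can use in place of discrete-event network simulation. It is not Echo's actual model: it uses the textbook alpha-beta cost model for a ring all-reduce, a plain multiplicative slowdown factor for overlapped compute kernels, and a relative-error metric. All function names, parameters, and numeric values are hypothetical and chosen only for illustration.

```python
def ring_allreduce_time(message_bytes, num_gpus, link_bandwidth_bytes_per_s, latency_s):
    """Alpha-beta estimate of a ring all-reduce (assumed model, not Echo's).

    A ring all-reduce takes 2 * (n - 1) steps (reduce-scatter followed by
    all-gather), and each step moves message_bytes / n bytes per GPU.
    """
    if num_gpus < 2:
        return 0.0  # nothing to communicate on a single GPU
    steps = 2 * (num_gpus - 1)
    bytes_per_step = message_bytes / num_gpus
    return steps * (latency_s + bytes_per_step / link_bandwidth_bytes_per_s)


def overlapped_compute_time(isolated_time_s, slowdown_factor):
    """Scale an isolated kernel time by a slowdown factor (>= 1.0).

    The factor stands in for the output of a slowdown predictor; here it
    is simply passed in by the caller.
    """
    if slowdown_factor < 1.0:
        raise ValueError("slowdown_factor must be >= 1.0")
    return isolated_time_s * slowdown_factor


def relative_error(simulated_s, measured_s):
    """Relative error of a simulated step time against a measured one."""
    return abs(simulated_s - measured_s) / measured_s


if __name__ == "__main__":
    # Hypothetical inputs: 1 GiB gradient buffer, 8 GPUs, 100 GB/s links, 10 us latency.
    comm = ring_allreduce_time(2**30, 8, 100e9, 10e-6)
    comp = overlapped_compute_time(0.050, 1.2)
    print(f"estimated all-reduce time: {comm * 1e3:.3f} ms")
    print(f"estimated overlapped compute time: {comp * 1e3:.3f} ms")
```

A closed-form estimate like this costs a few arithmetic operations per collective, which is why analytical models run far faster than packet-level simulation; the trade-off is that effects such as congestion or topology-dependent algorithm choice have to be captured separately, for example through profiling.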

Paper Structure

This paper contains 30 sections, 2 equations, 17 figures, 10 tables.

Figures (17)

  • Figure 1: Overview of the technical stack and architecture for large-scale model training.
  • Figure 2: Bus bandwidth of all-reduce communication operation with varying message sizes and GPU counts, profiled on A100 clusters.
  • Figure 3: CDF of slowdown factor of various kernels in GPT2.
  • Figure 4: Echo architecture overview. The green components represent the core modules of Echo.
  • Figure 5: Echo's MPU module and workload tracer.
  • ...and 12 more figures