Towards a Flexible and High-Fidelity Approach to Distributed DNN Training Emulation

Banruo Liu; Mubarak Adetunji Ojewale; Yuhan Ding; Marco Canini

TL;DR

Profiling distributed DNN training on large clusters is costly and disruptive. The authors present NeuronaBox, an emulator that executes the training workload on a subset of real nodes and emulates the rest of the networked environment, preserving realistic NCCL-based communication while omitting GPU compute on the emulator. The proof-of-concept demonstrates high fidelity, with less than 1% error in time per iteration across multiple models in a two-node setup, and incurs low CPU overhead. This approach enables rapid what-if analyses and design-space exploration for distributed training configurations without requiring large-scale hardware deployments.

Abstract

We propose NeuronaBox, a flexible, user-friendly, and high-fidelity approach to emulate DNN training workloads. We argue that to accurately observe performance, it is possible to execute the training workload on a subset of real nodes and emulate the networked execution environment along with the collective communication operations. Initial results from a proof-of-concept implementation show that NeuronaBox replicates the behavior of actual systems with high accuracy, with an error margin of less than 1% between the emulated measurements and the real system.

Paper Structure (12 sections, 4 figures, 2 tables)

Introduction
Proposed Approach
Initialization
Emulation in a Uniform Scenario
Extension to a Non-uniform Scenario
Proof-of-concept Implementation
Preliminary Experiments
Microbenchmark
End-to-end Training Emulation
What-if Analysis: Latency
Related work
Conclusion

Figures (4)

Figure 1: A training job running in a 4-node cluster (left) is emulated by executing a single real node ($N_0$) wrapped by NeuronaBox, which emulates the environment (right).
Figure 2: Overall workflow and architecture of NeuronaBox.
Figure 3: An example DAG for four-node ring all-reduce. The upper-left squares show the net result of all-reduce, where color-coded data from different nodes are reduced and then gathered at each node. The upper-right figure shows the dependency DAG for ${\mathcal{N}}$ ($N_0$). The lower figure shows how we merge the dependency DAGs of $N_1, N_2, N_3$ into ${\mathcal{E}}$. The cross-node dependencies from $send(x)$ to $recv(x)$ are omitted for clarity. For simplicity, only the first 4 steps of all-reduce are shown.
Figure 4: End-to-end training time per iteration for the BERT model (ms) vs. the additional delay injected into every all-reduce call (ms). Error bars are plotted in black.
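
The send/recv dependencies that Figure 3 describes for a four-node ring all-reduce can be illustrated with a small sketch. This is the textbook ring all-reduce (a reduce-scatter phase followed by an all-gather phase), not necessarily the exact schedule NCCL uses or the way NeuronaBox builds its DAG. The function names `ring_allreduce_schedule` and `simulate`, and the operation labels, are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

Op = Tuple[str, int, int]  # (kind, peer rank, chunk index)


def ring_allreduce_schedule(rank: int, n: int) -> List[Tuple[Op, Op]]:
    """Per-step (send, recv) pairs for one rank in an n-node ring all-reduce."""
    nxt, prv = (rank + 1) % n, (rank - 1) % n
    steps: List[Tuple[Op, Op]] = []
    # Phase 1, reduce-scatter: n-1 steps; received chunks are summed in place.
    for s in range(n - 1):
        steps.append((("send", nxt, (rank - s) % n),
                      ("recv_reduce", prv, (rank - s - 1) % n)))
    # Phase 2, all-gather: n-1 steps; received chunks overwrite local ones.
    for s in range(n - 1):
        steps.append((("send", nxt, (rank + 1 - s) % n),
                      ("recv_copy", prv, (rank - s) % n)))
    return steps


def simulate(n: int):
    """Run the schedule on toy data; return (final per-node data, expected sums)."""
    # data[r][c] is the value of chunk c held by node r (arbitrary toy values).
    data = [[(r + 1) * 10 + c for c in range(n)] for r in range(n)]
    expected = [sum(data[r][c] for r in range(n)) for c in range(n)]
    schedules = [ring_allreduce_schedule(r, n) for r in range(n)]
    for step in range(2 * (n - 1)):
        # Snapshot all sends first so every node reads pre-step values.
        in_flight = {}
        for r in range(n):
            _, peer, chunk = schedules[r][step][0]
            in_flight[(peer, chunk)] = data[r][chunk]
        # Then apply each node's matching receive.
        for r in range(n):
            kind, _, chunk = schedules[r][step][1]
            value = in_flight[(r, chunk)]
            if kind == "recv_reduce":
                data[r][chunk] += value
            else:
                data[r][chunk] = value
    return data, expected


if __name__ == "__main__":
    final, expected = simulate(4)
    # Every node should end up holding the fully reduced vector.
    assert all(row == expected for row in final)
    for step, (send, recv) in enumerate(ring_allreduce_schedule(0, 4)):
        print(step, send, recv)
```

With four nodes the schedule has six steps per node, so the four steps shown in Figure 3 would correspond to the three reduce-scatter steps plus the first all-gather step under this formulation.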
