Towards a Flexible and High-Fidelity Approach to Distributed DNN Training Emulation
Banruo Liu, Mubarak Adetunji Ojewale, Yuhan Ding, Marco Canini
TL;DR
Profiling distributed DNN training on large clusters is costly and disruptive. The authors present NeuronaBox, an emulator that runs the training workload on a subset of real nodes and emulates the rest of the networked environment, preserving realistic NCCL-based communication while omitting GPU compute on the emulator. A proof-of-concept in a two-node setup shows high fidelity, with less than 1% error in time-per-iteration across multiple models, and low CPU overhead. The approach enables rapid what-if analyses and design-space exploration of distributed training configurations without large-scale hardware deployments.
Abstract
We propose NeuronaBox, a flexible, user-friendly, and high-fidelity approach to emulating DNN training workloads. We argue that performance can be observed accurately by executing the training workload on a subset of real nodes while emulating the networked execution environment and the collective communication operations. Initial results from a proof-of-concept implementation show that NeuronaBox replicates the behavior of actual systems with high accuracy: measurements taken under emulation differ from those on the real system by less than 1%.
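To make the fidelity figure concrete, the following minimal sketch shows one plausible way to compute a percentage error in time-per-iteration between a real run and an emulated run. The function name `relative_error_pct`, the use of the mean over iterations, and the sample timings are all assumptions made for illustration; they are not taken from the paper, and the authors' exact metric may differ.

```python
from statistics import mean


def relative_error_pct(real_times, emulated_times):
    """Percent difference between the mean emulated and mean real
    time-per-iteration, relative to the real system.

    Hypothetical helper: the paper may define its error differently.
    """
    real_mean = mean(real_times)
    if real_mean <= 0:
        raise ValueError("real time-per-iteration must be positive")
    return abs(mean(emulated_times) - real_mean) / real_mean * 100.0


if __name__ == "__main__":
    # Placeholder timings in seconds, invented for this example only.
    real = [0.200, 0.202, 0.198]
    emulated = [0.201, 0.203, 0.199]
    print(f"error: {relative_error_pct(real, emulated):.2f}%")
```

Under this definition, the claim of "less than 1%" would correspond to `relative_error_pct(...) < 1.0` for each model evaluated.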
