Optimizing Distributed Training Approaches for Scaling Neural Networks
Vishnu Vardhan Baligodugula, Fathi Amsaad
TL;DR
The paper addresses efficiency bottlenecks in distributed training of large neural networks. It compares data parallelism, model parallelism, and hybrid approaches, and introduces an adaptive scheduling algorithm (ASA) that dynamically assigns a parallelism strategy to each network component. Using profiling-based optimization and per-component strategy selection, ASA reduces training time while maintaining comparable accuracy on CIFAR-100 with ResNet-50 and ViT-B/16 on an 8-GPU NVLink cluster. The results show speedups over both single-device and static-hybrid baselines (the abstract reports 3.2x for hybrid parallelism over single-device training and an additional 18% efficiency improvement from adaptive scheduling), together with reduced communication overhead and improved scaling. Memory constraints drive the per-component strategy choices. The work shows the practical value of adaptive, memory-aware parallelism for accelerating training across diverse architectures, and points to extensions for larger models and heterogeneous hardware.
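To make the idea of memory-aware, per-component strategy selection concrete, here is a minimal Python sketch. It is not the authors' ASA: the `Component` fields, the cost model, and the names `choose_strategy` and `schedule` are all hypothetical, chosen only to illustrate how profiled costs and a memory limit could drive the choice between data and model parallelism for each part of a network.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """Hypothetical profile of one network component (all fields are assumptions)."""
    name: str
    memory_gb: float           # footprint of a full replica on one device
    compute_ms: float          # per-step compute time on a single device
    grad_sync_ms: float        # cost of synchronizing gradients across replicas
    activation_xfer_ms: float  # cost of moving activations between shards


def choose_strategy(c: Component, device_memory_gb: float, num_devices: int) -> str:
    """Pick "data" or "model" parallelism for one component.

    Toy cost model (an assumption, not taken from the paper):
      data parallel  = compute / N + gradient synchronization
      model parallel = compute / N + activation transfer
    """
    # Memory check first: data parallelism needs a full replica per device.
    if c.memory_gb > device_memory_gb:
        return "model"
    data_cost = c.compute_ms / num_devices + c.grad_sync_ms
    model_cost = c.compute_ms / num_devices + c.activation_xfer_ms
    # Ties go to data parallelism, the simpler strategy to run.
    return "data" if data_cost <= model_cost else "model"


def schedule(components, device_memory_gb: float, num_devices: int) -> dict:
    """Assign a strategy to every component independently."""
    return {
        c.name: choose_strategy(c, device_memory_gb, num_devices)
        for c in components
    }


if __name__ == "__main__":
    # Made-up profiles, for illustration only.
    profiles = [
        Component("conv_stem", 2.0, 40.0, 1.0, 8.0),
        Component("wide_head", 4.0, 10.0, 12.0, 2.0),
        Component("huge_block", 50.0, 30.0, 5.0, 5.0),
    ]
    for name, strategy in schedule(profiles, device_memory_gb=32.0, num_devices=8).items():
        print(f"{name}: {strategy}")
```

In this toy model, a component with cheap gradient synchronization stays data parallel, one with expensive synchronization but cheap activation transfer becomes model parallel, and one that cannot fit on a single device is forced to model parallelism regardless of cost. A real scheduler would also need to account for the cost of switching strategies between adjacent components, which this sketch ignores.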
Abstract
This paper presents a comparative analysis of distributed training strategies for large-scale neural networks, focusing on data parallelism, model parallelism, and hybrid approaches. We evaluate these strategies on image classification tasks using the CIFAR-100 dataset, measuring training time, convergence rate, and model accuracy. Our experimental results demonstrate that hybrid parallelism achieves a 3.2x speedup compared to single-device training while maintaining comparable accuracy. We propose an adaptive scheduling algorithm that dynamically switches between parallelism strategies based on network characteristics and available computational resources, resulting in an additional 18% improvement in training efficiency.
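The reported figures can be put in context with the standard definitions of speedup and parallel efficiency. The Python sketch below is illustrative only. It assumes the 3.2x figure was measured on all 8 GPUs of the cluster, and because the abstract does not say how the "additional 18% improvement" is defined, it shows two possible readings rather than asserting either one. All function names are hypothetical.

```python
def speedup(baseline_time: float, parallel_time: float) -> float:
    """Speedup = single-device time / distributed time."""
    return baseline_time / parallel_time


def parallel_efficiency(speedup_factor: float, num_devices: int) -> float:
    """Fraction of ideal linear scaling achieved (1.0 means perfect scaling)."""
    return speedup_factor / num_devices


def combined_if_time_reduction(base_speedup: float, reduction: float) -> float:
    """Reading A (assumption): training time drops by `reduction` relative to hybrid."""
    return base_speedup / (1.0 - reduction)


def combined_if_throughput_gain(base_speedup: float, gain: float) -> float:
    """Reading B (assumption): throughput rises by `gain` relative to hybrid."""
    return base_speedup * (1.0 + gain)


if __name__ == "__main__":
    hybrid = 3.2   # speedup reported in the abstract
    devices = 8    # assumption: the 3.2x run used all 8 GPUs
    print(f"parallel efficiency: {parallel_efficiency(hybrid, devices):.2f}")
    print(f"reading A (18% less time): {combined_if_time_reduction(hybrid, 0.18):.2f}x")
    print(f"reading B (18% more throughput): {combined_if_throughput_gain(hybrid, 0.18):.2f}x")
```

Under these assumptions, the hybrid result corresponds to a parallel efficiency of 0.40, and the combined speedup over single-device training falls somewhere around 3.8x to 3.9x depending on which reading of "18% improvement" applies.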
