Optimizing Distributed Training Approaches for Scaling Neural Networks

Vishnu Vardhan Baligodugula, Fathi Amsaad

TL;DR

The paper addresses efficiency bottlenecks in distributed training of large neural networks. It compares data parallelism, model parallelism, and hybrid approaches, and introduces an adaptive scheduling algorithm (ASA) that dynamically assigns a parallelism strategy to each network component. Using profiling-based optimization and per-component strategy selection, ASA reduces training time while maintaining comparable accuracy on CIFAR-100 with ResNet-50 and ViT-B/16 on an 8-GPU NVLink cluster. Hybrid parallelism reaches a 3.2x speedup over single-device training, and ASA adds a further 18% improvement in training efficiency over the static hybrid baseline, along with reduced communication overhead and improved scaling. Memory considerations drive the per-component strategy choices. The work shows the practical value of adaptive, memory-aware parallelism for accelerating training across diverse architectures, and points to extensions for larger models and heterogeneous hardware.

Abstract

This paper presents a comparative analysis of distributed training strategies for large-scale neural networks, focusing on data parallelism, model parallelism, and hybrid approaches. We evaluate these strategies on image classification tasks using the CIFAR-100 dataset, measuring training time, convergence rate, and model accuracy. Our experimental results demonstrate that hybrid parallelism achieves a 3.2x speedup compared to single-device training while maintaining comparable accuracy. We propose an adaptive scheduling algorithm that dynamically switches between parallelism strategies based on network characteristics and available computational resources, resulting in an additional 18% improvement in training efficiency.
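The idea of choosing a parallelism strategy per network component from profiled costs can be illustrated with a small sketch. This is not the paper's ASA, whose details are not given in this summary; it is a toy cost model under stated assumptions. The `Component` fields, the `choose_strategy` function, and all numbers are hypothetical, and the memory and time formulas are deliberate simplifications.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """Hypothetical profile of one network component (all values assumed)."""
    name: str
    param_mem_gb: float       # weights + gradients + optimizer state
    activation_mem_gb: float  # activations for the full global batch
    compute_s: float          # forward + backward time on a single GPU
    grad_sync_s: float        # estimated gradient all-reduce time (data parallel)
    act_transfer_s: float     # estimated activation transfer time (model parallel)


def choose_strategy(c: Component, n_gpus: int, gpu_mem_gb: float) -> str:
    """Return the cheapest strategy whose per-GPU memory estimate fits.

    Toy assumptions: compute divides evenly across GPUs in both modes;
    data parallelism replicates parameters and splits activations;
    model parallelism shards parameters and keeps full activations.
    """
    costs = {}

    # Data parallel: full parameter replica per GPU, batch split across GPUs.
    if c.param_mem_gb + c.activation_mem_gb / n_gpus <= gpu_mem_gb:
        costs["data"] = c.compute_s / n_gpus + c.grad_sync_s

    # Model parallel: parameters sharded, activations cross device boundaries.
    if c.param_mem_gb / n_gpus + c.activation_mem_gb <= gpu_mem_gb:
        costs["model"] = c.compute_s / n_gpus + c.act_transfer_s

    if not costs:
        raise ValueError(f"{c.name} does not fit on {n_gpus} GPUs under either strategy")
    return min(costs, key=costs.get)


# Hypothetical components: a small convolutional stage and a very wide layer.
conv_stage = Component("conv_stage", 0.5, 8.0, 0.80, 0.02, 0.30)
wide_layer = Component("wide_layer", 40.0, 1.0, 0.40, 0.50, 0.05)

plan = {c.name: choose_strategy(c, n_gpus=8, gpu_mem_gb=16.0)
        for c in (conv_stage, wide_layer)}
```

In this sketch the memory check acts as a hard filter and the time estimate breaks ties among feasible strategies, which mirrors the summary's remark that memory considerations drive per-component choices. A real scheduler would also need to account for the cost of switching layouts between adjacent components.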

Paper Structure

This paper contains 24 sections, 2 equations, 6 figures, 1 table, and 1 algorithm.

Figures (6)

  • Figure 1: Training Time Comparison. Bar chart showing training time in hours for Single GPU, Data Parallel, Model Parallel, Hybrid Parallel, and Adaptive approaches for both ResNet-50 and ViT models.
  • Figure 2: Scalability Analysis. Line graph showing speedup vs. number of GPUs (1, 2, 4, 8) for each parallelism strategy.
  • Figure 3: Communication Overhead. Stacked bar chart showing proportion of time spent on computation vs. communication.
  • Figure 4: Convergence Comparison. Line graph showing validation accuracy vs. epochs for different parallelism strategies.
  • Figure 5: Memory Utilization. Bar chart showing peak GPU memory usage.
  • ...and 1 more figure
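A scalability plot of speedup against GPU count, as described for Figure 2, is conventionally built from two quantities: speedup (single-GPU time divided by N-GPU time) and parallel efficiency (speedup divided by N). The sketch below assumes these standard definitions, since the paper's exact formulas are not shown here. The timings are hypothetical placeholders, not results from the paper.

```python
def speedup(t_single: float, t_parallel: float) -> float:
    """Speedup of a parallel run relative to the single-GPU run."""
    return t_single / t_parallel


def parallel_efficiency(t_single: float, t_parallel: float, n_gpus: int) -> float:
    """Fraction of ideal linear scaling achieved on n_gpus devices."""
    return speedup(t_single, t_parallel) / n_gpus


# Hypothetical training times in hours, keyed by GPU count (not from the paper).
hours_by_gpus = {1: 12.0, 2: 6.5, 4: 3.75, 8: 2.5}

t1 = hours_by_gpus[1]
scaling = {
    n: (speedup(t1, t), parallel_efficiency(t1, t, n))
    for n, t in hours_by_gpus.items()
}

for n, (s, e) in scaling.items():
    print(f"{n} GPUs: speedup {s:.2f}x, efficiency {e:.0%}")
```

Efficiency below 100% reflects time lost to communication and synchronization, which is the quantity that a computation-versus-communication breakdown such as Figure 3 makes visible.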