Hardware Scaling Trends and Diminishing Returns in Large-Scale Distributed Training

Jared Fernandez, Luca Wehrstedt, Leonid Shamis, Mostafa Elhoushi, Kalyan Saladi, Yonatan Bisk, Emma Strubell, Jacob Kahn

TL;DR

This work analyzes how hardware scaling affects large-scale distributed training of transformer-based models. By empirically evaluating data, sharded data, and model parallelism across varying hardware configurations, model sizes, and context lengths, it demonstrates that communication overhead increasingly binds performance as scale grows. The findings show that model-parallel approaches can mitigate FSDP bottlenecks and that simply adding more accelerators yields diminishing returns in power efficiency and throughput. The study provides practical guidance on compute- and communication-aware scaling, highlighting the need for balanced hardware upgrades and alternative training paradigms to sustain efficiency at very large scales.

Abstract

Dramatic increases in the capabilities of neural network models in recent years are driven by scaling model size, training data, and corresponding computational resources. To develop the exceedingly large networks required in modern applications, such as large language models (LLMs), model training is distributed across tens of thousands of hardware accelerators (e.g. GPUs), requiring orchestration of computation and communication across large computing clusters. In this work, we demonstrate that careful consideration of hardware configuration and parallelization strategy is critical for effective (i.e. compute- and cost-efficient) scaling of model size, training data, and total computation. We conduct an extensive empirical study of the performance of large-scale LLM training workloads across model sizes, hardware configurations, and distributed parallelization strategies. We demonstrate that: (1) beyond certain scales, the overhead incurred by certain distributed communication strategies causes parallelization strategies previously thought to be sub-optimal to become preferable; and (2) scaling the total number of accelerators for large model training quickly yields diminishing returns even when hardware and parallelization strategies are properly optimized, implying poor marginal performance per additional unit of power or GPU-hour.

Paper Structure

This paper contains 30 sections, 14 figures, 1 table.

Figures (14)

  • Figure 1: Communication overhead is minimal on fewer than 32 nodes, but as it grows with scale, FSDP sees diminishing returns in power efficiency, with a reduction of over 30% at scale.
  • Figure 2: Bandwidth measurements in GB per second of NCCL primitives on DGX H100 servers with eight GPUs per node, connected with InfiniBand, across world sizes from 4 to 512 nodes.
  • Figure 3: In FSDP training of Llama-7B, scaling the number of nodes and data-parallel replicas reduces hardware utilization and power efficiency, because exposed communication increases as communication kernels grow relative to fixed-size computation kernels. Global throughput scales sub-linearly even though total power utilization increases approximately linearly with the number of nodes. "Ideal Hardware Scaling" corresponds to the expected throughput assuming additional accelerators yield linear increases in throughput.
  • Figure 4: The relative execution time of both AllGather and ReduceScatter collectives scales with hardware world size.
  • Figure 5: Training with Fixed Global Batch Size Over Increasing Number of Nodes. We select the optimal parallelization strategy as determined by the experimental results displayed in Figure \ref{fig:mp-pp} for configurations of up to 32 H100 nodes to train with a global batch size of 32. Even with optimal parallelization strategies, local throughput and hardware utilization decline with world size.
  • ...and 9 more figures
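
The "Ideal Hardware Scaling" baseline in the Figure 3 caption, and the abstract's notion of marginal performance per additional accelerator, can be made concrete with a small calculation. The sketch below is illustrative only: the function names are hypothetical, and the node counts and throughput values are made-up placeholders, not measurements from the paper.

```python
def ideal_throughput(base_throughput, base_nodes, nodes):
    """Throughput expected if every added node contributed linearly."""
    return base_throughput * nodes / base_nodes


def scaling_efficiency(measured, base_throughput, base_nodes, nodes):
    """Fraction of the ideal (linear) throughput that was actually achieved."""
    return measured / ideal_throughput(base_throughput, base_nodes, nodes)


def marginal_throughput_per_node(nodes, throughputs):
    """Throughput gained per added node between consecutive configurations."""
    gains = []
    for i in range(1, len(nodes)):
        added_nodes = nodes[i] - nodes[i - 1]
        gains.append((throughputs[i] - throughputs[i - 1]) / added_nodes)
    return gains


# Hypothetical placeholder values (arbitrary throughput units), NOT paper data.
nodes = [1, 2, 4, 8]
throughputs = [100.0, 190.0, 350.0, 600.0]

efficiencies = [
    scaling_efficiency(t, throughputs[0], nodes[0], n)
    for n, t in zip(nodes, throughputs)
]
marginals = marginal_throughput_per_node(nodes, throughputs)

print("scaling efficiency per configuration:", efficiencies)
print("marginal throughput per added node:", marginals)
```

A scaling efficiency that falls below 1.0 as nodes are added, together with a shrinking marginal gain per node, is the pattern the paper describes as diminishing returns. Dividing the same gains by measured power draw instead of node count would give the analogous per-unit-of-power view.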