Hardware Scaling Trends and Diminishing Returns in Large-Scale Distributed Training
Jared Fernandez, Luca Wehrstedt, Leonid Shamis, Mostafa Elhoushi, Kalyan Saladi, Yonatan Bisk, Emma Strubell, Jacob Kahn
TL;DR
This work analyzes how hardware scaling affects large-scale distributed training of transformer-based models. By empirically evaluating data parallelism, sharded data parallelism, and model parallelism across varying hardware configurations, model sizes, and context lengths, it demonstrates that communication overhead increasingly limits performance as scale grows. The findings show that model-parallel approaches can mitigate the communication bottlenecks of Fully Sharded Data Parallel (FSDP) training and that simply adding more accelerators yields diminishing returns in power efficiency and throughput. The study provides practical guidance on compute- and communication-aware scaling, highlighting the need for balanced hardware upgrades and alternative training paradigms to sustain efficiency at very large scales.
Abstract
Dramatic increases in the capabilities of neural network models in recent years are driven by scaling model size, training data, and corresponding computational resources. To develop the exceedingly large networks required in modern applications, such as large language models (LLMs), model training is distributed across tens of thousands of hardware accelerators (e.g., GPUs), requiring orchestration of computation and communication across large computing clusters. In this work, we demonstrate that careful consideration of hardware configuration and parallelization strategy is critical for effective (i.e., compute- and cost-efficient) scaling of model size, training data, and total computation. We conduct an extensive empirical study of the performance of large-scale LLM training workloads across model sizes, hardware configurations, and distributed parallelization strategies. We demonstrate that: (1) beyond certain scales, the overhead incurred by certain distributed communication strategies causes parallelization strategies previously thought to be sub-optimal to become preferable; and (2) scaling the total number of accelerators for large model training quickly yields diminishing returns even when hardware and parallelization strategies are properly optimized, implying poor marginal performance per additional unit of power or GPU-hour.
