Biased Local SGD for Efficient Deep Learning on Heterogeneous Systems
Jihyun Lim, Junhyuk Jo, Chanhyeok Ko, Young Min Go, Jimin Hwa, Sunwoo Lee
TL;DR
This work tackles synchronization bottlenecks in parallel deep learning on heterogeneous CPU/GPU systems by proposing system-aware biased local SGD. It combines unbalanced local updates (fast devices perform more steps than slow ones) with two forms of injected bias: loss-based top-$k$ data sampling on fast devices, and biased, weight-aware model aggregation. Together these are intended to preserve convergence under IID data. The theoretical analysis establishes convergence of biased local SGD by bounding the average gradient norm, and experiments show large wall-clock time reductions (up to 32x) with comparable or improved accuracy across CV and NLP benchmarks. The approach offers a practical way to fully utilize diverse compute resources in real-world deep learning deployments.
Abstract
Most parallel neural network training methods assume homogeneous computing resources. For example, synchronous data-parallel SGD suffers from significant synchronization overhead under heterogeneous workloads, often forcing practitioners to rely only on the fastest devices (e.g., GPUs). In this work, we study local SGD for efficient parallel training on heterogeneous systems. We show that intentionally introducing bias in data sampling and model aggregation can effectively harmonize slower CPUs with faster GPUs. Our extensive empirical results demonstrate that a carefully controlled bias significantly accelerates local SGD while achieving comparable or even higher accuracy than synchronous SGD under the same epoch budget. For instance, our method trains ResNet20 on CIFAR-10 with 2 CPUs and 8 GPUs up to 32x faster than synchronous SGD, with nearly identical accuracy. These results provide practical insights into how to flexibly utilize diverse compute resources for deep learning.
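To make the three ingredients named above concrete (unbalanced local updates, loss-based top-$k$ sampling, and weighted model aggregation), here is a minimal NumPy sketch on a toy least-squares problem. It is an illustration under stated assumptions, not the authors' implementation: the function names, the device configuration, the learning rate, and in particular the rule of weighting each local model by its number of local steps are all hypothetical choices made for this example. The paper's actual sampling and aggregation rules may differ.

```python
import numpy as np


def top_k_by_loss(per_sample_loss, k):
    """Return the indices of the k samples with the largest loss."""
    k = min(k, per_sample_loss.shape[0])
    return np.argsort(per_sample_loss)[::-1][:k]  # sort descending, keep first k


def local_sgd_steps(w, X, y, steps, lr, batch_size, k, rng):
    """Run `steps` local SGD steps on a least-squares objective.

    If k is not None, each step keeps only the k highest-loss samples of
    the drawn mini-batch (the biased sampling); otherwise the full
    mini-batch is used.
    """
    w = w.copy()
    for _ in range(steps):
        idx = rng.choice(len(X), size=batch_size, replace=False)
        residual = X[idx] @ w - y[idx]
        if k is not None:
            keep = top_k_by_loss(residual ** 2, k)
            idx, residual = idx[keep], residual[keep]
        grad = 2.0 * X[idx].T @ residual / len(idx)
        w -= lr * grad
    return w


def weighted_average(models, weights):
    """Aggregate local models with normalized, non-uniform weights."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return sum(a * m for a, m in zip(weights, models))


def run_demo(seed=0, rounds=20):
    """Simulate two fast devices and one slow device on synthetic IID data."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(256, 4))
    w_true = np.array([1.0, -2.0, 0.5, 3.0])
    y = X @ w_true

    # Hypothetical devices: (local steps per round, top-k size or None).
    # Fast devices take more steps and use top-k sampling; the slow one does not.
    devices = [(8, 8), (8, 8), (2, None)]

    def mse(w):
        return float(np.mean((X @ w - y) ** 2))

    w = np.zeros(4)
    initial = mse(w)
    for _ in range(rounds):
        local_models = [
            local_sgd_steps(w, X, y, steps, lr=0.02, batch_size=16, k=k, rng=rng)
            for steps, k in devices
        ]
        # Assumed aggregation rule: weight each model by its local step count.
        w = weighted_average(local_models, [steps for steps, _ in devices])
    return initial, mse(w)


if __name__ == "__main__":
    initial_loss, final_loss = run_demo()
    print(f"initial MSE: {initial_loss:.4f}, final MSE: {final_loss:.6f}")
```

In this sketch every device finishes a round at roughly the same wall-clock time because the slow device is assigned fewer steps, so no device idles at the synchronization point. The top-$k$ filter and the non-uniform averaging are where the intentional bias enters.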
