Biased Local SGD for Efficient Deep Learning on Heterogeneous Systems

Jihyun Lim, Junhyuk Jo, Chanhyeok Ko, Young Min Go, Jimin Hwa, Sunwoo Lee

TL;DR

This work tackles synchronization bottlenecks in parallel deep learning on heterogeneous CPU/GPU systems by proposing system-aware biased local SGD. It combines unbalanced local updates (fast devices perform more steps than slow ones) with two bias injections—loss-based top-$k$ data sampling for fast devices and biased, weight-aware model aggregation—to preserve convergence under IID data. Theoretical guarantees show convergence of biased local SGD with a bound on the average gradient norm, while experiments demonstrate large wall-clock time reductions (up to ~32x) with comparable or improved accuracy across CV and NLP benchmarks. The approach offers a practical pathway to fully utilize diverse compute resources in real-world deep learning deployments.
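The three ingredients named above (unbalanced local updates, loss-based top-$k$ sampling on the fast device, and biased aggregation) can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a 1-D least-squares problem, a single fast and a single slow worker simulated sequentially, and a hypothetical aggregation weight `alpha` that favors the fast model. The names `local_sgd`, `biased_round`, `top_k`, and `alpha` are illustrative and do not come from the paper.

```python
import random

def loss_and_grad(w, x, y):
    """Squared error of the 1-D model y ~ w * x and its gradient w.r.t. w."""
    err = w * x - y
    return 0.5 * err * err, err * x

def local_sgd(w, data, steps, lr, batch, top_k=None, rng=random):
    """Run `steps` local SGD updates starting from w and return the result.

    If top_k is given, each step draws `batch` candidates and keeps only the
    top_k with the highest current loss (an assumed form of loss-based
    top-k sampling). Otherwise the whole mini-batch is used.
    """
    for _ in range(steps):
        samples = rng.sample(data, batch)
        if top_k is not None:
            samples = sorted(
                samples,
                key=lambda s: loss_and_grad(w, s[0], s[1])[0],
                reverse=True,
            )[:top_k]
        grad = sum(loss_and_grad(w, x, y)[1] for x, y in samples) / len(samples)
        w -= lr * grad
    return w

def biased_round(w, data, tau_f, tau_s, alpha, lr=0.5, batch=8, top_k=4, rng=random):
    """One communication round with unbalanced local updates.

    The fast worker takes tau_f steps with top-k sampling, the slow worker
    takes tau_s plain steps, and the two models are merged with weight alpha
    on the fast model (hypothetical aggregation rule).
    """
    x_f = local_sgd(w, data, tau_f, lr, batch, top_k=top_k, rng=rng)
    x_s = local_sgd(w, data, tau_s, lr, batch, rng=rng)
    return alpha * x_f + (1.0 - alpha) * x_s

# Synthetic noise-free data with true weight 3.0 (illustrative only).
rng = random.Random(0)
data = [(x, 3.0 * x) for x in (rng.uniform(-1.0, 1.0) for _ in range(256))]

w = 0.0
for _ in range(20):
    w = biased_round(w, data, tau_f=10, tau_s=1, alpha=0.8, rng=rng)
```

In this sketch, setting `alpha = 0.5` and `top_k = None` on both workers with `tau_f == tau_s` would reduce a round to ordinary balanced local SGD with plain averaging, which is a convenient baseline for comparison.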

Abstract

Most parallel neural network training methods assume homogeneous computing resources. For example, synchronous data-parallel SGD suffers from significant synchronization overhead under heterogeneous workloads, often forcing practitioners to rely only on the fastest devices (e.g., GPUs). In this work, we study local SGD for efficient parallel training on heterogeneous systems. We show that intentionally introducing bias in data sampling and model aggregation can effectively harmonize slower CPUs with faster GPUs. Our extensive empirical results demonstrate that a carefully controlled bias significantly accelerates local SGD while achieving comparable or even higher accuracy than synchronous SGD under the same epoch budget. For instance, our method trains ResNet20 on CIFAR-10 with 2 CPUs and 8 GPUs up to 32x faster than synchronous SGD, with nearly identical accuracy. These results provide practical insights into how to flexibly utilize diverse compute resources for deep learning.

Paper Structure

This paper contains 18 sections, 2 theorems, 20 equations, 4 figures, 7 tables, 1 algorithm.

Key Result

Lemma 1

(framework) Under Assumptions 1 to 3, if the learning rate satisfies $\eta \leq \frac{1}{\mathcal{L}\tau}$, we have

Figures (4)

  • Figure 1: Timing breakdown of a single training step, measured on an AMD EPYC 7543 CPU and two NVIDIA GeForce RTX 4090 GPUs. CPU training is an order of magnitude slower than GPU training. The resulting timing gap causes GPUs to remain idle, representing the synchronization cost.
  • Figure 2: Schematic illustration of the system-aware biased local SGD framework. $\tau_F$ and $\tau_S$ are the numbers of local updates per communication round on fast and slow resources, respectively. $x_F$ and $x_S$ are the models locally trained on fast and slow resources (e.g., GPU and CPU), respectively.
  • Figure 3: Learning curve comparison. Thanks to reduced communication and synchronization costs, our proposed method completes the same number of training rounds much faster than synchronous SGD and balanced local SGD.
  • Figure 4: Feature-wise accuracy comparison with various combinations of fast and slow resources.
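
The timing gap in Figure 1 and the step counts $\tau_F$ and $\tau_S$ in Figure 2 can be related by simple arithmetic. The sketch below uses hypothetical per-step times, not measurements from the paper, and shows one plausible way to pick $\tau_F$ so that a fast device stays busy while a slow device finishes its $\tau_S$ steps. The helper names `idle_fraction` and `balanced_tau_fast` are illustrative, and the paper may choose these values differently.

```python
def idle_fraction(t_fast, t_slow):
    """Fraction of one synchronous step that a fast device spends waiting.

    Under synchronous SGD every device advances at the pace of the slowest
    one, so the fast device is busy for t_fast out of every t_slow seconds.
    """
    return 1.0 - t_fast / t_slow

def balanced_tau_fast(t_fast, t_slow, tau_slow=1):
    """Largest number of fast local steps that fits within tau_slow slow steps."""
    return max(1, int((tau_slow * t_slow) // t_fast))

# Hypothetical per-step times in seconds, roughly an order of magnitude apart.
t_gpu, t_cpu = 0.02, 0.25

sync_idle = idle_fraction(t_gpu, t_cpu)
tau_f = balanced_tau_fast(t_gpu, t_cpu, tau_slow=1)
print(f"GPU idle share under synchronous SGD: {sync_idle:.2f}")
print(f"fast steps per round for tau_S = 1: {tau_f}")
```

With these assumed times the fast device would sit idle for most of each synchronous step, which is the synchronization cost that unbalanced local updates are meant to recover.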

Theorems & Definitions (4)

  • Lemma 1
  • proof
  • Theorem 1
  • proof