MiCRO: Near-Zero Cost Gradient Sparsification for Scaling and Accelerating Distributed DNN Training

Daegun Yoon, Sangyoon Oh

TL;DR

This work tackles the communication bottleneck in distributed DNN training by proposing MiCRO, a near-zero cost gradient sparsification method. MiCRO splits the gradient vector into exclusive partitions, one per worker, and applies threshold-based selection within each partition. A threshold-scaling step minimises the compression ratio error so that the actual density tracks the user-specified density. Because the partitions are exclusive, the aggregated gradients are free from gradient build-up, and threshold-based selection avoids the cost of sorting the gradient vector. The approach yields faster convergence and lower end-to-end training time across multiple models and datasets, and it scales well as the number of workers grows. Overall, MiCRO outperforms state-of-the-art sparsifiers in both convergence and throughput.

Abstract

Gradient sparsification is a communication optimisation technique for scaling and accelerating distributed deep neural network (DNN) training. It reduces the growing communication traffic required for gradient aggregation. However, existing sparsifiers scale poorly because of the high computational cost of gradient selection, an increase in communication traffic, or both. In particular, the increase in communication traffic is caused by gradient build-up and by an inappropriate threshold for gradient selection. To address these challenges, we propose a novel gradient sparsification method called MiCRO. In MiCRO, the gradient vector is partitioned, and each partition is assigned to the corresponding worker. Each worker then selects gradients only from its own partition, so the aggregated gradients are free from gradient build-up. Moreover, MiCRO estimates an accurate threshold that keeps the communication traffic at the level the user requires by minimising the compression ratio error. MiCRO enables near-zero cost gradient sparsification by solving the existing problems that hinder the scalability and acceleration of distributed DNN training. In our extensive experiments, MiCRO outperformed state-of-the-art sparsifiers with an outstanding convergence rate.
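
The abstract names three ingredients: exclusive partitions, threshold-based selection inside each partition, and a threshold that is adjusted until the actual density matches the user-specified one. The following is a minimal single-process sketch of that idea, not the authors' implementation. All function names, the Gaussian stand-in gradients, the single shared threshold, and the multiplicative update with `factor=1.02` are assumptions made for illustration; the abstract does not specify the exact scaling rule.

```python
import random


def partition_bounds(num_grads, num_workers, rank):
    """Return the [start, end) index range owned exclusively by worker `rank`."""
    base, extra = divmod(num_grads, num_workers)
    start = rank * base + min(rank, extra)
    return start, start + base + (1 if rank < extra else 0)


def select(grads, start, end, threshold):
    """Threshold-based selection restricted to one partition (no sorting)."""
    return [i for i in range(start, end) if abs(grads[i]) >= threshold]


def scale_threshold(threshold, num_selected, target, factor=1.02):
    """Hypothetical update rule: raise the threshold when too many gradients
    were selected, lower it when too few were selected."""
    if num_selected > target:
        return threshold * factor
    if num_selected < target:
        return threshold / factor
    return threshold


def simulate(num_grads=2000, num_workers=4, density=0.05, steps=150, seed=0):
    """Run the sketch on synthetic gradients and return (threshold, density)."""
    rng = random.Random(seed)
    threshold = 1.0                     # arbitrary starting point (assumption)
    target = density * num_grads        # number of gradients the user asks for
    actual_density = 0.0
    for _ in range(steps):
        selected = []
        for rank in range(num_workers):
            # Each worker holds its own gradient vector (random stand-in here)
            # but scans only the partition it owns.
            grads = [rng.gauss(0.0, 1.0) for _ in range(num_grads)]
            start, end = partition_bounds(num_grads, num_workers, rank)
            selected.extend(select(grads, start, end, threshold))
        # Partitions are disjoint, so no index is selected by two workers.
        assert len(selected) == len(set(selected))
        actual_density = len(selected) / num_grads
        threshold = scale_threshold(threshold, len(selected), target)
    return threshold, actual_density
```

Calling `simulate()` returns the final threshold and the density measured in the last step. In this sketch the measured density settles near the requested `density`, and the in-loop assertion checks that the selected indices never overlap across workers, which is the property that keeps the aggregated result free from gradient build-up.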

Paper Structure

This paper contains 15 sections, 1 equation, 8 figures, 2 tables, 1 algorithm.

Figures (8)

  • Figure 1: Challenges in scalable and accelerated gradient sparsification: (a) high computational cost due to gradient vector sorting in sorting-based sparsifiers; (b) high communication cost due to an inappropriate threshold in threshold-based sparsifiers. Both types of sparsifiers cause gradient build-up. All experiments were conducted using $d=0.01$ and $n \in \{2,4,8,16\}$ with ResNet-18 on CIFAR-10.
  • Figure 2: Error minimisation performance of sparsifiers on 16 GPUs. The Y-axis indicates the error, which is the average of local errors of workers.
  • Figure 3: Overview of MiCRO.
  • Figure 4: Convergence performance of sparsifiers on 16 GPUs. All experiments were conducted over 200 epochs.
  • Figure 5: Sparsification performance of sparsifiers on 16 GPUs. The Y-axis indicates the actual density measured over training iterations.
  • ...and 3 more figures