Beyond Throughput and Compression Ratios: Towards High End-to-end Utility of Gradient Compression

Wenchen Han, Shay Vargaftik, Michael Mitzenmacher, Brad Karp, Ran Ben Basat

TL;DR

This work revisits common gradient compression approaches (sparsification, quantization, and low-rank decomposition) and demonstrates how accounting for computational overhead, all-reduce compatibility, and end-to-end evaluation against a 16-bit baseline can lead to minor but strategic design changes, resulting in notably better performance.

Abstract

Gradient aggregation has long been identified as a major bottleneck in today's large-scale distributed machine learning training systems. One promising solution to mitigate such bottlenecks is gradient compression, directly reducing communicated gradient data volume. However, in practice, many gradient compression schemes do not achieve acceleration of the training process while also preserving accuracy. In this work, we identify common issues in previous gradient compression systems and evaluation methodologies. These include excessive computational overheads; incompatibility with all-reduce; and insufficient evaluation methods, such as not using an end-to-end metric or using a 32-bit baseline instead of the stronger 16-bit baseline. We revisit common compression approaches (sparsification, quantization, and low-rank decomposition) and demonstrate how considering the above issues can lead to minor but strategic design changes, resulting in notably better performance. Our goal is to raise awareness of the need for design and evaluation standards that naturally translate to the end-to-end utility of gradient compression.
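One way to see why sparsification can be at odds with all-reduce is to count how many coordinates are nonzero once per-worker TopK selections are summed. The sketch below is only an illustration under stated assumptions, not the paper's method: the function names are hypothetical, and each worker's gradient is drawn as independent Gaussian noise, whereas real gradients across workers are correlated. It shows that the set of selected indices grows with the number of workers, so the aggregated message does not keep a fixed size of `k` entries.

```python
import random


def top_k_indices(grad, k):
    """Return the set of indices of the k largest-magnitude entries of grad."""
    order = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    return set(order[:k])


def union_size_across_workers(num_workers, dim, k, seed=0):
    """Count coordinates that any worker selects with its own local TopK.

    Assumption (illustrative only): every worker's gradient is i.i.d.
    Gaussian noise, so the selected index sets barely overlap.
    """
    rng = random.Random(seed)
    union = set()
    for _ in range(num_workers):
        grad = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        union |= top_k_indices(grad, k)
    return len(union)


if __name__ == "__main__":
    # Print how the number of distinct selected coordinates grows with workers.
    for workers in (1, 2, 4, 8):
        print(workers, union_size_across_workers(workers, dim=10_000, k=100))
```

With a single worker the union has exactly `k` entries; with more workers it lies between `k` and `num_workers * k`. A scheme in which all workers agree on the same coordinates keeps the payload fixed and can be summed element-wise.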

Paper Structure (16 sections, 3 figures, 9 tables)

Figures (3)

  • Figure 1: The time to accuracy (TTA, rolling average) of our TopK Chunked (TopKC) solution compared with TopK and the baselines. The dashed lines indicate the converged perplexity/accuracy for Baseline FP16 and Baseline FP32, respectively. The training of each method stops after a given number of epochs (and not hours) after convergence.
  • Figure 2: The TTA of THC's simple adaptation to all-reduce compared with THC adding saturation and partial rotation.
  • Figure 3: The TTA of PowerSGD, altering the matrix rank $r$.