Activations and Gradients Compression for Model-Parallel Training

Mikhail Rudakov; Aleksandr Beznosikov; Yaroslav Kholodov; Alexander Gasnikov

TL;DR

This work explores how simultaneous compression of activations and gradients in a model-parallel distributed training setup affects convergence, analyzes compression methods such as quantization and TopK, and experiments with error compensation techniques.

Abstract

Large neural networks require enormous computational clusters of machines. Model-parallel training, in which the model architecture is partitioned sequentially across workers, is a popular approach for training modern models. Compression can be applied to reduce communication time between workers, which is often a bottleneck in such systems. This work explores how simultaneous compression of activations and gradients in a model-parallel distributed training setup affects convergence. We analyze compression methods such as quantization and TopK compression, and also experiment with error compensation techniques. Moreover, we combine TopK with the AQ-SGD per-batch error feedback approach. We conduct experiments on image classification and language model fine-tuning tasks. Our findings demonstrate that gradients require milder compression rates than activations. We observe that $K=10\%$ is the lowest TopK compression level that does not severely harm model convergence. Experiments also show that models trained with TopK perform well only when compression is also applied during inference. We find that error feedback techniques do not improve model-parallel training compared to plain compression, but they allow model inference without compression with almost no quality drop. Finally, when applied with the AQ-SGD approach, TopK compression stronger than $K=30\%$ worsens model performance significantly.
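As a rough illustration of the two compression families named in the abstract, the sketch below implements TopK sparsification and a uniform quantizer in plain Python. The function names, the rounding of $K$ to an entry count, and the min-max quantization grid are assumptions made for illustration; the abstract does not specify the paper's exact operators.

```python
def topk_compress(values, k_fraction):
    """Keep the k_fraction largest-magnitude entries and zero out the rest."""
    if not 0.0 < k_fraction <= 1.0:
        raise ValueError("k_fraction must be in (0, 1]")
    # Assumption: K is a fraction of the entries, rounded, with at least one kept.
    keep = max(1, round(k_fraction * len(values)))
    # Sort indices by decreasing magnitude; ties go to the lower index.
    order = sorted(range(len(values)), key=lambda i: (-abs(values[i]), i))
    kept = set(order[:keep])
    return [v if i in kept else 0.0 for i, v in enumerate(values)]


def quantize_dequantize(values, bits):
    """Round each entry to the nearest point of a uniform grid with 2**bits levels.

    Assumption: the grid spans [min(values), max(values)]; other schemes exist.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)  # constant input: nothing to quantize
    step = (hi - lo) / (2 ** bits - 1)
    return [lo + round((v - lo) / step) * step for v in values]


activations = [0.1, -5.0, 2.0, 0.3, -0.2, 4.0, 0.0, 1.0, -0.5, 0.05]
sparse = topk_compress(activations, 0.3)       # 3 of 10 entries survive
coarse = quantize_dequantize(activations, 4)   # 16 grid levels
```

With `k_fraction=0.3`, only the three largest-magnitude entries (`-5.0`, `4.0`, `2.0`) are transmitted as nonzero values. In a real system the sparse result would be sent as index-value pairs and the quantized result as integer codes plus the grid parameters.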

Paper Structure (19 sections, 6 figures, 5 tables)

Introduction
Contributions
Methodology
Experiment Design: Model-Parallel Training with Compression
Quantization
TopK compression
Error Feedback
AQ-SGD compression
Results
Training ResNet18 on CIFAR-10
Quantization
TopK Compression
Compression with Error Feedback
AQ-SGD with TopK compression
Fine-tuning GPT-2 on Wikitext
...and 4 more sections

Figures (6)

Figure 1: Model-parallel training example. The model parallelism degree is two, with one compression block. Activations are compressed in the forward pass, and gradients are compressed in the backward pass.
Figure 2: Quantization experiments convergence on ResNet18 and CIFAR-10. fw[A]-bw[B] means compressing activations to A bits, gradients to B bits. Each line is average $\pm$ standard error over 5 runs.
Figure 3: TopK experiments convergence on ResNet18 and CIFAR-10. Each line is average $\pm$ standard error over 5 runs. Model-parallel degree is 4, with 3 compression operations used. Activations and gradients are compressed independently.
Figure 4: Error feedback experiments convergence on ResNet18 and CIFAR-10. Model-parallel degree is 4, with 3 compression operations used. Activations and gradients are compressed independently, with global EF batch buffer for each compression operator. Runs with suffix base20 use uncompressed baseline weights after 20 epochs.
Figure 5: AQ-SGD and TopK experiments convergence on ResNet18 and CIFAR-10. Model-parallel degree is 4, with 3 compression operations used. Activations and gradients are compressed independently, with AQ-SGD per-example buffer applied only for activations. Runs with suffix base10 use uncompressed baseline weights after 10 epochs.
...and 1 more figures
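
The captions of Figures 4 and 5 mention two buffering schemes: a global error feedback (EF) buffer per compression operator, and an AQ-SGD per-example buffer applied to activations. The sketch below shows one common way such buffers can be written. The class names, the `topk` helper, and the exact update rules (including sending the first visit of an example uncompressed) are assumptions for illustration and may differ from the paper's implementation.

```python
def topk(values, k_fraction):
    """Hypothetical helper: keep the largest-magnitude entries, zero the rest."""
    keep = max(1, round(k_fraction * len(values)))
    order = sorted(range(len(values)), key=lambda i: (-abs(values[i]), i))
    kept = set(order[:keep])
    return [v if i in kept else 0.0 for i, v in enumerate(values)]


class ErrorFeedback:
    """Classic error feedback: re-inject what earlier compressions dropped."""

    def __init__(self, compressor):
        self.compressor = compressor
        self.residual = None  # one global buffer for this compression operator

    def __call__(self, values):
        if self.residual is None:
            self.residual = [0.0] * len(values)
        corrected = [v + r for v, r in zip(values, self.residual)]
        sent = self.compressor(corrected)
        # Remember the part that was not transmitted this step.
        self.residual = [c - s for c, s in zip(corrected, sent)]
        return sent


class PerExampleDeltaBuffer:
    """AQ-SGD-style buffer: compress the change since an example was last seen."""

    def __init__(self, compressor):
        self.compressor = compressor
        self.buffers = {}  # example id -> last reconstructed activations

    def __call__(self, example_id, activations):
        if example_id not in self.buffers:
            # Assumption: the first visit is transmitted without compression.
            self.buffers[example_id] = list(activations)
        else:
            previous = self.buffers[example_id]
            delta = self.compressor([a - p for a, p in zip(activations, previous)])
            self.buffers[example_id] = [p + d for p, d in zip(previous, delta)]
        # The receiver works with the reconstructed activations.
        return list(self.buffers[example_id])


ef = ErrorFeedback(lambda v: topk(v, 0.5))
first = ef([4.0, 3.0])    # 3.0 is dropped and stored in the residual
second = ef([4.0, 3.0])   # the stored 3.0 is added back before compressing

aq = PerExampleDeltaBuffer(lambda v: topk(v, 0.5))
seen_once = aq(7, [1.0, 2.0])
seen_twice = aq(7, [1.5, 4.0])  # only the larger change (2.0 -> 4.0) is sent
```

In this toy run the EF wrapper transmits `[4.0, 0.0]` and then `[0.0, 6.0]`, so the entry dropped in the first step is delivered in the second. The per-example buffer reconstructs `[1.0, 4.0]` on the second visit because only the largest component of the change passes through TopK.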
