Layer-wise Quantization for Quantized Optimistic Dual Averaging

Anh Duc Nguyen; Ilia Markov; Frank Zhengqing Wu; Ali Ramezani-Kebrya; Kimon Antonakopoulos; Dan Alistarh; Volkan Cevher

TL;DR

This paper develops a general, unbiased layer-wise quantization framework with tight variance and code-length bounds and applies it to a distributed solver for variational inequalities (VIs), Quantized Optimistic Dual Averaging (QODA). The framework accommodates statistical heterogeneity across layers by allowing multiple per-layer quantization types, and QODA pairs it with adaptive learning rates to achieve competitive convergence while improving communication efficiency. The theoretical guarantees comprise layer-wise variance bounds and joint communication-convergence rates under absolute and relative noise, while experiments demonstrate substantial practical speedups in Wasserstein GAN training and improved compression for Transformer-XL. Overall, the work enables efficient layer-aware compression for distributed optimization and adversarial training, with extensions to non-monotone VIs as a promising direction.

Abstract

Modern deep neural networks exhibit heterogeneity across numerous layers of various types, such as residuals and multi-head attention, due to varying structures (dimensions, activation functions, etc.) and distinct representation characteristics, which impact predictions. We develop a general layer-wise quantization framework with tight variance and code-length bounds that adapts to these heterogeneities over the course of training. We then apply a new layer-wise quantization technique within distributed variational inequalities (VIs), proposing a novel Quantized Optimistic Dual Averaging (QODA) algorithm with adaptive learning rates, which achieves competitive convergence rates for monotone VIs. We empirically show that QODA achieves up to a $150\%$ speedup over the baselines in end-to-end training time for training Wasserstein GAN on $12+$ GPUs.
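To make the idea of unbiased, layer-wise quantization concrete, the following is a minimal sketch and not the scheme defined in the paper. It assumes a generic stochastic uniform quantizer: each layer is normalized by its own maximum magnitude and randomly rounded onto a uniform grid, with a separate number of levels per layer. The names `quantize_layer`, `quantize_model`, and `levels_per_layer` are hypothetical and chosen only for illustration.

```python
import random


def quantize_layer(values, num_levels, rng):
    """Stochastically round one layer onto a uniform grid (illustrative only).

    Entries are scaled by the layer's maximum magnitude and rounded to one of
    the two neighbouring grid points, rounding up with probability equal to
    the fractional position, so the expected output equals the input.
    """
    scale = max(abs(v) for v in values)
    if scale == 0.0:
        return list(values)  # an all-zero layer needs no quantization
    out = []
    for v in values:
        pos = abs(v) / scale * num_levels  # position in [0, num_levels]
        low = int(pos)                     # index of the lower grid point
        frac = pos - low                   # distance above the lower grid point
        idx = low + (1 if rng.random() < frac else 0)
        sign = 1.0 if v >= 0 else -1.0
        out.append(sign * scale * idx / num_levels)
    return out


def quantize_model(layers, levels_per_layer, seed=0):
    """Quantize every layer with its own grid resolution (assumed interface)."""
    rng = random.Random(seed)
    return {
        name: quantize_layer(vals, levels_per_layer[name], rng)
        for name, vals in layers.items()
    }


# Hypothetical toy "model": two layers with different value ranges.
layers = {"attention": [0.5, -0.25, 0.1], "residual": [2.0, -1.0, 0.0]}
levels_per_layer = {"attention": 16, "residual": 4}
quantized = quantize_model(layers, levels_per_layer)
```

Giving each layer its own scale and level count is what lets the quantizer respond to heterogeneity across layers: a layer with a wide value range or higher sensitivity can be given a finer grid, while other layers are compressed more aggressively.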
