Layer-wise Quantization for Quantized Optimistic Dual Averaging

Anh Duc Nguyen, Ilia Markov, Frank Zhengqing Wu, Ali Ramezani-Kebrya, Kimon Antonakopoulos, Dan Alistarh, Volkan Cevher

TL;DR

This paper develops a general, unbiased layer-wise quantization framework with tight variance and code-length bounds and applies it to a distributed variational inequality (VI) solver, Quantized Optimistic Dual Averaging (QODA). The framework accommodates statistical heterogeneity across layers by allowing a different quantization type per layer, and QODA attains competitive convergence rates with adaptive learning rates while reducing communication cost. The theory provides layer-wise variance bounds and joint communication-convergence rates under absolute and relative noise, while experiments show substantial practical speedups in Wasserstein GAN training and improved compression for Transformer-XL. Overall, the work enables efficient layer-aware compression for distributed optimization and adversarial training; extending the analysis to non-monotone VIs is a promising direction.

Abstract

Modern deep neural networks exhibit heterogeneity across their many layers of various types (residual, multi-head attention, etc.), owing to differing structures (dimensions, activation functions, etc.) and distinct representation characteristics, all of which affect predictions. We develop a general layer-wise quantization framework with tight variance and code-length bounds that adapts to this heterogeneity over the course of training. We then apply the new layer-wise quantization technique to distributed variational inequalities (VIs), proposing a novel Quantized Optimistic Dual Averaging (QODA) algorithm with adaptive learning rates, which achieves competitive convergence rates for monotone VIs. We empirically show that QODA achieves up to a $150\%$ speedup over the baselines in end-to-end training time when training a Wasserstein GAN on $12+$ GPUs.
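To make the idea of layer-wise quantization concrete, here is a minimal Python sketch, not the authors' implementation. It assumes each layer is a flat vector tagged with a layer type, that each type has its own sorted level set in $[0, 1]$, and that coordinates are normalized by the layer's $L^q$ norm and then randomly rounded to a neighboring level so that the result is unbiased. The names `quantize_layer`, `quantize_layerwise`, `levels_by_type`, and the example layer names and level sets are all hypothetical.

```python
import numpy as np

def quantize_layer(v, levels, q=2, rng=None):
    """Unbiased stochastic quantization of one layer's vector.

    `levels` is a sorted 1-D array that starts at 0 and ends at 1
    (an illustrative assumption, not the paper's level sets).
    """
    rng = np.random.default_rng() if rng is None else rng
    norm = np.linalg.norm(v, ord=q)            # L^q normalization
    if norm == 0.0:
        return np.zeros_like(v)
    u = np.abs(v) / norm                       # magnitudes in [0, 1]
    # Index of the level at or just below each normalized magnitude.
    lo_idx = np.searchsorted(levels, u, side="right") - 1
    lo_idx = np.clip(lo_idx, 0, len(levels) - 2)
    lo, hi = levels[lo_idx], levels[lo_idx + 1]
    # Round up with probability proportional to the distance from `lo`,
    # which makes the expected rounded value equal to u.
    p_up = (u - lo) / (hi - lo)
    rounded = np.where(rng.random(u.shape) < p_up, hi, lo)
    return norm * np.sign(v) * rounded

def quantize_layerwise(layers, levels_by_type, q=2, rng=None):
    """Quantize each layer with the level set assigned to its type."""
    rng = np.random.default_rng() if rng is None else rng
    return {name: quantize_layer(v, levels_by_type[kind], q, rng)
            for name, (kind, v) in layers.items()}

# Two hypothetical layers with different scales and different level sets.
rng = np.random.default_rng(0)
layers = {
    "block0.attention": ("attention", rng.normal(size=6)),
    "block0.residual": ("residual", 10.0 * rng.normal(size=4)),
}
levels_by_type = {
    "attention": np.linspace(0.0, 1.0, 9),        # uniform grid
    "residual": np.array([0.0, 0.25, 0.5, 1.0]),  # non-uniform grid
}
quantized = quantize_layerwise(layers, levels_by_type, rng=rng)
```

Because each layer is normalized by its own norm and rounded on its own grid, a layer with large entries does not force a coarse grid on a layer with small entries, which is the intuition behind treating layers separately rather than quantizing the concatenated vector.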

Paper Structure

This paper contains 44 sections, 41 theorems, 205 equations, 4 figures, 3 tables, 1 algorithm.

Key Result

Theorem 5.1

With unbiased layer-wise quantization with $L^q$ normalization of a vector ${\bm{v}} \in \mathbb{R}^d$, i.e. $\mathbb{E}_{q_{\mathbb{L}^M}} [Q_{\mathbb{L}^M}({\bm{v}})] = {\bm{v}}$, we have that $\mathbb{E}_{q_{\mathbb{L}^M}} [\|Q_{\mathbb{L}^M}({\bm{v}}) - {\bm{v}}\|_2^2] \leq \varepsilon_Q \|{\bm{v}}\|_2^2$, where $\varepsilon_Q = \frac{(\bar{\ell}^M-1)^2 }{ 4\bar{\ell}^M} + (\bar{\ell}_1^M d^{\frac{1}{\min\{q,2\}}}-1) \mathds{1}\{ d \geq d_{th} \} + \frac{(\bar{\ell}_1^M)^{2}}{4} d^{\frac{2}{\min\{q,2\}}}$
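The unbiasedness condition in this theorem, and the relative quantization error that a variance bound of this kind controls, can be estimated numerically. The following is a minimal Python sketch under assumptions that are not taken from the paper: a single vector rather than several layers, a uniform grid with `s` intervals, and hypothetical helper names `stochastic_round` and `empirical_moments`. It does not compute $\varepsilon_Q$.

```python
import numpy as np

def stochastic_round(v, s, q, rng, trials=1):
    """Draw `trials` unbiased quantizations of v onto s uniform intervals.

    Assumes v is a nonzero 1-D array; the grid {0, 1/s, ..., 1} is an
    illustrative choice, not the level sets used in the paper.
    Returns an array of shape (trials, v.size).
    """
    norm = np.linalg.norm(v, ord=q)            # L^q normalization
    scaled = np.abs(v) / norm * s              # each entry lies in [0, s]
    floor = np.floor(scaled)
    # Round up with probability equal to the fractional part (unbiased).
    up = rng.random((trials, v.size)) < (scaled - floor)
    return norm * np.sign(v) * (floor + up) / s

def empirical_moments(v, s, q, trials=20000, seed=0):
    """Monte Carlo estimates of the bias norm and the relative squared error."""
    rng = np.random.default_rng(seed)
    samples = stochastic_round(v, s, q, rng, trials)
    bias = np.linalg.norm(samples.mean(axis=0) - v)
    rel_err = np.mean(np.sum((samples - v) ** 2, axis=1)) / np.sum(v ** 2)
    return bias, rel_err

v = np.random.default_rng(1).normal(size=16)
for q in (1, 2, np.inf):
    bias, rel_err = empirical_moments(v, s=4, q=q)
    print(f"q={q}: bias={bias:.4f}, relative error={rel_err:.4f}")
```

For this simple uniform grid with $q = 2$, each coordinate's variance is at most $\|{\bm{v}}\|_2^2 / (4 s^2)$, so the relative error is at most $d / (4 s^2)$; the estimated bias should shrink toward zero as `trials` grows.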

Figures (4)

  • Figure 1: A Visualization for Layer-wise vs Global Quantization
  • Figure 2: CIFAR10
  • Figure 3: CIFAR100
  • Figure 5: Ablation Study for Transformer-XL

Theorems & Definitions (75)

  • Remark 2.6
  • Remark 3.1
  • Remark 3.2
  • Remark 3.3
  • Remark 4.1
  • Theorem 5.1: Variance Bound
  • Remark 5.2
  • Theorem 5.3: Code-length Bound
  • Remark 5.4
  • Theorem 5.5: Algorithm 1 under Absolute Noise
  • ...and 65 more