Correlated Quantization for Faster Nonconvex Distributed Optimization

Andrei Panferov, Yury Demidovich, Ahmad Rammal, Peter Richtárik

TL;DR

The paper addresses communication bottlenecks in distributed nonconvex optimization by integrating Correlated Quantizers (CQ) into MARINA and developing a weighted AB-inequality framework with Hessian-variance $L_{\pm}$. It demonstrates that correlation can substantially reduce the mean-squared error of compressed gradients and, in the zero-Hessian-variance regime, yields superior communication complexity bounds compared to Independent Quantizers (IQ), with MARINA+CQ achieving $\mathcal{C}_{\text{cor}} = O\left(\frac{\Delta^0 L}{\varepsilon^2} \min\{ d, 1+\frac{d}{n} \} \right)$ while $\mathcal{C}_{\text{ind}} = O\left(\frac{\Delta^0 L}{\varepsilon^2} \min\{ d, 1+\frac{d}{\sqrt{n}} \} \right)$; the ratio can reach about 7.29 for $d=n$. The work also introduces a PermK+CQ compressor and an importance-sampling variant, extending the theory to biased and correlated compressors beyond unbiased assumptions, and validates these findings with extensive experiments on quadratic and nonconvex tasks. This advances practical, communication-efficient distributed nonconvex optimization by leveraging correlation structure in compression and expanding MARINA’s applicability. Practically, these results enable faster training with lower communication budgets in large-scale federated and distributed learning scenarios.
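To get a feel for how the two bounds above scale, the following minimal sketch evaluates only the dimension-dependent factors $\min\{d, 1+d/n\}$ and $\min\{d, 1+d/\sqrt{n}\}$. The function names `cq_factor` and `iq_factor` are hypothetical, and the common prefactor $\Delta^0 L/\varepsilon^2$ and all constants hidden in the $O(\cdot)$ notation are ignored, so the printed ratios are illustrative only and are not meant to reproduce the specific ratio quoted above.

```python
import math


def cq_factor(d: int, n: int) -> float:
    """Dimension-dependent factor of the MARINA+CQ bound: min{d, 1 + d/n}."""
    return min(d, 1 + d / n)


def iq_factor(d: int, n: int) -> float:
    """Dimension-dependent factor of the MARINA+IQ bound: min{d, 1 + d/sqrt(n)}."""
    return min(d, 1 + d / math.sqrt(n))


# Compare the two factors in the d = n regime for a few problem sizes.
for n in (4, 16, 64, 256):
    d = n
    cq = cq_factor(d, n)
    iq = iq_factor(d, n)
    print(f"n = d = {n:4d}  CQ factor = {cq:6.2f}  IQ factor = {iq:6.2f}  IQ/CQ = {iq / cq:5.2f}")
```

In this simplified view the gap between the two factors widens as $n$ grows, which matches the qualitative claim that correlation helps more with more workers.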

Abstract

Quantization (Alistarh et al., 2017) is an important (stochastic) compression technique that reduces the volume of transmitted bits during each communication round in distributed model training. Suresh et al. (2022) introduce correlated quantizers and show their advantages over independent counterparts by analyzing distributed SGD communication complexity. We analyze the state-of-the-art distributed non-convex optimization algorithm MARINA (Gorbunov et al., 2022) utilizing the proposed correlated quantizers and show that it outperforms the original MARINA and distributed SGD of Suresh et al. (2022) with regard to the communication complexity. We significantly refine the original analysis of MARINA without any additional assumptions using the weighted Hessian variance (Tyurin et al., 2022), and then we expand the theoretical framework of MARINA to accommodate a substantially broader range of potentially correlated and biased compressors, thus extending the applicability of the method beyond the conventional independent unbiased compressor setup. Extensive experimental results corroborate our theoretical findings.
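The difference between independent and correlated quantization can be illustrated with a simplified one-bit scalar sketch in the spirit of Suresh et al. (2022). It assumes each of the $n$ clients holds a value in $[0, 1]$ and that the clients share a random permutation that stratifies their rounding thresholds. All function names here are hypothetical, and this is not the compressor analyzed with MARINA in the paper, only a toy demonstration of why correlation can lower the error of the averaged estimate.

```python
import random


def independent_quantize(xs, rng):
    """Round each x in [0, 1] to 0 or 1 using an independent uniform threshold."""
    return [1.0 if rng.random() < x else 0.0 for x in xs]


def correlated_quantize(xs, rng):
    """Round each x in [0, 1] to 0 or 1 using stratified, shared thresholds.

    Client i draws its threshold from the bin [p_i/n, (p_i+1)/n), where p is a
    shared random permutation. Each threshold is still marginally uniform on
    [0, 1), so every client's output remains unbiased.
    """
    n = len(xs)
    perm = list(range(n))
    rng.shuffle(perm)
    return [1.0 if (p + rng.random()) / n < x else 0.0 for x, p in zip(xs, perm)]


def mse_of_mean(quantizer, xs, trials, seed):
    """Empirical mean-squared error of the averaged quantized values."""
    rng = random.Random(seed)
    true_mean = sum(xs) / len(xs)
    total = 0.0
    for _ in range(trials):
        q = quantizer(xs, rng)
        total += (sum(q) / len(q) - true_mean) ** 2
    return total / trials


if __name__ == "__main__":
    xs = [0.37] * 100  # assumption: all clients hold similar values
    print("independent MSE:", mse_of_mean(independent_quantize, xs, 500, seed=0))
    print("correlated  MSE:", mse_of_mean(correlated_quantize, xs, 500, seed=0))
```

When the clients' values are close to each other, the stratified thresholds make the rounding errors cancel across clients, so the averaged estimate is much more accurate than under independent rounding.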

Paper Structure (37 sections, 15 theorems, 68 equations, 6 figures, 4 tables, 6 algorithms)

Key Result

Proposition 1

If, for all $i\in[n]$, $\mathcal{Q}_i\in\mathbb{U}\left(\omega_i\right)$ and $\{\mathcal{Q}_i\}_{i=1}^n \in \mathbb{U}_{\rm ind}$, then $\{\mathcal{Q}_i\}_{i=1}^n\in\mathbb{U}\left(\max_i\{\omega_i\},0\right)$. If we further assume that the compressors are independent, then $\{\mathcal{Q}_i\}_{i=1}^n$ ...

Figures (6)

  • Figure 1: Comparison of CQ, IQ, and DRIVE with $\mathsf{MARINA}$ on quadratic optimization tasks with diverse $L_\pm$ values
  • Figure 2: Comparison of CQ, IQ and DRIVE with $\mathsf{MARINA}$ on LibSVM datasets. The points represent the uncompressed rounds of the algorithm
  • Figure 3: Comparison of PermK+CQ, CQ and DRIVE with $\mathsf{MARINA}$ on quadratic optimization task with $L_\pm=0$
  • Figure 4: (a)/(b): Logarithmic speedup of $\mathsf{MARINA}$ with Correlated/Uncorrelated Quantization over Gradient Descent. (c): Logarithmic speedup of $\mathsf{MARINA}+$CQ compared to $\mathsf{MARINA}+$IQ
  • Figure 5: Comparison of DRIVE with or without Importance Sampling (IS) with $\mathsf{MARINA}$ on quadratic optimization tasks with diverse $L_\pm$ values
  • ...and 1 more figure

Theorems & Definitions (31)

  • Definition 1
  • Definition 2
  • Definition 3: Natural dithering
  • Proposition 1
  • Definition 4: Hessian Variance
  • Proposition 2
  • Definition 5
  • Proposition 3
  • Definition 6
  • Corollary 1
  • ...and 21 more