Correlated Quantization for Faster Nonconvex Distributed Optimization

Andrei Panferov; Yury Demidovich; Ahmad Rammal; Peter Richtárik

TL;DR

The paper addresses communication bottlenecks in distributed nonconvex optimization by integrating Correlated Quantizers (CQ) into MARINA and developing a weighted AB-inequality framework with Hessian-variance $L_{\pm}$. It demonstrates that correlation can substantially reduce the mean-squared error of compressed gradients and, in the zero-Hessian-variance regime, yields superior communication complexity bounds compared to Independent Quantizers (IQ), with MARINA+CQ achieving $\mathcal{C}_{\text{cor}} = O\left(\frac{\Delta^0 L}{\varepsilon^2} \min\{ d, 1+\frac{d}{n} \} \right)$ while $\mathcal{C}_{\text{ind}} = O\left(\frac{\Delta^0 L}{\varepsilon^2} \min\{ d, 1+\frac{d}{\sqrt{n}} \} \right)$; the ratio can reach about 7.29 for $d=n$. The work also introduces a PermK+CQ compressor and an importance-sampling variant, extending the theory to biased and correlated compressors beyond unbiased assumptions, and validates these findings with extensive experiments on quadratic and nonconvex tasks. This advances practical, communication-efficient distributed nonconvex optimization by leveraging correlation structure in compression and expanding MARINA’s applicability. Practically, these results enable faster training with lower communication budgets in large-scale federated and distributed learning scenarios.

Abstract

Quantization (Alistarh et al., 2017) is an important (stochastic) compression technique that reduces the number of bits transmitted in each communication round of distributed model training. Suresh et al. (2022) introduce correlated quantizers and show their advantages over independent counterparts by analyzing the communication complexity of distributed SGD. We analyze the state-of-the-art distributed non-convex optimization algorithm MARINA (Gorbunov et al., 2022) equipped with these correlated quantizers and show that it outperforms both the original MARINA and the distributed SGD of Suresh et al. (2022) in terms of communication complexity. We significantly refine the original analysis of MARINA, without any additional assumptions, using the weighted Hessian variance (Tyurin et al., 2022). We then expand the theoretical framework of MARINA to accommodate a substantially broader range of potentially correlated and biased compressors, thus extending the applicability of the method beyond the conventional independent unbiased compressor setup. Extensive experimental results corroborate our theoretical findings.
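To illustrate why correlating the quantizers across workers can reduce the error of the aggregated estimate, the following is a minimal sketch and not the compressor analyzed in the paper. It assumes each of the $n$ workers holds a single scalar in $[0, 1]$ and sends one bit, and it builds the correlated thresholds from a shared random permutation plus a small independent offset, in the spirit of Suresh et al. (2022). The function names (`independent_quantize`, `correlated_quantize`, `mean_mse`) and the test data are hypothetical choices made for this example.

```python
import numpy as np


def independent_quantize(x, rng):
    """One-bit stochastic rounding; every worker draws its own uniform threshold."""
    u = rng.random(x.shape[0])
    return (u < x).astype(float)


def correlated_quantize(x, rng):
    """One-bit stochastic rounding with stratified (correlated) thresholds.

    Worker i uses u_i = perm_i / n + gamma_i, where perm is a shared random
    permutation of {0, ..., n-1} and gamma_i ~ U[0, 1/n). Each u_i is still
    marginally uniform on [0, 1), so each bit stays unbiased, but the
    thresholds jointly cover [0, 1) evenly.
    """
    n = x.shape[0]
    perm = rng.permutation(n)
    gamma = rng.random(n) / n
    u = perm / n + gamma
    return (u < x).astype(float)


def mean_mse(quantizer, x, trials, rng):
    """Empirical MSE of the averaged quantized values against the true mean."""
    true_mean = x.mean()
    errs = [(quantizer(x, rng).mean() - true_mean) ** 2 for _ in range(trials)]
    return float(np.mean(errs))


rng = np.random.default_rng(0)
x = rng.uniform(0.4, 0.6, size=16)  # assumed inputs: similar values across workers

mse_ind = mean_mse(independent_quantize, x, 2000, rng)
mse_cor = mean_mse(correlated_quantize, x, 2000, rng)
print(f"independent MSE: {mse_ind:.5f}")
print(f"correlated  MSE: {mse_cor:.5f}")
```

In this toy setting the workers hold similar values, which is the case where stratified thresholds help most: the individual rounding errors tend to cancel in the average instead of accumulating independently.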

Paper Structure (37 sections, 15 theorems, 68 equations, 6 figures, 4 tables, 6 algorithms)

INTRODUCTION
CONTRIBUTIONS
MAIN RESULTS
AB-Inequality: Better Control of MSE
Why Correlation May Help
Zero-Hessian-Variance Regime
Superior Quantizers for $\mathsf{MARINA}$
Combination with Sparsification
EXPERIMENTS
Quadratic Optimization Tasks with Various Hessian Variances $L_\pm$
Non-Convex Logistic Regression
Combination with PermK
Numerical Complexity Analysis in the d-n Plane
CONCLUSIONS AND FUTURE WORK
Quantizers in Homogeneous Data Regime
...and 22 more sections

Key Result

Proposition 1

If, for all $i\in[n]$, $\mathcal{Q}_i\in\mathbb{U}\left(\omega_i\right)$ and $\{\mathcal{Q}_i\}_{i=1}^n \in \mathbb{U}_{\rm ind}$, then $\{\mathcal{Q}_i\}_{i=1}^n\in\mathbb{U}\left(\max_i\{\omega_i\},0\right)$. If we further assume that the compressors are independent, then $\{\mathcal{Q}_i\}_{i=1}^n$ ...

Figures (6)

Figure 1: Comparison of CQ, IQ, and DRIVE with $\mathsf{MARINA}$ on quadratic optimization tasks with diverse $L_\pm$ values
Figure 2: Comparison of CQ, IQ and DRIVE with $\mathsf{MARINA}$ on LibSVM datasets. The points represent the uncompressed rounds of the algorithm
Figure 3: Comparison of PermK+CQ, CQ and DRIVE with $\mathsf{MARINA}$ on quadratic optimization task with $L_\pm=0$
Figure 4: (a)/(b): Logarithmic speedup of $\mathsf{MARINA}$ with Correlated/Uncorrelated Quantization over Gradient Descent. (c): Logarithmic speedup of $\mathsf{MARINA}+$CQ compared to $\mathsf{MARINA}+$IQ
Figure 5: Comparison of DRIVE with or without Importance Sampling (IS) with $\mathsf{MARINA}$ on quadratic optimization tasks with diverse $L_\pm$ values
...and 1 more figures

Theorems & Definitions (31)

Definition 1
Definition 2
Definition 3: Natural dithering
Proposition 1
Definition 4: Hessian Variance
Proposition 2
Definition 5
Proposition 3
Definition 6
Corollary 1
...and 21 more
