Provable Benefit of Random Permutations over Uniform Sampling in Stochastic Coordinate Descent

Donghwa Kim, Jaewook Lee, Chulhee Yun

TL;DR

The paper compares the convergence of two stochastic coordinate descent variants, random CD (RCD) and random-permutation CD (RPCD), on positive-definite quadratic objectives. For a class of quadratics with permutation-invariant Hessians, it proves that the contraction-rate upper bound for RPCD is strictly smaller than the contraction-rate lower bound for RCD on every instance in the class, which yields a provable performance gap. The authors also conjecture that this class contains the worst-case instances for RPCD among all positive-definite quadratics; combined with their RCD lower bound, the conjecture would extend the gap to general positive-definite quadratics. The analysis combines spectral arguments on matrix operators with a dimension-reduction approach for permutation-invariant Hessians, and it is supported by algorithmic searches and numerical experiments, giving a rigorous justification for the empirical advantage of random permutations in coordinate descent.

Abstract

We analyze the convergence rates of two popular variants of coordinate descent (CD): random CD (RCD), in which the coordinates are sampled uniformly at random, and random-permutation CD (RPCD), in which random permutations are used to select the update indices. Despite abundant empirical evidence that RPCD outperforms RCD in various tasks, the theoretical gap between the two algorithms' performance has remained elusive. Even for the benign case of positive-definite quadratic functions with permutation-invariant Hessians, previous efforts have failed to demonstrate a provable performance gap between RCD and RPCD. To this end, we present novel results showing that, for a class of quadratics with permutation-invariant structures, the contraction rate upper bound for RPCD is always strictly smaller than the contraction rate lower bound for RCD for every individual problem instance. Furthermore, we conjecture that this function class contains the worst-case examples of RPCD among all positive-definite quadratics. Combined with our RCD lower bound, this conjecture extends our results to the general class of positive-definite quadratic functions.
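
To make the two sampling schemes concrete, here is a minimal NumPy sketch of RCD and RPCD on the quadratic $f({\bm{x}}) = \frac{1}{2} {\bm{x}}^{\top} {\bm{A}} {\bm{x}}$ with ${\bm{A}} = \sigma {\bm{I}} + (1 - \sigma) {\bm{1}} {\bm{1}}^{\top}$. It is an illustration, not the authors' code: the function names (`make_hessian`, `coordinate_step`, `run_cd`) are hypothetical, and the step rule (exact minimization along the chosen coordinate) is an assumption, since the summary does not state the step size. One "epoch" here means $n$ coordinate updates for both methods, so the two curves are compared at equal cost.

```python
import numpy as np


def make_hessian(n, sigma):
    """Permutation-invariant matrix A = sigma * I + (1 - sigma) * 1 1^T."""
    return sigma * np.eye(n) + (1.0 - sigma) * np.ones((n, n))


def coordinate_step(A, x, i):
    """Exactly minimize f(x) = 0.5 * x^T A x along coordinate i, in place.

    Assumption: exact coordinate minimization, i.e. step size 1 / A[i, i].
    """
    x[i] -= (A[i] @ x) / A[i, i]


def run_cd(A, x0, epochs, rng, permute):
    """Run `epochs` passes of n coordinate updates each.

    permute=False: RCD, every index is drawn uniformly with replacement.
    permute=True:  RPCD, every pass visits a fresh random permutation.
    Returns the objective value before the first pass and after each pass.
    """
    n = len(x0)
    x = x0.copy()
    values = [0.5 * x @ A @ x]
    for _ in range(epochs):
        if permute:
            order = rng.permutation(n)
        else:
            order = rng.integers(0, n, size=n)
        for i in order:
            coordinate_step(A, x, i)
        values.append(0.5 * x @ A @ x)
    return np.array(values)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = make_hessian(n=25, sigma=0.7)
    x0 = rng.standard_normal(25)
    f_rcd = run_cd(A, x0, epochs=30, rng=rng, permute=False)
    f_rpcd = run_cd(A, x0, epochs=30, rng=rng, permute=True)
    print("RCD  final objective:", f_rcd[-1])
    print("RPCD final objective:", f_rpcd[-1])
```

A single run is noisy, so any comparison between the two printed values should be averaged over several seeds before drawing conclusions.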

Paper Structure

This paper contains 47 sections, 26 theorems, 299 equations, 15 figures, 2 tables, 1 algorithm.

Key Result

Lemma 2.2

The matrix operators ${\mathcal{M}}_{{\bm{A}}}^{\text{RCD}}$ and ${\mathcal{M}}_{{\bm{A}}}^{\text{RPCD}}$ are both diagonalizable.

Figures (15)

  • Figure 1: Performance comparison of RCD and RPCD. We use the objective function $f({\bm{x}}) = \frac{1}{2} {\bm{x}}^{\top} {\bm{A}} {\bm{x}}$ with ${\bm{A}} = \sigma {\bm{I}} + (1 - \sigma) {\bm{1}} {\bm{1}}^{\top}$. For this plot we use $\sigma = 0.7$ and dimension $n = 25$.
  • Figure 2: Plots of $\rho(\left. {\mathcal{M}}_{{\bm{A}}_{n}}^{\text{RPCD}} \right|_{{\mathcal{S}}})$ (blue), $\max_{2 \le k \le n} \rho(\left. {\mathcal{M}}_{{\bm{A}}_{k}}^{\text{RPCD}} \right|_{{\mathcal{S}}})$ (yellow), the RPCD upper bound in \ref{thm:rpcdub} (green), and the $n$-th power (for fair comparison) of the stronger RCD lower bound for ${\bm{A}} \in {\mathcal{A}}_{\sigma}$ in \ref{thm:rcdlbpi} (red).
  • Figure 3: Numerical experiments comparing RCD and RPCD. We plot the mean values and min-max range over multiple trials of RCD/RPCD. For (i)-(iii), we use $n = 25$, $\sigma \in \{0.3, 0.7\}$. The experiments are conducted on (i) a quadratic function with ${\bm{A}} \in {\mathcal{A}}_{\sigma}$, (ii) a random quadratic with $\lambda_{\min} ({\bm{A}}) = \sigma$, (iii) a random quadratic + (scaled and transformed) LSE function, and (iv) an $\ell_2$-regularized sparse logistic regression objective with $n=100$.
  • Figure 4: Plot of $\lVert{\bm{A}}^{-\frac{1}{2}} {\mathcal{M}}_{{\bm{A}}} ({\bm{A}}) {\bm{A}}^{-\frac{1}{2}}\rVert$ (with ${\bm{A}} = \sigma {\bm{I}} + (1 - \sigma) {\bm{1}} {\bm{1}}^{\top}$) and the RPCD upper bound from \ref{thm:rpcdub} for $n = 100$.
  • Figure 5: RCD vs RPCD, permutation-invariant quadratic functions. ${\bm{A}}=\sigma {\bm{I}} + (1-\sigma){\bm{1}} {\bm{1}}^{\top}, n=25, \sigma \in \{0.1,\dots,0.9\}.$ The $y$-axis is in log scale.
  • ...and 10 more figures

Theorems & Definitions (53)

  • Definition 2.1
  • Remark
  • Remark
  • Remark
  • Lemma 2.2
  • Theorem 3.1
  • Remark
  • Definition 3.2
  • Theorem 3.3
  • Remark
  • ...and 43 more