Block Acceleration Without Momentum: On Optimal Stepsizes of Block Gradient Descent for Least-Squares

Liangzu Peng, Wotao Yin

TL;DR

This work analyzes a two-block least-squares problem and shows that block gradient descent (BGD) can be strictly faster than both vanilla gradient descent (GD) and Polyak's momentum method (HB) when the block-orthogonality condition is satisfied. By minimizing the spectral radius of the two-block update matrix $M(\gamma_1,\gamma_2)$, the authors derive closed-form optimal stepsizes and show $\rho_{BGD}^* \le (\rho_{HB}^*)^2 < (\rho_{GD}^*)^2$, with distinct expressions for full-rank and rank-deficient cross-block interactions. The analysis hinges on a Spectrum Lemma that characterizes the eigenstructure of $M$, and it decouples into simplified cases (e.g., $\gamma_1=1$ or $\gamma_2=1$, and $\gamma_1=\gamma_2$) before solving the general region-based optimization. The results highlight the role of cross-smoothness, offer insights into acceleration without momentum, and suggest numerical avenues (e.g., SDP) to extend the approach to broader problems. Overall, the paper connects block-coupled optimization dynamics to spectral properties, revealing when and how BGD can outperform momentum-based methods in least-squares settings.
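The idea of minimizing a spectral radius over stepsizes can be illustrated numerically. The sketch below is not the paper's closed-form solution. It assumes that each block $A_1, A_2$ has orthonormal columns, that the blocks are updated cyclically (block 1, then block 2), and that the error therefore propagates through a matrix built from $C = A_1^\top A_2$. The names `update_matrix` and `spectral_radius`, the random test data, and the grid over $(0,2)\times(0,2)$ are illustrative choices, not taken from the paper.

```python
import numpy as np

def update_matrix(C, g1, g2):
    """Error-propagation matrix of one cyclic two-block sweep.

    Assumes A1.T @ A1 = I, A2.T @ A2 = I, and C = A1.T @ A2 (hypothetical setup).
    """
    n1, n2 = C.shape
    # Block-1 update: e1 <- (1 - g1) e1 - g1 C e2, e2 unchanged.
    first = np.block([[(1 - g1) * np.eye(n1), -g1 * C],
                      [np.zeros((n2, n1)), np.eye(n2)]])
    # Block-2 update, using the fresh e1: e2 <- (1 - g2) e2 - g2 C^T e1.
    second = np.block([[np.eye(n1), np.zeros((n1, n2))],
                       [-g2 * C.T, (1 - g2) * np.eye(n2)]])
    return second @ first

def spectral_radius(M):
    """Largest eigenvalue modulus of M."""
    return np.max(np.abs(np.linalg.eigvals(M)))

# Random blocks with orthonormal columns (illustrative data only).
rng = np.random.default_rng(0)
A1, _ = np.linalg.qr(rng.standard_normal((30, 4)))
A2, _ = np.linalg.qr(rng.standard_normal((30, 3)))
C = A1.T @ A2

# Classical choice: one over the block-wise smoothness constant, here 1.
rho_unit = spectral_radius(update_matrix(C, 1.0, 1.0))

# Coarse grid search over stepsizes; the grid includes (1, 1).
grid = np.linspace(0.05, 1.95, 39)
rho_grid = np.array([[spectral_radius(update_matrix(C, g1, g2))
                      for g2 in grid] for g1 in grid])
i, j = np.unravel_index(np.argmin(rho_grid), rho_grid.shape)
best = (grid[i], grid[j], rho_grid[i, j])

print("rho at (1, 1):", rho_unit)
print("best grid point (gamma1, gamma2, rho):", best)
```

With unit stepsizes the spectral radius in this setup equals the squared largest singular value of $C$, which gives a baseline for judging other stepsize pairs on the grid.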

Abstract

Block coordinate descent is a powerful algorithmic template suitable for big data optimization. This template admits a lot of variants including block gradient descent (BGD), which performs gradient descent on a selected block of variables, while keeping other variables fixed. For a very long time, the stepsize for each block has tacitly been set to one divided by the block-wise Lipschitz smoothness constant, imitating the vanilla stepsize rule for gradient descent (GD). However, such a choice for BGD has not yet been able to theoretically justify its empirical superiority over GD, as existing convergence rates for BGD have worse constants than GD in the deterministic cases. To discover such theoretical justification, we set up a simple environment where we consider BGD applied to least-squares with two blocks of variables. Assuming the data matrix corresponding to each block is orthogonal, we find optimal stepsizes of BGD in closed form, which provably lead to asymptotic convergence rates twice as fast as GD with Polyak's momentum; this means, under that orthogonality assumption, one can accelerate BGD by just tuning stepsizes and without adding any momentum. An application that satisfies this assumption is generalized alternating projection between two subspaces, and applying our stepsizes to it improves the prior convergence rate that was once claimed, slightly inaccurately, to be optimal. The main proof idea is to minimize, in stepsize variables, the spectral radius of a matrix that controls convergence rates.
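As a minimal sketch of the setting described in the abstract, the following code runs two-block BGD on a least-squares objective $\tfrac12\|A_1 x_1 + A_2 x_2 - b\|^2$. It assumes that "orthogonal" means each block has orthonormal columns, so the block-wise smoothness constant is 1, and that the blocks are updated cyclically. The function name `bgd_step`, the problem sizes, and the random data are hypothetical. The stepsizes used are the classical ones, not the paper's optimal stepsizes.

```python
import numpy as np

def bgd_step(x1, x2, A1, A2, b, g1, g2):
    """One cyclic sweep of two-block gradient descent on 0.5 * ||A1 x1 + A2 x2 - b||^2."""
    # Gradient step on block 1 with block 2 held fixed.
    x1 = x1 - g1 * A1.T @ (A1 @ x1 + A2 @ x2 - b)
    # Gradient step on block 2, using the updated block 1.
    x2 = x2 - g2 * A2.T @ (A1 @ x1 + A2 @ x2 - b)
    return x1, x2

# Illustrative data: two blocks with orthonormal columns (assumed setup).
rng = np.random.default_rng(0)
A1, _ = np.linalg.qr(rng.standard_normal((30, 4)))
A2, _ = np.linalg.qr(rng.standard_normal((30, 3)))
b = rng.standard_normal(30)

# Reference least-squares solution for measuring the error.
x_star = np.linalg.lstsq(np.hstack([A1, A2]), b, rcond=None)[0]

# Classical stepsizes: one over the block-wise smoothness constant (= 1 here).
x1, x2 = np.zeros(4), np.zeros(3)
for t in range(200):
    x1, x2 = bgd_step(x1, x2, A1, A2, b, 1.0, 1.0)

print("final error:", np.linalg.norm(np.concatenate([x1, x2]) - x_star))
```

With unit stepsizes and orthonormal blocks, each block update is an exact minimization over that block, which is why this choice corresponds to alternating projection.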

Paper Structure

This paper contains 17 sections, 22 theorems, 115 equations, 2 figures, and 1 table.

Key Result

Theorem 1.1

Suppose Assumption \ref{assumption:BWO} below holds. Run BGD (\ref{eq:BGD}) and accelerated GD with Polyak's momentum (i.e., the heavy ball method, HB), each with its "optimal" stepsizes. Then BGD is twice as fast as HB.
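One way to read "twice as fast" is through the spectral radii quoted in the TL;DR. The short derivation below is a sketch under the assumption that the error of each method decays asymptotically like $\rho^t$, where $\rho$ is the corresponding minimum spectral radius. The tolerance $\varepsilon$ and the iteration counts $t_{\mathrm{BGD}}, t_{\mathrm{HB}}$ are notation introduced here for illustration.

```latex
% Assumed asymptotic error models: (\rho_{BGD}^*)^t and (\rho_{HB}^*)^t.
\begin{align*}
\rho_{BGD}^* \le (\rho_{HB}^*)^2
  \;&\Longrightarrow\; (\rho_{BGD}^*)^{t} \le (\rho_{HB}^*)^{2t}, \\
t_{\mathrm{BGD}}(\varepsilon)
  = \frac{\log(1/\varepsilon)}{\log(1/\rho_{BGD}^*)}
  \;&\le\; \frac{1}{2}\cdot\frac{\log(1/\varepsilon)}{\log(1/\rho_{HB}^*)}
  = \frac{1}{2}\, t_{\mathrm{HB}}(\varepsilon).
\end{align*}
```

Under this model, $t$ iterations of BGD reduce the error at least as much as $2t$ iterations of HB.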

Figures (2)

  • Figure 1: We divide the quadrant $\{ (\gamma_1,\gamma_2):\gamma_1>0,\gamma_2>0 \}$ of all possible stepsizes into four regions, namely $S_{00}, S_{01}, S_{10}, S_{11}$. We minimize the spectral radius $\rho(\bm{M}(\gamma_1,\gamma_2))$ over each region separately, which gives a solution to \ref{eq:minimize-sr}.
  • Figure 2: Under \ref{assumption:BWO}, \ref{figa:sr} shows the numerical values of the minimum spectral radii, $\rho_{\textnormal{GD}}^*$, $\rho_{\textnormal{HB}}^*$, and $\rho_{\textnormal{BGD}}^*$, and \ref{fig:convergence10}, \ref{fig:convergence1000}, and \ref{fig:convergence100000} show the errors of the three methods at every iteration $t$.

Theorems & Definitions (41)

  • Theorem 1.1: Informal
  • Lemma 2.1
  • Proof 1: Proof of \ref{lemma:error-decrease}
  • Lemma 2.2
  • Lemma 2.3
  • Proof 2: Proof of \ref{lemma:rho-GD-eigs-C}
  • Example 1
  • Lemma 3.1: Spectrum of $\bm{M}(\gamma_1,\gamma_2)$
  • Remark 3.2
  • Remark 3.3
  • ...and 31 more