Parameter-free Clipped Gradient Descent Meets Polyak

Yuki Takezawa; Han Bao; Ryoma Sato; Kenta Niwa; Makoto Yamada

TL;DR

This study proposes Inexact Polyak Stepsize, which converges to the optimal solution without any hyperparameter tuning. Its convergence rate is asymptotically independent of $L$ under the $L$-smooth and $(L_0, L_1)$-smooth assumptions on the loss function, matching that of clipped gradient descent with well-tuned hyperparameters.

Abstract

Gradient descent and its variants are the de facto standard algorithms for training machine learning models. Because gradient descent is sensitive to its hyperparameters, they must be tuned carefully, typically with a grid search. Grid search is time-consuming, particularly when multiple hyperparameters exist. Recent studies have therefore analyzed parameter-free methods that adjust the hyperparameters on the fly. However, existing work is limited to parameter-free methods for the stepsize, and parameter-free methods for other hyperparameters have not been explored. For instance, although the gradient clipping threshold is, alongside the stepsize, a crucial hyperparameter for preventing gradient explosion, no existing study has investigated parameter-free methods for clipped gradient descent. In this study, we investigate parameter-free methods for clipped gradient descent. Specifically, we propose Inexact Polyak Stepsize, which converges to the optimal solution without any hyperparameter tuning, and whose convergence rate is asymptotically independent of $L$ under the $L$-smooth and $(L_0, L_1)$-smooth assumptions on the loss function, matching that of clipped gradient descent with well-tuned hyperparameters. We numerically validated our convergence results using a synthetic function and demonstrated the effectiveness of our proposed methods using LSTM, Nano-GPT, and T5.
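To make the two ingredients named in the abstract concrete, the following is a minimal sketch of clipped gradient descent and of the classical Polyak stepsize. It is not the paper's Inexact Polyak Stepsize, whose update rule is not given in this summary. The test function $f(x) = x^4$, the function names, and the values of `eta`, `c`, and `steps` are illustrative assumptions. Clipped gradient descent needs both a stepsize `eta` and a threshold `c`, whereas the classical Polyak stepsize $\eta_t = (f(x_t) - f^\star) / \|\nabla f(x_t)\|^2$ needs the optimal value $f^\star$ instead.

```python
# Illustrative 1-D objective (an assumption, not taken from the paper):
# f(x) = x^4 is convex with minimum value f* = 0 at x = 0.
def f(x):
    return x ** 4


def grad(x):
    return 4.0 * x ** 3


def clipped_gd(x0, eta, c, steps):
    """Clipped gradient descent: rescale the gradient so that the
    effective update never uses a gradient norm larger than c."""
    x = x0
    for _ in range(steps):
        g = grad(x)
        norm = abs(g)
        scale = min(1.0, c / norm) if norm > 0.0 else 1.0
        x -= eta * scale * g
    return x


def polyak_gd(x0, f_star, steps):
    """Classical Polyak stepsize: eta_t = (f(x_t) - f*) / |grad f(x_t)|^2.
    Requires the optimal value f_star to be known in advance."""
    x = x0
    for _ in range(steps):
        g = grad(x)
        if g == 0.0:  # already at a stationary point
            break
        eta = (f(x) - f_star) / (g * g)
        x -= eta * g
    return x


x_clip = clipped_gd(x0=3.0, eta=0.05, c=1.0, steps=500)
x_polyak = polyak_gd(x0=3.0, f_star=0.0, steps=50)
print(f(x_clip), f(x_polyak))
```

On this particular function the Polyak update simplifies to $x_{t+1} = \tfrac{3}{4} x_t$, so no stepsize or threshold has to be chosen, at the price of knowing $f^\star$.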

Paper Structure (32 sections, 14 theorems, 64 equations, 5 figures, 7 tables, 1 algorithm)

This paper contains 32 sections, 14 theorems, 64 equations, 5 figures, 7 tables, 1 algorithm.

Introduction
Preliminary
Gradient descent & $L$-smoothness
Clipped gradient descent & $(L_0, L_1)$-smoothness
Polyak stepsize
Improved convergence result of Polyak stepsize
Connection between Polyak stepsize and clipped gradient descent
Convergence analysis of Polyak stepsize under $(L_0, L_1)$-smoothness
Making clipped gradient descent parameter-free
Inexact Polyak Stepsize
Convergence analysis of Inexact Polyak Stepsize
Asymptotic independence of $L$
Removing dependence on $D_T$
Convergence rate with respect to $T$
Related work
...and 17 more sections

Key Result

Theorem 1

Assume that $f$ is convex and $L$-smooth, and there exists an optimal solution ${\bm{x}}^\star \coloneqq \mathop{\mathrm{arg\,min}}\limits_{{\bm{x}} \in \mathbb{R}^d} f ({\bm{x}})$. Then, gradient descent with stepsize $\eta_t = \tfrac{1}{L}$ satisfies $f(\bar{{\bm{x}}}) - f({\bm{x}}^\star) \leq \mathcal{O}\left( \tfrac{L \| {\bm{x}}_0 - {\bm{x}}^\star \|^2}{T} \right)$, where $\bar{{\bm{x}}} \coloneqq \tfrac{1}{T}\sum_{t=0}^{T-1} {\bm{x}}_t$ and $T$ is the number of iterations.
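The setting of Theorem 1 can be checked numerically with a small sketch. The code below runs gradient descent with stepsize $1/L$ on a convex quadratic and compares the suboptimality of the averaged iterate $\bar{{\bm{x}}}$ against $L \|{\bm{x}}_0 - {\bm{x}}^\star\|^2 / T$. The quadratic, the curvatures in `a`, the starting point, and the explicit constant $1$ in the bound are assumptions chosen for illustration. The constant is one valid choice for the standard $\mathcal{O}(L \|{\bm{x}}_0 - {\bm{x}}^\star\|^2 / T)$ guarantee, not a value taken from the paper.

```python
# Illustrative objective (an assumption): f(x) = 0.5 * sum_i a_i * x_i^2,
# which is convex and L-smooth with L = max(a), x* = 0, and f(x*) = 0.
def f(x, a):
    return 0.5 * sum(ai * xi * xi for ai, xi in zip(a, x))


def grad(x, a):
    return [ai * xi for ai, xi in zip(a, x)]


def gd_average(x0, a, T):
    """Run T steps of gradient descent with stepsize 1/L and return
    the average of the iterates x_0, ..., x_{T-1}."""
    L = max(a)
    x = list(x0)
    running_sum = [0.0] * len(x0)
    for _ in range(T):
        running_sum = [s + xi for s, xi in zip(running_sum, x)]
        g = grad(x, a)
        x = [xi - gi / L for xi, gi in zip(x, g)]
    return [s / T for s in running_sum]


a = [1.0, 10.0, 100.0]   # hypothetical curvatures, so L = 100
x0 = [1.0, 1.0, 1.0]     # hypothetical starting point
T = 50

x_bar = gd_average(x0, a, T)
gap = f(x_bar, a)                                   # f(x_bar) - f(x*)
bound = max(a) * sum(xi * xi for xi in x0) / T      # L * ||x0 - x*||^2 / T
print(gap, bound)
```

The bound scales linearly with $L$, which is the dependence the paper's analysis aims to remove asymptotically.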

Figures (5)

Figure 1: Convergence behaviors of various methods with the synthetic function.
Figure 2: The final test loss with various hyperparameter settings. For T5, the results of DecSPS and AdaSPS were omitted because their final test loss was much larger than that of the other methods, as shown in Fig. 3. Furthermore, the results of SGD were also omitted when the final test loss became NaN or infinity.
Figure 3: Loss curves for LSTM, Nano-GPT, and T5. We plotted the training loss every $100$, $10$, and $10$ iterations for LSTM, Nano-GPT, and T5, respectively. We plotted the test loss every epoch, every $100$ iterations, and every $200$ iterations, respectively. For LSTM and Nano-GPT, Polyak stepsize did not converge, and its loss was much larger than that of the other comparison methods. Thus, to make the figure easier to read, we omit the results of Polyak stepsize here and provide the complete results, including Polyak stepsize, in the additional experiments section.
Figure 4: Loss curves for LSTM and Nano-GPT. We plotted the training loss every $100$ and $10$ iterations for LSTM and Nano-GPT, respectively. We plotted the test loss every epoch and every $100$ iterations, respectively.
Figure 5: Loss curves for Nano-GPT with different $T$.

Theorems & Definitions (22)

Theorem 1: nesterov2018lectures
Theorem 2: koloskova2023revisiting
Theorem 3: hazan2019revisiting
Proposition 1
proof
Proposition 2: jiang2023adaptive
Theorem 4
Theorem 5
Lemma 1
proof
...and 12 more
