Towards Stability of Parameter-free Optimization

Yijiang Pang, Shuyang Yu, Bao Hoang, Jiayu Zhou

TL;DR

The paper tackles learning-rate tuning challenges in adaptive gradient optimization by introducing a parameter-free optimizer, AdamG, built on a golden step size for AdaGrad-Norm. It develops a general framework (golden step size) and two concrete instantiations, GOG and AdamG, leveraging a scale-free property to avoid problem-specific tuning. A new reliability criterion is proposed to measure cross-task stability, and extensive experiments across 42 tasks (images and NLP) show that AdamG achieves reliability and solution quality close to manually tuned Adam, with competitive convergence. The work offers a practical pathway to deploy adaptive optimizers without tuning while identifying theoretical guarantees and tail-task limitations as directions for future research.

Abstract

Hyperparameter tuning, particularly the selection of an appropriate learning rate in adaptive gradient training methods, remains a challenge. To tackle it, we propose a novel parameter-free optimizer, AdamG (Adam with the golden step size), designed to adapt automatically to diverse optimization problems without manual tuning. The core technique underlying AdamG is our golden step size derived for the AdaGrad-Norm algorithm, which is expected to help AdaGrad-Norm preserve tuning-free convergence and approximate the optimal step size in expectation across various optimization scenarios. To better evaluate tuning-free performance, we propose a novel evaluation criterion, reliability, which complements classical performance criteria in assessing the efficacy of parameter-free optimizers. Empirical results demonstrate that AdamG outperforms other parameter-free baselines and is consistently on par with Adam using a manually tuned learning rate across various optimization tasks.

Paper Structure

This paper contains 25 sections, 3 theorems, 12 equations, 4 figures, 14 tables, 2 algorithms.

Key Result

Corollary 3.3

Given Assumptions ass_lsmooth and ass_l0l1, for AdaGrad-Norm with any learning rate $\eta > 0$, we have in expectation that: where $K$ denotes the total number of steps and $v_{K}$ is the accumulated sum of squared gradient norms (see Algorithm alg_god).
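
For readers unfamiliar with the quantities named in the corollary, the following is a minimal sketch of the standard AdaGrad-Norm update, in which a single scalar accumulator $v$ (the running sum of squared gradient norms) scales the step for every coordinate. It is not the paper's Algorithm and does not implement the golden step size; the function name `adagrad_norm`, the small initial value `v0`, and the toy quadratic objective are assumptions made purely for illustration.

```python
import math

def adagrad_norm(grad_fn, x0, eta=1.0, steps=100, v0=1e-8):
    """Standard AdaGrad-Norm: x <- x - eta * g / sqrt(v), with scalar v."""
    x = list(x0)
    v = v0  # hypothetical tiny initial value to avoid division by zero
    for _ in range(steps):
        g = grad_fn(x)
        # Accumulate the squared gradient norm; after K steps this is v_K.
        v += sum(gi * gi for gi in g)
        scale = eta / math.sqrt(v)
        x = [xi - scale * gi for xi, gi in zip(x, g)]
    return x, v

# Toy objective (an assumption): f(x) = 0.5 * ||x||^2, whose gradient is x.
x_final, v_final = adagrad_norm(lambda x: list(x), [3.0, -4.0], eta=1.0, steps=200)
print(math.hypot(*x_final), v_final)
```

Here `eta` plays the role of the learning rate $\eta$ and the returned `v_final` corresponds to $v_{K}$ with $K = 200$. On this toy problem the iterate shrinks toward the origin without any tuning of `eta`, which is the kind of behavior the corollary addresses in general.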

Figures (4)

  • Figure 1: CIFAR10 experiments. Note: R.I. stands for randomly initialized.
  • Figure 2: CIFAR100 experiments. Note: R.I. stands for randomly initialized.
  • Figure 3: Tiny-ImageNet experiments. Note: R.I. stands for randomly initialized.
  • Figure 4: BERT and GPT2 under GLUE benchmark experiments.

Theorems & Definitions (5)

  • Corollary 3.3: A simple variant of Theorem 2 in wang2023convergence
  • Theorem 3.4: Example adopted from levy2017online and grimmer2019convergence
  • Definition 4.1: Reliability
  • Corollary C.1: A simple variant of Theorem 2 in wang2023convergence
  • proof