Adaptive Gradient Normalization and Independent Sampling for (Stochastic) Generalized-Smooth Optimization

Yufeng Yang, Erin Tripp, Yifan Sun, Shaofeng Zou, Yi Zhou

TL;DR

The paper addresses a gap in optimization for generalized-smooth nonconvex problems by introducing adaptive gradient normalization and an independently sampled stochastic method. It develops AN-GD to exploit generalized-PŁ geometry and introduces IAN-SGD, which uses independent sampling and gradient clipping to achieve $\mathcal{O}(\epsilon^{-4})$ sample complexity under relaxed noise assumptions. Theoretical results detail descent properties and convergence rates across PŁ-like regimes, while experiments in phase retrieval, distributionally robust optimization, and deep nets demonstrate practical advantages and robustness. The work advances first-order methods for nonconvex generalized-smooth objectives and opens avenues for combining independent sampling and normalization with momentum or variance reduction for improved efficiency.

Abstract

Recent studies have shown that many nonconvex machine learning problems satisfy a generalized-smooth condition that extends beyond traditional smooth nonconvex optimization. However, the existing algorithms are not fully adapted to such generalized-smooth nonconvex geometry and encounter significant technical limitations in their convergence analysis. In this work, we first analyze the convergence of adaptively normalized gradient descent under function geometries characterized by generalized smoothness and the generalized PŁ condition, revealing the advantage of adaptive gradient normalization. Our results provide theoretical insights into adaptive normalization across various scenarios. For stochastic generalized-smooth nonconvex optimization, we propose the Independent-Adaptively Normalized Stochastic Gradient Descent (IAN-SGD) algorithm, which leverages adaptive gradient normalization, independent sampling, and gradient clipping to achieve an $\mathcal{O}(\epsilon^{-4})$ sample complexity under relaxed noise assumptions. Experiments on large-scale nonconvex generalized-smooth problems demonstrate the fast convergence of our algorithm.
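The three ingredients named in the abstract (adaptive normalization, independent sampling, and clipping) can be illustrated with a minimal sketch. This is not the paper's exact update rule, which this summary does not reproduce. The form of the scaling factor `h`, the parameters `lr`, `beta`, `clip_scale`, and `delta`, and the toy quartic objective are all assumptions chosen for illustration.

```python
import math
import random


def norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def ian_sgd_step(w, grad_fn, sample_fn, lr=0.1, beta=0.5,
                 clip_scale=1.0, delta=1e-3):
    """One illustrative step (assumed form, not the paper's exact rule).

    The update direction uses one sample; the normalization factor uses a
    second, independently drawn sample, so the two are decoupled.
    """
    xi = sample_fn()        # sample for the update direction
    xi_prime = sample_fn()  # independent sample, used only for scaling
    g = grad_fn(w, xi)
    g_prime = grad_fn(w, xi_prime)
    # Clipping-style factor: never below 1, so the step is never enlarged.
    h = max(1.0, clip_scale * (norm(g_prime) + delta))
    # Adaptive normalization: divide by h**beta rather than by h itself.
    scale = lr / (h ** beta)
    return [wi - scale * gi for wi, gi in zip(w, g)]


def run_toy_demo(steps=200, seed=0, noise_std=0.01):
    """Minimize the toy objective 0.25 * ||w||^4 with noisy gradients."""
    rng = random.Random(seed)

    def sample_fn():
        # Hypothetical noise model: additive Gaussian gradient noise.
        return [rng.gauss(0.0, noise_std) for _ in range(2)]

    def grad_fn(w, xi):
        sq = sum(x * x for x in w)  # gradient of 0.25*||w||^4 is ||w||^2 * w
        return [sq * wi + ni for wi, ni in zip(w, xi)]

    w = [3.0, 0.0]
    for _ in range(steps):
        w = ian_sgd_step(w, grad_fn, sample_fn)
    return norm(w)
```

Calling `run_toy_demo()` returns the distance of the final iterate from the minimizer at the origin. In this sketch, the factor `h` damps the step when the independent gradient estimate is large and leaves plain SGD in place when it is small, which is the clipping behavior the abstract refers to.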

Paper Structure

This paper contains 34 sections, 12 theorems, 95 equations, 9 figures, 1 table.

Key Result

Lemma 1

Under Assumption 1, the function $f$ satisfies, for any $w, w' \in \mathbf{R}^d$,

Figures (9)

  • Figure 2: Experimental Results on Phase Retrieval and DRO
  • Figure 3: Experimental Results on Training ResNet18 and ResNet50
  • Figure 4: Advantage of using adaptive normalization on normalized first-order algorithms
  • Figure 5: Effects of adaptive normalization on convergence of IAN-SGD
  • Figure 6: Effects of independent samples' batch size on convergence
  • ...and 4 more figures

Theorems & Definitions (25)

  • Lemma 1
  • Remark 1: $(L_0, L_1)$-generalized-smooth condition
  • Remark 2: Symmetric generalized-smooth condition
  • Theorem 1: Convergence of AN-GD
  • Remark 3
  • Lemma 2
  • Theorem 2: Convergence of IAN-SGD
  • Remark 4
  • Lemma 3: Descent inequality of IAN-SGD under generalized PŁ condition
  • Lemma 3
  • ...and 15 more