Gradient Descent with Provably Tuned Learning-rate Schedules

Dravyansh Sharma

TL;DR

This work develops a data-driven framework for provably tuning gradient-descent hyperparameters in non-convex and non-smooth settings, extending beyond convex/smooth assumptions to piecewise-polynomial and Pfaffian function classes (including common neural activations). By analyzing the dual cost as a function of the hyperparameters and bounding its pseudo-dimension with the Goldberg-Jerrum (GJ) framework, the authors derive finite-sample guarantees for learning step sizes, learning-rate schedules, and initialization scales, as well as for momentum-based variants. The results yield explicit (polylogarithmic to polynomial) sample-complexity bounds and extend to learning across multiple tasks and to optimizing the validation loss, so that data-driven hyperparameter tuning comes with theoretical guarantees. The framework broadens the scope of theoretically grounded hyperparameter tuning for neural networks and related gradient-based methods, with implications for pre-training and multi-task learning.

Abstract

Gradient-based iterative optimization methods are the workhorse of modern machine learning. They crucially rely on careful tuning of parameters like learning rate and momentum, yet these are typically set using heuristic approaches without formal near-optimality guarantees. Recent work by Gupta and Roughgarden studies how to learn a good step-size in gradient descent. However, like most of the literature with theoretical guarantees for gradient-based optimization, their results rely on strong assumptions on the function class, including convexity and smoothness, which do not hold in typical applications. In this work, we develop novel analytical tools for provably tuning hyperparameters in gradient-based algorithms that apply to non-convex and non-smooth functions. We match, up to logarithmic factors, the sample complexity bounds that prior work established for learning the step-size in gradient descent on smooth, convex functions, but for a much broader class of functions. Our analysis applies to gradient descent on neural networks with commonly used activation functions (including ReLU, sigmoid and tanh). We extend our framework to tuning multiple hyperparameters, including tuning the learning rate schedule, simultaneously tuning momentum and step-size, and pre-training the initialization vector. Our approach can be used to bound the sample complexity for minimizing both the validation loss and the number of gradient descent iterations.
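The data-driven setup described in the abstract can be illustrated with a toy sketch: draw problem instances from a task distribution, run gradient descent with each candidate step-size, and keep the step-size with the lowest average iteration count on the sample. Everything below is an assumption made for illustration and is not taken from the paper: the function names (`gd_iterations`, `sample_instances`, `tune_step_size`), the one-dimensional quadratic instances, the candidate grid, and the stopping tolerance. The quadratics are convex and smooth, so this only shows the tuning loop, not the non-convex, non-smooth setting the paper analyzes.

```python
import random

def gd_iterations(a, b, step, tol=1e-6, max_iters=500):
    """Count gradient-descent iterations on f(x) = 0.5*a*(x-b)^2, starting at x = 0,
    until |f'(x)| <= tol. Returns max_iters if the budget runs out."""
    x = 0.0
    for t in range(max_iters):
        grad = a * (x - b)
        if abs(grad) <= tol:
            return t
        x -= step * grad
    return max_iters  # did not converge within the budget

def sample_instances(n, seed=0):
    """Hypothetical task distribution: curvature a in [0.5, 2], minimizer b in [-1, 1]."""
    rng = random.Random(seed)
    return [(rng.uniform(0.5, 2.0), rng.uniform(-1.0, 1.0)) for _ in range(n)]

def tune_step_size(instances, candidate_steps):
    """Empirical risk minimization over a finite grid of step-sizes:
    pick the step with the smallest average iteration count on the sample."""
    avg_cost = {}
    for step in candidate_steps:
        total = sum(gd_iterations(a, b, step) for a, b in instances)
        avg_cost[step] = total / len(instances)
    best = min(avg_cost, key=avg_cost.get)
    return best, avg_cost

instances = sample_instances(20)
steps = [0.05, 0.1, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2]
best_step, avg_cost = tune_step_size(instances, steps)
print(best_step, avg_cost[best_step])
```

The sample-complexity question studied in the paper is how many sampled instances are needed before the step-size chosen this way is near-optimal on the underlying distribution; the grid search here is a simplification of choosing over a continuous range.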

Paper Structure

This paper contains 19 sections, 19 theorems, 9 equations, 1 table, 2 algorithms.

Key Result

Theorem 2.1

Suppose $\mathcal{F}$ is a class of real-valued functions with range in $[0, H]$ and finite $\mathrm{Pdim}(\mathcal{F})$. For every $\epsilon > 0$ and $\delta \in (0, 1)$, given any distribution $\mathcal{D}$ over $\mathcal{X}$, with probability $1-\delta$ over the draw of a sample $S\sim\mathcal{D}$ ...
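
Theorem 2.1 is stated in terms of $\mathrm{Pdim}(\mathcal{F})$. For reference, the standard textbook definition of pseudo-dimension is sketched below. The symbols $x_i$, $r_i$, and $b$ are notation chosen here, and the paper's own Definition 1 may differ in wording or in how ties are handled.

```latex
% A finite set is (pseudo-)shattered when suitable thresholds let the class
% realize every above/below pattern on it.
\[
\{x_1,\dots,x_m\} \subseteq \mathcal{X} \text{ is shattered by } \mathcal{F}
\iff
\exists\, r_1,\dots,r_m \in \mathbb{R}\ \ \forall\, b \in \{0,1\}^m\ \
\exists\, f \in \mathcal{F}:\quad
f(x_i) \ge r_i \iff b_i = 1 \quad (i = 1,\dots,m).
\]
% The pseudo-dimension is the size of the largest shattered set.
\[
\mathrm{Pdim}(\mathcal{F}) \;=\; \max\bigl\{\, m : \text{some } \{x_1,\dots,x_m\} \subseteq \mathcal{X} \text{ is shattered by } \mathcal{F} \,\bigr\}.
\]
```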

Theorems & Definitions (34)

  • Definition 1: Shattering and Pseudo-dimension (Anthony and Bartlett, 1999)
  • Theorem 2.1: $(\epsilon,\delta)$-uniform convergence sample complexity via pseudo-dimension (Anthony and Bartlett, 1999)
  • Lemma 2.2
  • Theorem 3.1
  • proof
  • Theorem 3.2
  • proof
  • Example 1
  • Definition 2: Pfaffian Chain (Khovanskii, 1991)
  • Definition 3: Pfaffian functions (Khovanskii, 1991)
  • ...and 24 more