Learning Algorithm Hyperparameters for Fast Parametric Convex Optimization

Rajiv Sambharya, Bartolomeo Stellato

TL;DR

This work targets accelerating parametric convex optimization by learning a fixed, shared sequence of hyperparameters for first-order methods through a two-phase LAH framework (step-varying then steady-state). The authors derive closed-form lookahead results for gradient descent and quadratic problems, develop progressive training, and provide generalization guarantees via validation-based risk bounds, all while maintaining convergence guarantees. LAH is demonstrated across gradient descent, proximal gradient descent, OSQP, and SCS on diverse tasks in control, signal processing, and machine learning, with remarkable data efficiency requiring only 10 training instances. The approach yields substantial speedups over baselines, preserves convergence, and offers quantifiable probabilistic guarantees on unseen data, suggesting practical impact for fast, reliable parametric optimization in time-constrained systems.

Abstract

We introduce a machine-learning framework to learn the hyperparameter sequence of first-order methods (e.g., the step sizes in gradient descent) to quickly solve parametric convex optimization problems. Our computational architecture amounts to running fixed-point iterations where the hyperparameters are the same across all parametric instances and consists of two phases. In the first step-varying phase the hyperparameters vary across iterations, while in the second steady-state phase the hyperparameters are constant across iterations. Our learned optimizer is flexible in that it can be evaluated on any number of iterations and is guaranteed to converge to an optimal solution. To train, we minimize the mean square error to a ground truth solution. In the case of gradient descent, the one-step optimal step size is the solution to a least squares problem, and in the case of unconstrained quadratic minimization, we can compute the two and three-step optimal solutions in closed-form. In other cases, we backpropagate through the algorithm steps to minimize the training objective after a given number of steps. We show how to learn hyperparameters for several popular algorithms: gradient descent, proximal gradient descent, and two ADMM-based solvers: OSQP and SCS. We use a sample convergence bound to obtain generalization guarantees for the performance of our learned algorithm for unseen data, providing both lower and upper bounds. We showcase the effectiveness of our method with many examples, including ones from control, signal processing, and machine learning. Remarkably, our approach is highly data-efficient in that we only use $10$ problem instances to train the hyperparameters in all of our examples.
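The two-phase structure described in the abstract can be sketched for gradient descent as follows. This is a minimal illustration, not the authors' implementation: the function name `run_lah_gd`, the toy quadratic, and the step sizes are all assumptions chosen by hand (nothing here is learned), and the steady-state step is simply set below $2/L$ so that the tail iterations contract.

```python
import numpy as np

def run_lah_gd(grad, z0, step_varying, steady_state, num_iters):
    """Gradient descent with a two-phase step size schedule (hypothetical helper)."""
    z = np.array(z0, dtype=float)
    H = len(step_varying)  # length of the step-varying phase
    for k in range(num_iters):
        # Step-varying phase for k < H, steady-state phase afterwards.
        theta = step_varying[k] if k < H else steady_state
        z = z - theta * grad(z)
    return z

# Toy instance (assumption): f(z) = 0.5 z^T P z - c^T z, with mu = 1 and L = 10.
P = np.diag([1.0, 10.0])
c = np.array([1.0, 1.0])
grad = lambda z: P @ z - c

# Hand-picked placeholder values, NOT learned hyperparameters.
step_varying = [0.3, 0.05, 0.3, 0.05]  # 0.3 exceeds 2/L = 0.2 on purpose
steady_state = 2.0 / 11.0              # 2 / (mu + L), below 2/L

z_final = run_lah_gd(grad, np.zeros(2), step_varying, steady_state, num_iters=100)
z_opt = np.linalg.solve(P, c)
print(np.linalg.norm(z_final - z_opt))
```

Because the schedule is shared across instances and the steady-state step is reused indefinitely, `num_iters` can be any value at evaluation time, including one larger than the horizon a schedule was trained on.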

Paper Structure

This paper contains 77 sections, 8 theorems, 63 equations, 11 figures, 8 tables.

Key Result

Theorem 1

The one-step optimal step size $\theta^k$ for the one-step gradient descent training problem (prob:gd_train_single) is non-negative for any possible values of $\{z^k_\theta(x_i)\}_{i=1}^N$ as long as there exists some $z^k_\theta(x_i)$ such that $\nabla f(z^k_\theta(x_i),x_i) \neq 0$.
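To make the statement concrete, here is a sketch under one assumption about the form of the training problem: that it minimizes the mean square distance to the ground-truth solutions after a single gradient step, $\sum_i \|z_i - \theta g_i - z^\star(x_i)\|_2^2$ with $g_i = \nabla f(z_i, x_i)$. This is a scalar least squares problem with solution $\theta = \sum_i g_i^T (z_i - z^\star(x_i)) / \sum_i \|g_i\|_2^2$. For convex $f$ with minimizer $z^\star(x_i)$, each term $g_i^T (z_i - z^\star(x_i))$ is non-negative, which is consistent with the theorem. The helper name and the ridge-style test problem below are illustrative assumptions.

```python
import numpy as np

def one_step_optimal_step_size(Z, G, Z_star):
    """Closed-form minimizer of sum_i ||z_i - theta * g_i - z_star_i||^2.

    Z, G, Z_star have shape (N, n): iterates, gradients, ground-truth solutions.
    """
    num = float(np.sum(G * (Z - Z_star)))
    den = float(np.sum(G * G))
    if den == 0.0:
        raise ValueError("all gradients are zero; the step size is not unique")
    return num / den

# Assumed parametric family: f(z, x) = 0.5 ||A z - x||^2 + 0.5 lam ||z||^2.
rng = np.random.default_rng(0)
N, m, n, lam = 10, 8, 5, 0.1
A = rng.standard_normal((m, n))
X = rng.standard_normal((N, m))            # one parameter x_i per row
H_mat = A.T @ A + lam * np.eye(n)          # Hessian, shared across instances
Z_star = np.linalg.solve(H_mat, A.T @ X.T).T

Z = np.zeros((N, n))                       # current iterates, z^0(x) = 0
G = Z @ H_mat - X @ A                      # row i is H z_i - A^T x_i

theta = one_step_optimal_step_size(Z, G, Z_star)
mse_before = np.mean(np.sum((Z - Z_star) ** 2, axis=1))
mse_after = np.mean(np.sum((Z - theta * G - Z_star) ** 2, axis=1))
print(theta, mse_before, mse_after)
```

The same value can be recovered by stacking all gradients into one column and calling a generic least squares solver, which is a useful cross-check when adapting the sketch.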

Figures (11)

  • Figure 1: LAH diagram. Running LAH amounts to running fixed-point iterations with the learned hyperparameters $\theta$ and consists of two phases. First, in the step-varying phase, we run $H$ fixed-point iterations, using a different hyperparameter set $\theta^k$ in the $k$-th iteration and initializing with $z^0(x) = 0$. Second, after the initial $H$ steps, in the steady-state phase, we run $K - H$ fixed-point iterations, each using the same hyperparameter set $\theta^H$. At evaluation time, we are free to run any number of iterations, not just the number of steps trained on.
  • Figure 2: Ridge regression results. The conjugate gradient method performs best of all the methods. Our learned step sizes significantly outperform both Nesterov's method and the silver step size rule. Only $10$ training instances are used for each data-driven method; because of this, the data-driven initialization methods (nearest neighbor and L2WS) do not perform well. Considering more steps at a time in the progressive training improves the performance of LAH. Recall that the $1$-, $2$-, and $3$-step lookahead problems can be solved optimally, whereas we use gradient-based methods for the $10$-step lookahead problems.
  • Figure 3: Step sizes in gradient descent to solve the ridge regression problem. First on the left: silver step size schedule. Four on the right: our learned step sizes. For the first $50$ steps we learn varying step sizes (black), and for the rest we learn a constant step size (gray). In pink, we show $2 / L$, the maximum constant step size that guarantees convergence. We observe that our learned step sizes have many short steps and several long ones, similar to the silver step size schedule.
  • Figure 4: Logistic regression results. Progressive training $10$ steps at a time reaches a geometric average suboptimality below $10^{-5}$ within the step-varying phase (the first $100$ steps). In this case, we only provide the quantile bounds for progressive training $10$ steps at a time, and the bounds are wide.
  • Figure 5: Logistic regression step sizes. Many of the step sizes learned with LAH in the step-varying phase are orders of magnitude larger than the maximum constant step size that guarantees convergence ($2 / L$).
  • ...and 6 more figures
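
The captions for Figures 2, 3, and 5 compare learned step sizes against the silver step size schedule and the threshold $2/L$. As a point of reference, the sketch below generates a silver schedule, assuming the form for smooth convex problems due to Altschuler and Parrilo: the $t$-th step is $(1 + \rho^{\nu(t) - 1}) / L$, where $\rho = 1 + \sqrt{2}$ and $\nu(t)$ is the largest power of $2$ dividing $t$. Whether the paper uses exactly this variant is not stated in the captions, so treat the function names and the normalization by $L$ as assumptions.

```python
import math

def two_adic_valuation(t):
    """Largest v such that 2**v divides the positive integer t."""
    v = 0
    while t % 2 == 0:
        t //= 2
        v += 1
    return v

def silver_step_sizes(num_steps, L):
    """Silver schedule for an L-smooth convex objective (assumed form)."""
    rho = 1.0 + math.sqrt(2.0)  # the silver ratio
    return [
        (1.0 + rho ** (two_adic_valuation(t) - 1)) / L
        for t in range(1, num_steps + 1)
    ]

L = 10.0
steps = silver_step_sizes(7, L)
threshold = 2.0 / L
num_long = sum(s > threshold for s in steps)
print(steps, num_long)
```

Most entries are short steps of length $\sqrt{2}/L$, interleaved with occasional steps longer than $2/L$, which is the short-and-long pattern the Figure 3 caption refers to.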

Theorems & Definitions (8)

  • Theorem 1
  • Theorem 2
  • Theorem 3: Convergence rate for quadratic minimization [Young1953]
  • Theorem 4
  • Theorem 5
  • Theorem 6
  • Theorem 7
  • Theorem 8: Sample convergence bound [langford_union_prior]