ODE-based Learning to Optimize

Zhonglin Xie; Wotao Yin; Zaiwen Wen

TL;DR

The paper addresses translating continuous-time acceleration dynamics into robust discrete-time optimization by introducing the ISHD ODE and analyzing its explicit Euler discretization under convergence and stability conditions. It then pairs this with a learning-to-optimize (L2O) framework (StoPM) to learn ISHD coefficients by minimizing the stopping time, all while ensuring convergence via conservative gradients. Theoretical results establish continuous-time convergence and discrete-time stability, and the approach is validated through extensive experiments on logistic regression and $\ell_p^p$ minimization, showing that learned coefficients can outperform classical baselines like NAG and IGAHD. Overall, the work provides a principled bridge between ODE-based acceleration and learned optimizers, with practical algorithms and strong theoretical guarantees for convergence and stability.

Abstract

Recent years have seen a growing interest in understanding acceleration methods through the lens of ordinary differential equations (ODEs). Despite the theoretical advancements, translating the rapid convergence observed in continuous-time models to discrete-time iterative methods poses significant challenges. In this paper, we present a comprehensive framework integrating the inertial systems with Hessian-driven damping equation (ISHD) and learning-based approaches for developing optimization methods through a deep synergy of theoretical insights. We first establish the convergence condition for ensuring the convergence of the solution trajectory of ISHD. Then, we show that provided the stability condition, another relaxed requirement on the coefficients of ISHD, the sequence generated through the explicit Euler discretization of ISHD converges, which gives a large family of practical optimization methods. In order to select the best optimization method in this family for certain problems, we introduce the stopping time, the time required for an optimization method derived from ISHD to achieve a predefined level of suboptimality. Then, we formulate a novel learning to optimize (L2O) problem aimed at minimizing the stopping time subject to the convergence and stability condition. To navigate this learning problem, we present an algorithm combining stochastic optimization and the penalty method (StoPM). The convergence of StoPM using the conservative gradient is proved. Empirical validation of our framework is conducted through extensive numerical experiments across a diverse set of optimization problems. These experiments showcase the superior performance of the learned optimization methods.
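The pipeline described in the abstract (discretize ISHD with explicit Euler, then measure the stopping time) can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes the commonly used form of the inertial system with Hessian-driven damping, $\ddot{x}(t) + \frac{\alpha}{t}\dot{x}(t) + \beta(t)\nabla^2 f(x(t))\dot{x}(t) + \gamma(t)\nabla f(x(t)) = 0$, which this summary does not state explicitly. It also assumes that suboptimality is measured by the gradient norm. The function name `euler_ishd_stopping_time`, the quadratic test problem, and the constant coefficients are hypothetical choices made for illustration; they are hand-picked, not learned.

```python
import numpy as np

def euler_ishd_stopping_time(grad, hess, x0, alpha, beta, gamma,
                             h=0.05, t0=1.0, eps=1e-3, max_steps=20000):
    """Explicit Euler on the first-order form of the (assumed) ISHD system:
         x' = v
         v' = -(alpha / t) v - beta(t) H(x) v - gamma(t) grad f(x)
    Returns (T, k, x): T is the first time t at which ||grad f(x)|| <= eps
    (None if not reached within max_steps), k is the step count, and x is
    the last iterate.
    """
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)   # start at rest (an illustrative assumption)
    t = t0                 # t0 > 0 avoids the singularity of alpha / t
    for k in range(max_steps + 1):
        g = grad(x)
        if np.linalg.norm(g) <= eps:
            return t, k, x
        # Right-hand side of the velocity equation at the current state.
        a = -(alpha / t) * v - beta(t) * (hess(x) @ v) - gamma(t) * g
        # Explicit Euler: both updates use the old (x, v).
        x, v = x + h * v, v + h * a
        t += h
    return None, max_steps, x

# Hypothetical test problem: f(x) = 0.5 * x^T A x with a diagonal A.
A = np.diag([1.0, 10.0])
grad = lambda x: A @ x
hess = lambda x: A

# Hand-picked constant coefficients (assumptions, not learned values).
T, steps, x_final = euler_ishd_stopping_time(
    grad, hess, x0=[1.0, 1.0], alpha=3.0,
    beta=lambda t: 0.5, gamma=lambda t: 1.0)
print("stopping time:", T, "steps:", steps)
```

In the learning-to-optimize setting described above, `beta` and `gamma` would be parameterized and trained so that the stopping time is small on a distribution of problems, subject to the convergence and stability conditions. The sketch only evaluates a single fixed choice.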

Paper Structure (34 sections, 26 theorems, 156 equations, 7 figures, 5 tables, 2 algorithms)

Introduction
Related works
ODE viewpoint of optimization methods
Learning to optimize
Our contributions
Organization
Conditions that ensure a stable discretization of ISHD
Preliminaries
A condition that ensures the trajectory of ISHD converges
A condition that ensures the stability of the explicit Euler discretization
Selecting the best coefficients of ISHD using L2O
The problem formulation of L2O
Solving the L2O problem using penalty method and stochastic optimization
Deriving the conservative gradient of the penalty function
The conservative Jacobian of the flow $X(t,\theta,f)$
...and 19 more sections

Key Result

theorem 1

Suppose that Assumption assump:differentiable and the following conditions hold: [...] Then the solution trajectory $x(t)$ of eq:ISHD is bounded, and the following inequalities can be derived: [...]

Figures (7)

Figure 1: Our learning and testing framework.
Figure 2: Numerical verification of the $(L_0,L_1)$-smoothness.
Figure 3: The training process in different tasks.
Figure 4: Different indicators of $\ell_{p}^{p}$ minimization problem on a5a dataset.
Figure 5: Comparison on logistic regression.
...and 2 more figures

Theorems & Definitions (63)

theorem 1
theorem 2: Convergence rate
remark
remark
remark
definition: Stopping Time
definition: Induced Probability Space
definition: Conservative Jacobian
definition: Path differentiability
theorem 3: Path differentiability of ODE flows
...and 53 more
