Accelerated First-Order Optimization under Nonlinear Constraints

Michael Muehlebach; Michael I. Jordan

TL;DR

The analogies between first-order algorithms for constrained optimization and non-smooth dynamical systems are exploited to design a new class of accelerated first-order algorithms for constrained optimization that avoid optimization over the entire feasible set at each iteration.

Abstract

We exploit analogies between first-order algorithms for constrained optimization and non-smooth dynamical systems to design a new class of accelerated first-order algorithms for constrained optimization. Unlike Frank-Wolfe or projected gradients, these algorithms avoid optimization over the entire feasible set at each iteration. We prove convergence to stationary points even in a nonconvex setting and we derive accelerated rates for the convex setting, both in continuous time and in discrete time. An important property of these algorithms is that constraints are expressed in terms of velocities instead of positions, which naturally leads to sparse, local and convex approximations of the feasible set (even if the feasible set is nonconvex). Thus, the complexity tends to grow mildly in the number of decision variables and in the number of constraints, which makes the algorithms suitable for machine learning applications. We apply our algorithms to a compressed sensing and a sparse regression problem, showing that we can treat nonconvex $\ell^p$ constraints ($p<1$) efficiently, while recovering state-of-the-art performance for $p=1$.
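To make the phrase "constraints are expressed in terms of velocities instead of positions" concrete, here is a minimal sketch under stated assumptions. It is not the accelerated algorithm of the paper: it has no momentum, uses a single constraint, and all names (`velocity_constrained_step`, `restitution`, `T`) as well as the toy problem are hypothetical choices for illustration. The assumed convention is that the feasible set is $\{x : g(x) \geq 0\}$ and that, whenever $g(x) \leq 0$, the velocity $v$ must satisfy the linear condition $\nabla g(x)^\top v + \text{restitution} \cdot g(x) \geq 0$. With one constraint this is a half-space in $v$, so the nearest admissible velocity to $-\nabla f(x)$ has a closed form and no projection onto the feasible set itself is needed.

```python
# Toy problem (assumed for illustration): minimize 0.5*||x - c||^2
# subject to g(x) = 1 - ||x||^2 >= 0, i.e. x in the unit disk.

def grad_f(x, c):
    # Gradient of the objective 0.5*||x - c||^2.
    return [x[0] - c[0], x[1] - c[1]]

def g(x):
    # Constraint function; feasible points satisfy g(x) >= 0.
    return 1.0 - x[0] ** 2 - x[1] ** 2

def grad_g(x):
    return [-2.0 * x[0], -2.0 * x[1]]

def velocity_constrained_step(x, c, T=0.1, restitution=2.0):
    # Desired velocity: plain negative gradient.
    v = [-d for d in grad_f(x, c)]
    gx = g(x)
    if gx <= 0.0:
        # Constraint is active or violated: require
        # grad_g(x) . v + restitution * g(x) >= 0 (a half-space in v).
        n = grad_g(x)
        slack = n[0] * v[0] + n[1] * v[1] + restitution * gx
        if slack < 0.0:
            # Closed-form Euclidean projection of v onto the half-space.
            scale = -slack / (n[0] ** 2 + n[1] ** 2)
            v = [v[0] + scale * n[0], v[1] + scale * n[1]]
    # Explicit Euler step in position with step size T.
    return [x[0] + T * v[0], x[1] + T * v[1]]

def solve(x0, c, steps=2000):
    x = list(x0)
    for _ in range(steps):
        x = velocity_constrained_step(x, c)
    return x

x_final = solve([0.0, 0.5], [2.0, 0.0])
print(x_final, g(x_final))
```

In this sketch the iterates may leave the disk slightly; the velocity condition then drives the violation back toward zero instead of enforcing feasibility at every step. The constrained minimizer of the toy problem is $(1, 0)$.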

Paper Structure (23 sections, 12 theorems, 147 equations, 6 figures, 2 tables, 4 algorithms)

Introduction
Velocity Constraints
Accelerated Gradient Flow
Discretization of \ref{eq:mde1}:
Smooth motion:
Non-smooth motion:
Equilibria of \ref{eq:mde1}:
Convergence Analysis
i) Heavy ball ($\alpha=\sqrt{\mu}$):
ii) Nesterov - constant parameters ($\alpha=\sqrt{\mu}-\mu/2$):
iii) Nesterov - varying parameters ($\alpha(t)=2/(t+3)$):
Numerical Examples
Illustrative example
Nonconvex compressed sensing and image reconstruction
Nonconvex compressed sensing example:
...and 8 more sections

Key Result

Theorem 1

Let $(x(t), u(t))$ be a trajectory satisfying \ref{eq:mde1} with $x(0)\in C$. Let $f$ be $1$-smooth, let $g$ satisfy the Mangasarian-Fromovitz constraint qualification, and let either $f$ be convex or $2\delta - \beta > 0$. Then, $x(t)$ converges to the set of stationary points, while $u(t)$ converges t

Figures (6)

Figure 1: The left panel shows the normal cone inclusion $\gamma_i^+ +\epsilon \gamma_i^-\in N_{\mathbb{R}_{\leq 0}}(-\mathrm{d}\lambda_i)$, which is equivalent to the complementarity condition $\mathrm{d}\lambda_i\geq 0$, $\gamma_i^+ + \epsilon \gamma_i^- \geq 0$, $\mathrm{d}\lambda_i (\gamma_i^+ + \epsilon \gamma_i^-)=0$. The right panel shows the approximation $(x)^p_\Delta$ of $x^p$ for $\Delta=0.01$ and $p=0.6$. The approximation agrees closely with $x^p$ even though $\Delta$ is comparatively large. In the numerical experiments, see Sec. \ref{Sec:NumEx}, $\Delta$ is set to $10^{-6}$.
Figure 2: The first panel shows trajectories resulting from \ref{eq:mde1} (with parameters $\alpha=0.5, \delta=0.1, \beta=0, \epsilon=0$). The boundaries of $\mathcal{R}_1$ and $\mathcal{R}_2$ are highlighted in red. The second panel shows the results from the discretization \ref{eq:dis1} with $T_k=T=0.1$, while the third panel shows the results from the discretization \ref{eq:disMod} with $T_k=T=0.1$. An important difference between \ref{eq:dis1} and \ref{eq:disMod} lies in the fact that only violated constraints are considered in \ref{eq:dis1}, whereas \ref{eq:disMod} includes all constraints. This is indicated by the red lines, which denote $\mathcal{R}_1$, $\mathcal{R}_2$ in the second panel and $\gamma_1(x,u)=0$, $\gamma_2(x,u)=0$ in the third panel.
Figure 3: The left panel shows the solution vector of the compressed sensing problem with $\ell^1$ and $\ell^{0.8}$ regularization. The right panel shows the evolution of the objective function for the different methods. We note that Alg. \ref{Alg:ImageDenoising}, Alg. \ref{Alg:ImageDenoising2}, and accelerated projected gradients converge at a similar rate, which is much faster than gradient descent. We applied the following settings for Alg. \ref{Alg:ImageDenoising} and Alg. \ref{Alg:ImageDenoising2}: $\alpha_k=2/(k+3)$, $\delta_k=3/(2(k+3))$, $\beta_k=T(1-2\delta_k T)$ (see Tab. \ref{Tab:params}) with $T=1.8$ and $T=2$, respectively. Accelerated gradient descent corresponds to the algorithm from NesterovIntro. The corresponding trajectories for Alg. \ref{Alg:ImageDenoising} and Alg. \ref{Alg:ImageDenoising2} for $p<1$ are similar to $p=1$ and are shown in Fig. \ref{Fig:SimExCS2}.
Figure 4: The figure shows the trajectories of Alg. \ref{Alg:ImageDenoising} and Alg. \ref{Alg:ImageDenoising2} applied to the compressed sensing problem with $\ell^{0.8}$ regularization. The left panel shows the evolution of the objective function for the different methods, whereas the right panel shows the value of the constraint violation. We applied the following settings for Alg. \ref{Alg:ImageDenoising} and Alg. \ref{Alg:ImageDenoising2}: $\alpha_k=2/(k+3)$, $\delta_k=3/(2(k+3))$, $\beta_k=T(1-2\delta_k T)$ (see Tab. \ref{Tab:params}) with $T=1$ and $\Delta=10^{-3}$.
Figure 5: The figure on the left shows the decrease in the objective function as a function of the iterations for the different algorithms. We note that Alg. \ref{Alg:ImageDenoising}, Alg. \ref{Alg:ImageDenoising2}, and accelerated projected gradients converge at a similar rate, which is substantially faster than gradient descent. We applied the following settings for Alg. \ref{Alg:ImageDenoising} and Alg. \ref{Alg:ImageDenoising2}: $\alpha_k=2/(k+3)$, $\delta_k=3/(2(k+3))$, $\beta_k=T(1-2\delta_k T)$ (see Tab. \ref{Tab:params}) with $T=1$. Accelerated gradient descent corresponds to the algorithm by NesterovIntro. The figure on the right shows how constraint violations decrease as a function of the number of iterations. The black dashed line indicates a rate of $\mathcal{O}(1/k^2)$ as a reference. The corresponding trajectories of Alg. \ref{Alg:ImageDenoising} and Alg. \ref{Alg:ImageDenoising2} for $p<1$ are similar to $p=1$.
...and 1 more figure
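The captions of Figures 3 to 5 quote the parameter schedule $\alpha_k=2/(k+3)$, $\delta_k=3/(2(k+3))$, $\beta_k=T(1-2\delta_k T)$. The short sketch below only evaluates these three formulas for a given iteration index and step size; the function name `schedule` is hypothetical, and how the parameters enter the algorithms is not shown in this excerpt.

```python
def schedule(k, T):
    """Evaluate the parameter formulas quoted in the figure captions.

    k is the iteration index (k = 0, 1, 2, ...) and T the step size.
    Returns the tuple (alpha_k, delta_k, beta_k).
    """
    alpha_k = 2.0 / (k + 3)
    delta_k = 3.0 / (2.0 * (k + 3))
    beta_k = T * (1.0 - 2.0 * delta_k * T)
    return alpha_k, delta_k, beta_k

# Tabulate the first few values for the step size T = 1 used in Figs. 4 and 5.
for k in range(5):
    alpha_k, delta_k, beta_k = schedule(k, 1.0)
    print(k, round(alpha_k, 4), round(delta_k, 4), round(beta_k, 4))
```

Both $\alpha_k$ and $\delta_k$ decay like $1/k$, so $\beta_k$ approaches $T$ as $k$ grows.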

Theorems & Definitions (25)

Remark 1
Definition 1
Theorem 1
proof
Theorem 2
proof
Theorem 3
proof
Theorem 4
proof
...and 15 more
