
Accelerated First-Order Optimization under Nonlinear Constraints

Michael Muehlebach, Michael I. Jordan

TL;DR

This work exploits analogies between first-order algorithms for constrained optimization and non-smooth dynamical systems to design a new class of accelerated first-order algorithms that avoid optimizing over the entire feasible set at each iteration.

Abstract

We exploit analogies between first-order algorithms for constrained optimization and non-smooth dynamical systems to design a new class of accelerated first-order algorithms for constrained optimization. Unlike Frank-Wolfe or projected gradients, these algorithms avoid optimization over the entire feasible set at each iteration. We prove convergence to stationary points even in a nonconvex setting, and we derive accelerated rates for the convex setting in both continuous and discrete time. An important property of these algorithms is that constraints are expressed in terms of velocities instead of positions, which naturally leads to sparse, local, and convex approximations of the feasible set (even if the feasible set is nonconvex). Thus, the complexity tends to grow mildly in the number of decision variables and in the number of constraints, which makes the algorithms suitable for machine learning applications. We apply our algorithms to a compressed sensing problem and a sparse regression problem, showing that we can treat nonconvex $\ell^p$ constraints ($p<1$) efficiently, while recovering state-of-the-art performance for $p=1$.
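
The idea of constraining velocities instead of positions can be illustrated with a small, non-accelerated sketch. It is not one of the paper's algorithms. It assumes a feasible set of the form $\{x : g(x)\geq 0\}$ with a single constraint, and it replaces the position constraint by the linear velocity constraint $\nabla g(x)^\top v + \alpha g(x) \geq 0$, imposed only when $g(x)\leq 0$. The function name `velocity_constrained_step`, the toy problem, and the values of `alpha` and `T` are all hypothetical choices made for illustration.

```python
import numpy as np

def velocity_constrained_step(x, grad_f, g, grad_g, alpha, T):
    """One explicit step x + T*v (illustrative sketch, not the paper's method).

    v is the point closest to -grad_f(x) in the half-space
    {v : grad_g(x) @ v + alpha * g(x) >= 0}. The half-space is imposed
    only when g(x) <= 0, i.e. when the constraint is active or violated.
    """
    v = -grad_f(x)
    gx = g(x)
    if gx <= 0.0:
        a = grad_g(x)
        slack = a @ v + alpha * gx
        if slack < 0.0:
            # Closed-form projection onto a single half-space.
            v = v - (slack / (a @ a)) * a
    return x + T * v

# Toy problem (assumed): minimize 0.5*|x - c|^2 over the unit disc.
c = np.array([2.0, 0.0])
grad_f = lambda x: x - c
g = lambda x: 1.0 - x @ x       # feasible set {x : g(x) >= 0}
grad_g = lambda x: -2.0 * x

x = np.array([0.0, 0.5])
for _ in range(500):
    x = velocity_constrained_step(x, grad_f, g, grad_g, alpha=1.0, T=0.1)
print(x)  # approaches the constrained minimizer (1, 0)
```

Each step only projects onto a half-space built from the local linearization of $g$, which is convex even when the feasible set is not. No projection onto the feasible set itself is computed.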

Paper Structure

This paper contains 23 sections, 12 theorems, 147 equations, 6 figures, 2 tables, 4 algorithms.

Key Result

Theorem 1

Let $(x(t), u(t))$ be a trajectory satisfying \ref{eq:mde1} with $x(0)\in C$. Let $f$ be $1$-smooth, let $g$ satisfy the Mangasarian-Fromovitz constraint qualification, and let either $f$ be convex or $2\delta - \beta > 0$. Then, $x(t)$ converges to the set of stationary points, while $u(t)$ converges t

Figures (6)

  • Figure 1: The left panel shows the normal cone inclusion $\gamma_i^+ +\epsilon \gamma_i^-\in N_{\mathbb{R}_{\leq 0}}(-\mathrm{d}\lambda_i)$, which is equivalent to the complementarity condition $\mathrm{d}\lambda_i\geq 0$, $\gamma_i^+ + \epsilon \gamma_i^- \geq 0$, $\mathrm{d}\lambda_i (\gamma_i^+ + \epsilon \gamma_i^-)=0$. The right panel shows the approximation $(x)^p_\Delta$ of $x^p$ for $\Delta=0.01$ and $p=0.6$. There is excellent agreement between the approximation and $x^p$ even though $\Delta$ is comparatively large. In the numerical experiments (see Sec. \ref{Sec:NumEx}), $\Delta$ is set to $10^{-6}$.
  • Figure 2: The first panel shows trajectories resulting from \ref{eq:mde1} (with parameters $\alpha=0.5, \delta=0.1, \beta=0, \epsilon=0$). The boundaries of $\mathcal{R}_1$ and $\mathcal{R}_2$ are highlighted in red. The second panel shows the results from the discretization \ref{eq:dis1} with $T_k=T=0.1$, while the third panel shows the results from the discretization \ref{eq:disMod} with $T_k=T=0.1$. An important difference between \ref{eq:dis1} and \ref{eq:disMod} is that \ref{eq:dis1} considers only violated constraints, whereas \ref{eq:disMod} includes all constraints. This is indicated by the red lines, which denote $\mathcal{R}_1$, $\mathcal{R}_2$ in the second panel and $\gamma_1(x,u)=0$, $\gamma_2(x,u)=0$ in the third panel.
  • Figure 3: The left panel shows the solution vector of the compressed sensing problem with $\ell^1$ and $\ell^{0.8}$ regularization. The right panel shows the evolution of the objective function for the different methods. We note that Alg. \ref{Alg:ImageDenoising}, Alg. \ref{Alg:ImageDenoising2}, and accelerated projected gradients converge at a similar rate, which is much faster than gradient descent. We applied the following settings for Alg. \ref{Alg:ImageDenoising} and Alg. \ref{Alg:ImageDenoising2}: $\alpha_k=2/(k+3)$, $\delta_k=3/(2(k+3))$, $\beta_k=T(1-2\delta_k T)$ (see Tab. \ref{Tab:params}) with $T=1.8$ and $T=2$, respectively. Accelerated gradient descent corresponds to the algorithm from \cite{NesterovIntro}. The corresponding trajectories for Alg. \ref{Alg:ImageDenoising} and Alg. \ref{Alg:ImageDenoising2} for $p<1$ are similar to $p=1$ and are shown in Fig. \ref{Fig:SimExCS2}.
  • Figure 4: The figure shows the trajectories of Alg. \ref{Alg:ImageDenoising} and Alg. \ref{Alg:ImageDenoising2} applied to the compressed sensing problem with $\ell^{0.8}$ regularization. The left panel shows the evolution of the objective function for the different methods, whereas the right panel shows the value of the constraint violation. We applied the following settings for Alg. \ref{Alg:ImageDenoising} and Alg. \ref{Alg:ImageDenoising2}: $\alpha_k=2/(k+3)$, $\delta_k=3/(2(k+3))$, $\beta_k=T(1-2\delta_k T)$ (see Tab. \ref{Tab:params}) with $T=1$ and $\Delta=10^{-3}$.
  • Figure 5: The figure on the left shows the decrease in the objective function as a function of the iterations for the different algorithms. We note that Alg. \ref{Alg:ImageDenoising}, Alg. \ref{Alg:ImageDenoising2}, and accelerated projected gradients converge at a similar rate, which is substantially faster than gradient descent. We applied the following settings for Alg. \ref{Alg:ImageDenoising} and Alg. \ref{Alg:ImageDenoising2}: $\alpha_k=2/(k+3)$, $\delta_k=3/(2(k+3))$, $\beta_k=T(1-2\delta_k T)$ (see Tab. \ref{Tab:params}) with $T=1$. Accelerated gradient descent corresponds to the algorithm from \cite{NesterovIntro}. The figure on the right shows how constraint violations decrease as a function of the number of iterations. The black dashed line indicates a rate of $\mathcal{O}(1/k^2)$ as a reference. The corresponding trajectories of Alg. \ref{Alg:ImageDenoising} and Alg. \ref{Alg:ImageDenoising2} for $p<1$ are similar to $p=1$.
  • ...and 1 more figure
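
The parameter settings quoted in the captions of Figures 3 to 5 can be evaluated with a short helper. This is only a sketch of the stated formulas $\alpha_k=2/(k+3)$, $\delta_k=3/(2(k+3))$, $\beta_k=T(1-2\delta_k T)$. The function name `schedule` is hypothetical, and how the three parameters enter the algorithms is defined in the paper itself, not here.

```python
def schedule(k, T):
    """Evaluate the iteration-dependent parameters quoted in the captions.

    k is the iteration index and T the step size; the name `schedule`
    is an illustrative choice, not taken from the paper.
    """
    alpha_k = 2.0 / (k + 3)
    delta_k = 3.0 / (2.0 * (k + 3))
    beta_k = T * (1.0 - 2.0 * delta_k * T)
    return alpha_k, delta_k, beta_k

# Print the first few values for the step size T = 1 used in Figures 4 and 5.
for k in range(4):
    print(k, schedule(k, T=1.0))
```

Since $\delta_k \to 0$ as $k$ grows, $\beta_k$ tends to $T$ under these formulas.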

Theorems & Definitions (25)

  • Remark 1
  • Definition 1
  • Theorem 1
  • proof
  • Theorem 2
  • proof
  • Theorem 3
  • proof
  • Theorem 4
  • proof
  • ...and 15 more