Cubic regularized subspace Newton for non-convex optimization

Jim Zhao, Aurelien Lucchi, Nikita Doikov

TL;DR

This work addresses the challenge of optimizing non-convex functions in high dimensions by introducing SSCN, a stochastic subspace cubic Newton method that applies cubic regularization to a randomly selected coordinate subset. By combining second-order information projected onto a subspace with cubic regularization and a flexible sampling strategy, SSCN achieves global convergence to stationary points and interpolates between coordinate descent and full cubic Newton as the subspace size $\tau$ grows. The authors establish convergence guarantees for arbitrary $\tau$, derive enhanced rates with exact Hessian information, and present an adaptive sampling scheme that adjusts $\tau$ dynamically to attain a second-order stationary point at a rate of $\mathcal{O}(\varepsilon^{-3/2}, \varepsilon^{-3})$, while demonstrating substantial empirical speed-ups over first-order methods on standard datasets. These results enable efficient, scalable second-order optimization in over-parameterized machine learning settings where full Hessian computations are prohibitive.

Abstract

This paper addresses the problem of minimizing non-convex continuous functions, which is relevant to high-dimensional machine learning applications characterized by over-parametrization. We analyze a randomized coordinate second-order method named SSCN, which can be interpreted as applying cubic regularization in random subspaces. This approach reduces the computational cost of using second-order information, making it applicable in higher-dimensional scenarios. Theoretically, we establish convergence guarantees for non-convex functions, with interpolating rates for arbitrary subspace sizes and allowing inexact curvature estimation. As the subspace size increases, our complexity matches the $\mathcal{O}(\varepsilon^{-3/2})$ rate of cubic regularization (CR). Additionally, we propose an adaptive sampling scheme ensuring an exact convergence rate of $\mathcal{O}(\varepsilon^{-3/2}, \varepsilon^{-3})$ to a second-order stationary point, even without sampling all coordinates. Experimental results demonstrate substantial speed-ups achieved by SSCN compared to conventional first-order methods.
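
The idea of applying cubic regularization in a random coordinate subspace can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes uniform sampling of $\tau$ coordinates without replacement, a fixed regularization constant `M` (taken here to upper-bound the Lipschitz constant of the Hessian on the region visited), and a simple bisection solver for the cubic subproblem that ignores the degenerate "hard case". The names `solve_cubic_subproblem`, `sscn_step`, `grad_fn` and `hess_fn` are hypothetical, and the separable double-well objective is a toy example chosen only for illustration.

```python
import numpy as np


def solve_cubic_subproblem(g, H, M, n_bisect=100):
    """Approximately minimize g@h + 0.5*h@H@h + (M/6)*||h||^3 over h.

    Uses the characterization h(r) = -(H + (M*r/2) I)^{-1} g with r = ||h(r)||
    and finds r by bisection. The degenerate "hard case" is not handled.
    """
    lam, V = np.linalg.eigh(H)  # H = V diag(lam) V^T, eigenvalues ascending
    g_rot = V.T @ g

    def step_norm_gap(r):
        # ||h(r)|| - r, which is decreasing in r on the admissible range
        return np.linalg.norm(g_rot / (lam + 0.5 * M * r)) - r

    # Smallest r keeping H + (M*r/2) I positive definite, plus a safety margin.
    r_lo = max(0.0, -2.0 * lam[0] / M) * (1.0 + 1e-8) + 1e-12
    r_hi = max(2.0 * r_lo, 1.0)
    while step_norm_gap(r_hi) > 0.0:
        r_hi *= 2.0
    for _ in range(n_bisect):
        r_mid = 0.5 * (r_lo + r_hi)
        if step_norm_gap(r_mid) > 0.0:
            r_lo = r_mid
        else:
            r_hi = r_mid
    r = 0.5 * (r_lo + r_hi)
    return -V @ (g_rot / (lam + 0.5 * M * r))


def sscn_step(x, grad_fn, hess_fn, tau, M, rng):
    """One subspace cubic Newton step on tau uniformly sampled coordinates."""
    S = rng.choice(x.size, size=tau, replace=False)
    # For brevity the full gradient and Hessian are formed and then sliced;
    # a practical version would evaluate only the entries indexed by S.
    g_S = grad_fn(x)[S]
    H_S = hess_fn(x)[np.ix_(S, S)]
    h = solve_cubic_subproblem(g_S, H_S, M)
    x_new = x.copy()
    x_new[S] += h
    return x_new


# Toy non-convex objective (assumption): separable double well.
def f(x):
    return np.sum(0.25 * x**4 - 0.5 * x**2)


def grad(x):
    return x**3 - x


def hess(x):
    return np.diag(3.0 * x**2 - 1.0)


rng = np.random.default_rng(0)
x = np.full(20, 0.5)
f_start = f(x)
for _ in range(200):
    # M = 12 is picked for this toy function only.
    x = sscn_step(x, grad, hess, tau=5, M=12.0, rng=rng)
print(f_start, f(x), np.linalg.norm(grad(x)))
```

Setting `tau` equal to the dimension recovers a full cubic Newton step in this sketch, while `tau=1` updates a single coordinate using its second derivative, which mirrors the interpolation described in the abstract.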

Paper Structure

This paper contains 32 sections, 17 theorems, 136 equations, 10 figures, 2 tables, 1 algorithm.

Key Result

Theorem 4.2

Let the sequence $\{{\bf x}_i\}$ be generated by Algorithm 1 (SSCN) with arbitrary $\bm{Q}_k$ satisfying condition SigmaDef, and any fixed $\tau \equiv \tau(S_k) \in [n]$. Let the regularization parameter at iteration $k \geq 0$ be chosen as [...]. For a given accuracy level $\varepsilon > 0$, assume that $\| \nabla f({\bf x}_i) \| \geq \varepsilon$ for all $0 \leq i \leq K$. Then it holds that [...]

Figures (10)

  • Figure 1: Comparison of CD, SSCN and RS-RNM \cite{fuji2022randomized} for different constant coordinate schedules, where $\tau$ denotes the dimension of the subspace in the SSCN method. Performance is measured w.r.t. iterations (first column) and time (second column), averaged over three runs, for logistic regression with non-convex regularization with $\lambda = 0.1$ on the datasets gisette (first row) and madelon (second row). Experiment details are described in Section \ref{sec:experiments}, and additional plots for the duke dataset can be found in Figure \ref{fig:logistic_regression_nonconv_CD_vs_SSCN_duke} in Appendix \ref{sec:additional_experiments}.
  • Figure 2: Squared norm of the step ${\bf h}_k$ for different constant coordinate schedules for logistic regression with non-convex regularization with $\lambda = 0.1$ for two different datasets (left: gisette, right: madelon). For the same plot for the duke dataset, see Figure \ref{fig:logistic_regression_norm_h_k2_duke}. Note that the y-axis is plotted in log-scale.
  • Figure 3: Convergence of different constant coordinate schedules measured w.r.t. iterations (first column), time (second column) and # (Coordinates$^2$ + Coordinates) evaluated (third column), averaged over three runs, for logistic regression with non-convex regularization with $\lambda = 0.1$ for the gisette dataset. Similar plots for the duke and madelon datasets can be found in Fig. \ref{fig:logistic_regression_nonconv_convergence_duke_madelon}.
  • Figure 4: Convergence of different constant coordinate schedules measured w.r.t. iterations (first column), time (second column) and # (Coordinates$^2$ + Coordinates) evaluated (third column), averaged over three runs, for logistic regression with non-convex regularization with $\lambda = 0.1$ for three datasets. First row: duke, second row: madelon, third row: realsim.
  • Figure 5: Comparison of constant vs. exponential schedules $\tau(S_k) = \tau_0 + c_e \exp(dk)$ for different parameters w.r.t. iterations (first column), time (second column) and # (Coordinates$^2$ + Coordinates) evaluated (third column), averaged over three runs, for logistic regression with non-convex regularization with $\lambda = 0.1$ for the gisette, duke and madelon datasets.
  • ...and 5 more figures
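
The exponential schedule $\tau(S_k) = \tau_0 + c_e \exp(dk)$ from the Figure 5 caption can be sketched as a small helper. The rounding to an integer, the clipping to the range $[1, n]$ and the cap on the exponent are assumptions added here so that the result is always a valid subspace size; the source does not state how these details are handled. The function name and the parameter values in the usage lines are hypothetical.

```python
import math


def exponential_schedule(k, tau0, c_e, d, n):
    """Subspace size at iteration k: tau0 + c_e * exp(d * k), clipped to [1, n]."""
    # Cap the exponent to avoid OverflowError for large d * k (assumption).
    growth = math.exp(min(d * k, 700.0))
    tau = tau0 + c_e * growth
    # Rounding and clipping are assumptions, not taken from the source.
    return int(min(n, max(1, round(tau))))


# Hypothetical parameter values, for illustration only.
sizes = [exponential_schedule(k, tau0=10, c_e=1.0, d=0.1, n=5000) for k in range(0, 101, 20)]
print(sizes)
```

With $d > 0$ and $c_e > 0$ the returned size is non-decreasing in $k$ and saturates at $n$, at which point every coordinate is sampled.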

Theorems & Definitions (29)

  • Remark 3.2
  • Theorem 4.2
  • Lemma 4.2
  • Lemma 4.2
  • Theorem 4.3
  • Theorem 4.4
  • Proposition A.0
  • proof
  • Lemma A.0
  • proof
  • ...and 19 more