Cubic regularized subspace Newton for non-convex optimization

Jim Zhao; Aurelien Lucchi; Nikita Doikov

TL;DR

This work addresses the challenge of optimizing non-convex functions in high dimensions by introducing SSCN, a stochastic subspace cubic Newton method that applies cubic regularization to a randomly selected coordinate subset. By combining second-order information projected onto a subspace with cubic regularization and a flexible sampling strategy, SSCN achieves global convergence to stationary points and interpolates between coordinate descent and full cubic Newton as the subspace size τ grows. The authors establish convergence guarantees for arbitrary τ, derive enhanced rates with exact Hessian information, and present an adaptive sampling scheme that drives τ dynamically to attain a second-order stationary point at a rate of O(ε^{-3/2}, ε^{-3}), while demonstrating substantial empirical speed-ups over first-order methods on standard datasets. These results enable efficient, scalable second-order optimization in over-parameterized machine learning settings where full Hessian computations are prohibitive.

Abstract

This paper addresses the problem of minimizing non-convex continuous functions, which is relevant to high-dimensional machine learning applications characterized by over-parametrization. We analyze a randomized coordinate second-order method named SSCN, which can be interpreted as applying cubic regularization in random subspaces. This approach reduces the computational cost of using second-order information, making it applicable in higher-dimensional scenarios. Theoretically, we establish convergence guarantees for non-convex functions, with rates that interpolate across arbitrary subspace sizes and allow inexact curvature estimation. As the subspace size increases, our complexity matches the $\mathcal{O}(\varepsilon^{-3/2})$ rate of cubic regularization (CR). Additionally, we propose an adaptive sampling scheme ensuring an exact convergence rate of $\mathcal{O}(\varepsilon^{-3/2}, \varepsilon^{-3})$ to a second-order stationary point, even without sampling all coordinates. Experimental results demonstrate substantial speed-ups achieved by SSCN compared to conventional first-order methods.
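To make the idea of "cubic regularization in random subspaces" concrete, here is a minimal NumPy sketch of one such step. It is an illustration under stated assumptions, not the authors' implementation: the names `solve_cubic_subproblem`, `sscn_step`, `grad_fn` and `hess_fn` are hypothetical, coordinates are sampled uniformly without replacement, the regularization parameter `M` is a hand-picked constant (the paper's own rule is not reproduced here), and the so-called hard case of the cubic subproblem is not treated. For brevity the sketch slices a full gradient and Hessian, whereas a practical version would evaluate only the sampled coordinates.

```python
import numpy as np


def solve_cubic_subproblem(g, H, M):
    """Minimize g@h + 0.5*h@H@h + (M/6)*||h||^3 for a small dense H.

    Uses the stationarity condition h = -(H + (M/2) r I)^{-1} g with
    r = ||h||, and finds r by bisection. Hard case is not handled.
    """
    lam, V = np.linalg.eigh(H)          # eigenvalues in ascending order
    g_rot = V.T @ g                     # gradient in the eigenbasis

    def step_norm(r):
        return np.linalg.norm(g_rot / (lam + 0.5 * M * r))

    # r must keep H + (M/2) r I positive definite.
    r_min = max(0.0, -2.0 * lam[0] / M)
    lo = r_min + 1e-8 * (1.0 + r_min)
    if step_norm(lo) <= lo:
        r = lo                          # zero gradient or (untreated) hard case
    else:
        hi = max(2.0 * lo, 1.0)
        while step_norm(hi) > hi:       # bracket the root of step_norm(r) = r
            hi *= 2.0
        for _ in range(200):            # bisection on a decreasing function
            mid = 0.5 * (lo + hi)
            if step_norm(mid) > mid:
                lo = mid
            else:
                hi = mid
        r = hi
    return -V @ (g_rot / (lam + 0.5 * M * r))


def sscn_step(x, grad_fn, hess_fn, tau, M, rng):
    """One subspace cubic Newton step on tau randomly chosen coordinates."""
    S = rng.choice(x.size, size=tau, replace=False)   # assumed: uniform sampling
    g_S = grad_fn(x)[S]                               # subspace gradient
    H_S = hess_fn(x)[np.ix_(S, S)]                    # tau x tau Hessian block
    x_new = x.copy()
    x_new[S] += solve_cubic_subproblem(g_S, H_S, M)
    return x_new


# Toy non-convex objective (an assumption for illustration only).
def f(x):
    return np.sum(0.25 * x**4 - 0.5 * x**2)


def grad(x):
    return x**3 - x


def hess(x):
    return np.diag(3.0 * x**2 - 1.0)


rng = np.random.default_rng(0)
x0 = np.full(20, 0.5)
x = x0.copy()
for _ in range(300):
    x = sscn_step(x, grad, hess, tau=5, M=20.0, rng=rng)
```

Setting `tau=1` gives a coordinate-descent-like update with curvature information, while `tau` equal to the full dimension recovers a full cubic Newton step, which mirrors the interpolation described in the abstract.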

Paper Structure (32 sections, 17 theorems, 136 equations, 10 figures, 2 tables, 1 algorithm)


INTRODUCTION
RELATED WORK
ALGORITHM
Notation and setting
Stochastic subspace cubic Newton
Complexity of solving the cubic subproblem.
CONVERGENCE ANALYSIS
General convergence rate
The power of second-order information
Adaptive sampling scheme
A practical scaling rule.
EXPERIMENTS
LIMITATIONS
Extension to arbitrary subspaces
CONCLUSION
...and 17 more sections

Key Result

Theorem 4.2

Let the sequence $\{{\bf x}_i\}$ be generated by Algorithm \ref{alg:SSCN} with arbitrary $\bm{Q}_k$ satisfying \eqref{SigmaDef}, and any fixed $\tau \equiv \tau(S_k) \in [n]$. Let the regularization parameter at iteration $k \geq 0$ be chosen as […]. For a given accuracy level $\varepsilon > 0$, assume that $\| \nabla f({\bf x}_i) \| \geq \varepsilon$ for all $0 \leq i \leq K$. Then it holds […]

Figures (10)

Figure 1: Comparison of CD, SSCN and RS-RNM \cite{fuji2022randomized} for different constant coordinate schedules, where $\tau$ denotes the dimension of the subspace in the SSCN method. Performance is measured w.r.t. iterations (first column) and time (second column), averaged over three runs, for logistic regression with non-convex regularization with $\lambda = 0.1$ on the datasets gisette (first row) and madelon (second row). Experiment details are described in Section \ref{sec:experiments}, and additional plots for the duke dataset can be found in Figure \ref{fig:logistic_regression_nonconv_CD_vs_SSCN_duke} in Appendix \ref{sec:additional_experiments}.
Figure 2: Squared norm of the step ${\bf h}_k$ for different constant coordinate schedules for logistic regression with non-convex regularization with $\lambda = 0.1$ on two datasets (left: gisette, right: madelon). For the same plot on the duke dataset, see Figure \ref{fig:logistic_regression_norm_h_k2_duke}. Note that the y-axis is plotted in log-scale.
Figure 3: Convergence of different constant coordinate schedules measured w.r.t. iterations (first column), time (second column) and # (Coordinates$^2$ + Coordinates) evaluated (third column), averaged over three runs, for logistic regression with non-convex regularization with $\lambda = 0.1$ on the gisette dataset. Similar plots for the duke and madelon datasets can be found in Figure \ref{fig:logistic_regression_nonconv_convergence_duke_madelon}.
Figure 4: Convergence of different constant coordinate schedules measured w.r.t. iterations (first column), time (second column) and # (Coordinates$^2$ + Coordinates) evaluated (third column), averaged over three runs, for logistic regression with non-convex regularization with $\lambda = 0.1$ on three datasets. First row: duke, second row: madelon, third row: realsim.
Figure 5: Comparison of constant vs. exponential schedules $\tau(S_k) = \tau_0 + c_e \exp(dk)$ for different parameters w.r.t. iterations (first column), time (second column) and # (Coordinates$^2$ + Coordinates) evaluated (third column), averaged over three runs, for logistic regression with non-convex regularization with $\lambda = 0.1$ on the gisette, duke and madelon datasets.
...and 5 more figures

Theorems & Definitions (29)

Remark 3.2
Theorem 4.2
Lemma 4.2
Lemma 4.2
Theorem 4.3
Theorem 4.4
Proposition A.0
proof
Lemma A.0
proof
...and 19 more
