Accelerated projected gradient algorithms for sparsity constrained optimization problems

Jan Harold Alcantara; Ching-pei Lee

TL;DR

This work addresses exact sparsity-constrained optimization by solving $\min_{w \in A_s} f(w)$, where $A_s = \{ w : \|w\|_0 \le s \}$ and $f$ is convex with an $L$-Lipschitz gradient. The authors decompose the nonconvex feasible set into a finite union of linear subspaces and develop two acceleration schemes: (i) same-space extrapolation within a fixed subspace, guided by curvature-like estimates and backtracking to ensure descent, and (ii) a subspace-identified two-stage method that first uses projected gradient (PG) to identify the active subspace and then applies a semismooth/Newton-type method in that subspace to achieve superlinear convergence. They prove global convergence of the PG method to stationary points, with local linear rates under mild contraction conditions, and show that the subspace-identified approach can achieve superlinear or even $Q$-quadratic convergence, depending on the properties of the smooth subproblem. Numerical experiments on large-scale ERM tasks demonstrate that the accelerated methods markedly outperform non-accelerated PG and prior nonconvex accelerations, achieving orders-of-magnitude reductions in runtime while preserving predictive performance. Overall, the paper provides practical, provably convergent acceleration techniques for exact sparsity-constrained optimization, with implications for scalable best-subset selection and related high-dimensional problems.
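As a rough illustration of the baseline that the paper accelerates, the sketch below runs plain projected gradient on $\min_{w \in A_s} f(w)$ for a least squares loss. Projection onto $A_s$ keeps the $s$ largest-magnitude entries (hard thresholding). The function names, the choice $f(w) = \tfrac{1}{2}\|Xw - y\|^2$, the step size $0.9/L$, and the fixed iteration count are assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def project_sparse(w, s):
    """Project w onto A_s = {w : ||w||_0 <= s} by keeping the s largest |w_i|."""
    out = np.zeros_like(w)
    if s <= 0:
        return out
    s = min(s, w.size)
    keep = np.argpartition(np.abs(w), -s)[-s:]  # indices of the s largest magnitudes
    out[keep] = w[keep]
    return out

def objective(X, y, w):
    """Least squares loss f(w) = 0.5 * ||Xw - y||^2 (assumed example loss)."""
    r = X @ w - y
    return 0.5 * float(r @ r)

def projected_gradient(X, y, s, iters=200):
    """Plain (non-accelerated) PG: gradient step, then projection onto A_s."""
    L = np.linalg.norm(X, 2) ** 2   # Lipschitz constant of the gradient of f
    step = 0.9 / L                  # assumed step size, chosen below 1/L
    w = np.zeros(X.shape[1])
    history = [objective(X, y, w)]
    for _ in range(iters):
        grad = X.T @ (X @ w - y)
        w = project_sparse(w - step * grad, s)
        history.append(objective(X, y, w))
    return w, history
```

With a step size no larger than $1/L$ the objective values in `history` do not increase, which is the descent property that the extrapolation scheme has to preserve through backtracking.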

Abstract

We consider the projected gradient algorithm for the nonconvex best subset selection problem that minimizes a given empirical loss function under an $\ell_0$-norm constraint. By decomposing the feasible set of the given sparsity constraint into a finite union of linear subspaces, we present two acceleration schemes with global convergence guarantees, one by same-space extrapolation and the other by subspace identification. The former fully exploits the problem structure to greatly accelerate the optimization at only negligible additional cost. The latter leads to a two-stage meta-algorithm that first uses classical projected gradient iterations to identify the correct subspace containing an optimal solution, and then switches to a highly efficient smooth optimization method in the identified subspace to attain superlinear convergence. Experiments demonstrate that the proposed accelerated algorithms are orders of magnitude faster than their non-accelerated counterparts as well as the state of the art.
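The two-stage idea in the abstract can be sketched for the special case of least squares. Stage one runs projected gradient until the support (the nonzero index set, which determines the subspace) stops changing; stage two solves the smooth problem restricted to that subspace. For a quadratic loss a single Newton step on the restricted problem is exact, so the sketch simply calls a least squares solver. The switching rule (`patience` consecutive iterations with an unchanged support), the step size, and all function names are assumptions made for this example; the paper's actual identification criterion and second-stage solver are not reproduced here.

```python
import numpy as np

def hard_threshold(w, s):
    """Keep the s largest-magnitude entries of w and zero out the rest."""
    out = np.zeros_like(w)
    s = min(s, w.size)
    if s > 0:
        keep = np.argpartition(np.abs(w), -s)[-s:]
        out[keep] = w[keep]
    return out

def two_stage_least_squares(X, y, s, patience=5, max_iters=1000):
    """Stage 1: PG until the support is stable. Stage 2: solve on that support."""
    L = np.linalg.norm(X, 2) ** 2
    step = 0.9 / L                      # assumed step size below 1/L
    w = np.zeros(X.shape[1])
    support, stable = frozenset(), 0
    for _ in range(max_iters):
        w = hard_threshold(w - step * (X.T @ (X @ w - y)), s)
        new_support = frozenset(np.flatnonzero(w).tolist())
        # Hypothetical identification heuristic: count unchanged supports.
        stable = stable + 1 if new_support == support else 0
        support = new_support
        if stable >= patience:
            break
    idx = np.array(sorted(support), dtype=int)
    w_refined = np.zeros_like(w)
    if idx.size:
        # Restricted problem is an unconstrained least squares problem,
        # so one exact solve plays the role of the fast second-stage method.
        w_refined[idx] = np.linalg.lstsq(X[:, idx], y, rcond=None)[0]
    return w_refined, idx
```

For non-quadratic losses such as the logistic loss, the second stage would instead iterate a Newton-type method on the variables in `idx`, which is where the superlinear rate mentioned above comes from.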

Paper Structure (26 sections, 6 theorems, 81 equations, 6 figures, 10 tables)

Introduction
Related Works.
Contributions.
Projected Gradient Algorithm
Accelerated methods
Acceleration by extrapolation
Subspace Identification
Experiments
Comparisons of algorithms for large datasets.
Transition Plots.
Conclusions
Implementation Details for Subspace Identification (sec:identify)
Experimental settings
Additional Experiments
Other settings of $s$
...and 11 more sections

Key Result

Theorem 2.1

Let $\{w^k\}$ be a sequence generated by the projected gradient iteration (eq:pgm). Then:

Figures (6)

Figure 1: Experiments on sparse regularized LR and LS. We plot running time versus the residual of (eq:opt).
Figure 2: Transition plots. We present sparsity levels versus running time (in log scale). Top row: logistic loss. Bottom row: least squares loss.
Figure 3: Sparse regularized logistic loss regression.
Figure 4: Sparse least squares regression.
Figure 5: Prediction performance of different methods for sparse logistic regression (news20, rcv1.binary, webspam) and least squares regression (E2006-log1p) across varying levels of residuals $\epsilon = 10^{-k}$, with $k=1,2,\dots,6$. Generated plots correspond to a sparsity level of $s=\lceil 0.01m \rceil$.
...and 1 more figures

Theorems & Definitions (14)

Theorem 2.1
Theorem 2.2
Theorem 3.1
Theorem 3.2
Theorem 3.3
proof : Proof of part (a)
proof : Proof of part (b)
proof : Proof of part (c)
proof
proof
...and 4 more
