Accelerated projected gradient algorithms for sparsity constrained optimization problems
Jan Harold Alcantara, Ching-pei Lee
TL;DR
This work addresses exact sparsity-constrained optimization, that is, solving $\min_{w \in A_s} f(w)$ with $A_s = \{ w : \|w\|_0 \le s \}$ and $f$ convex with $L$-Lipschitz gradient. The authors decompose the nonconvex feasible set into a finite union of linear subspaces and develop two acceleration schemes: (i) same-space extrapolation, which extrapolates only while consecutive iterates stay in the same subspace, guided by curvature-like estimates and safeguarded by backtracking to ensure descent, and (ii) a two-stage subspace identification method that first runs projected gradient (PG) iterations to identify the active subspace and then applies a semismooth/Newton-type method within that subspace. They prove global convergence of the PG method to stationary points, with local linear rates under mild contraction conditions, and show that the subspace identification approach can achieve superlinear, and up to $Q$-quadratic, convergence, depending on the properties of the smooth subproblem. Numerical experiments on large-scale empirical risk minimization (ERM) tasks show that the accelerated methods markedly outperform non-accelerated PG and prior accelerated methods for nonconvex problems, reducing runtime by orders of magnitude while preserving predictive performance. Overall, the paper provides practical, provably convergent acceleration techniques for exact sparsity-constrained optimization, with implications for scalable best subset selection and related high-dimensional problems.
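To make the decomposition and the basic iteration concrete, they can be written out as follows. This is a sketch in notation that is assumed here rather than taken from the paper: $n$ denotes the dimension of $w$ (with $s \le n$), $J$ an index set, $A_J$ the corresponding coordinate subspace, $P_{A_s}$ the Euclidean projection onto $A_s$, and $\lambda > 0$ a step size.

$$A_s = \bigcup_{J \subseteq \{1,\dots,n\},\ |J| = s} A_J, \qquad A_J = \{ w \in \mathbb{R}^n : w_i = 0 \ \text{for all } i \notin J \},$$

$$w^{k+1} \in P_{A_s}\bigl(w^k - \lambda \nabla f(w^k)\bigr).$$

There are $\binom{n}{s}$ subspaces $A_J$, so the union is finite. The projection $P_{A_s}$ keeps $s$ entries of largest magnitude and zeroes the rest; it is set-valued when magnitudes tie, which is why the update is written with $\in$.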
Abstract
We consider the projected gradient algorithm for the nonconvex best subset selection problem that minimizes a given empirical loss function under an $\ell_0$-norm constraint. By decomposing the feasible set of the given sparsity constraint as a finite union of linear subspaces, we present two acceleration schemes with global convergence guarantees, one by same-space extrapolation and the other by subspace identification. The former fully utilizes the problem structure to greatly speed up the optimization at only negligible additional cost. The latter leads to a two-stage meta-algorithm that first uses classical projected gradient iterations to identify the correct subspace containing an optimal solution, and then switches to a highly efficient smooth optimization method in the identified subspace to attain superlinear convergence. Experiments demonstrate that the proposed accelerated algorithms are orders of magnitude faster than their non-accelerated counterparts as well as the state of the art.
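The classical projected gradient iteration that both schemes build on can be sketched in a few lines. The block below is a minimal illustration, not the authors' implementation: it assumes a least-squares loss $f(w) = \tfrac{1}{2}\|Xw - y\|^2$, a fixed step size, and a fixed iteration count, and the names `project_sparse`, `grad_least_squares`, and `projected_gradient` are hypothetical. It omits both the extrapolation step and the second-stage smooth solver.

```python
def project_sparse(w, s):
    """Project w onto {w : ||w||_0 <= s}: keep s largest-magnitude entries."""
    # Ties are broken by index order (sorted is stable); any choice is a valid projection.
    keep = set(sorted(range(len(w)), key=lambda i: -abs(w[i]))[:s])
    return [w[i] if i in keep else 0.0 for i in range(len(w))]


def grad_least_squares(X, y, w):
    """Gradient of the assumed loss f(w) = 0.5 * ||Xw - y||^2, i.e. X^T (Xw - y)."""
    residual = [sum(x * wi for x, wi in zip(row, w)) - yi for row, yi in zip(X, y)]
    return [sum(X[r][j] * residual[r] for r in range(len(X))) for j in range(len(w))]


def projected_gradient(X, y, s, step, iters=200):
    """Plain (non-accelerated) projected gradient with a fixed step size."""
    w = [0.0] * len(X[0])
    for _ in range(iters):
        g = grad_least_squares(X, y, w)
        w = project_sparse([wi - step * gi for wi, gi in zip(w, g)], s)
    return w


# Toy problem: identity design, so the gradient is 1-Lipschitz (L = 1).
X = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
y = [3.0, 0.1, -2.0]
w_hat = projected_gradient(X, y, s=2, step=0.5)

# The support of the iterate identifies the coordinate subspace it lies in.
support = [i for i, wi in enumerate(w_hat) if wi != 0.0]
```

Once `support` stops changing between iterations, the problem restricted to those coordinates is an ordinary smooth problem, which is the point at which the two-stage scheme described above would switch to a faster solver.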
