Sparse Polyak: an adaptive step size rule for high-dimensional M-estimation

Tianqi Qiao, Marie Maros

TL;DR

Sparse Polyak addresses high-dimensional M-estimation under $d/n\to\infty$ by re-engineering Polyak's adaptive step size to rely on the restricted gradient magnitude $\|\mathrm{HT}_s(\nabla f(\theta_t))\|^2$, effectively estimating a restricted Lipschitz smoothness constant. The main theoretical result proves linear convergence to near-optimal statistical precision with a rate $1-1/(80\bar{\kappa})$ under a sparsity condition $s\ge(240\bar{\kappa})^2 s^*$, and shows that the rate and final precision remain invariant as problem size grows when the restricted quantities stay fixed. The paper provides probabilistic guarantees for sparse logistic regression and low-rank matrix regression, including conditions for support or rank recovery via SNR assumptions, and introduces an adaptive double-loop variant to handle unknown targets $f^*$. Empirical results on synthetic and real data demonstrate robust performance and rate-invariance, with Sparse Polyak outperforming classical Polyak and standard IHT in high-dimensional settings.
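To make the constants in the TL;DR concrete, the following is a small arithmetic sketch, not code from the paper. It assumes $\bar{\kappa}$ plays the role of a restricted condition number (the summary does not define it) and that the error quantity tracked by the theorem contracts by the factor $1-1/(80\bar{\kappa})$ at every iteration. The function names are hypothetical.

```python
import math

def contraction_rate(kappa_bar):
    """Per-iteration contraction factor 1 - 1/(80 * kappa_bar) quoted in the TL;DR."""
    return 1.0 - 1.0 / (80.0 * kappa_bar)

def min_sparsity(kappa_bar, s_star):
    """Smallest integer s satisfying s >= (240 * kappa_bar)^2 * s_star."""
    return math.ceil((240.0 * kappa_bar) ** 2 * s_star)

def iterations_to_shrink(kappa_bar, factor):
    """Smallest t with rate**t <= factor, for a target factor in (0, 1)."""
    return math.ceil(math.log(factor) / math.log(contraction_rate(kappa_bar)))

# Illustrative values only; s_star = 10 and the 1e-3 target are assumptions.
for kappa_bar in (1.0, 2.0, 5.0):
    print(kappa_bar,
          contraction_rate(kappa_bar),
          min_sparsity(kappa_bar, s_star=10),
          iterations_to_shrink(kappa_bar, 1e-3))
```

None of the three quantities takes $d$ or $n$ as an input, which is the sense in which the rate stays invariant as the problem grows while the restricted quantities stay fixed.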

Abstract

We propose and study Sparse Polyak, a variant of Polyak's adaptive step size, designed to solve high-dimensional statistical estimation problems where the problem dimension is allowed to grow much faster than the sample size. In such settings, the standard Polyak step size performs poorly, requiring an increasing number of iterations to achieve optimal statistical precision, even when the problem remains well conditioned and/or the achievable precision itself does not degrade with problem size. We trace this limitation to a mismatch in how smoothness is measured: in high dimensions, it is no longer effective to estimate the Lipschitz smoothness constant. Instead, it is more appropriate to estimate the smoothness restricted to specific directions relevant to the problem (the restricted Lipschitz smoothness constant). Sparse Polyak overcomes this issue by modifying the step size to estimate the restricted Lipschitz smoothness constant. We support our approach with both theoretical analysis and numerical experiments, demonstrating its improved performance.
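The mismatch described in the abstract can be illustrated with a minimal sketch. It assumes the classical Polyak step is $(f(\theta_t) - f^*)/\|\nabla f(\theta_t)\|^2$ and that the sparse variant replaces the denominator with $\|\mathrm{HT}_s(\nabla f(\theta_t))\|^2$, where $\mathrm{HT}_s$ keeps the $s$ largest-magnitude entries. The paper's exact rule may include constants not shown in this summary, and `toy_gradient` is a hypothetical gradient built for illustration, not data from the paper.

```python
import heapq

def hard_threshold(v, s):
    """Keep the s largest-magnitude entries of v and set the rest to zero."""
    if s <= 0:
        return [0.0] * len(v)
    keep = set(heapq.nlargest(s, range(len(v)), key=lambda i: abs(v[i])))
    return [x if i in keep else 0.0 for i, x in enumerate(v)]

def sq_norm(v):
    """Squared Euclidean norm."""
    return sum(x * x for x in v)

def polyak_step(f_val, f_target, grad):
    """Classical Polyak step: optimality gap over the full squared gradient norm."""
    return (f_val - f_target) / sq_norm(grad)

def sparse_polyak_step(f_val, f_target, grad, s):
    """Assumed sparse variant: gap over the squared norm of the thresholded gradient."""
    return (f_val - f_target) / sq_norm(hard_threshold(grad, s))

def toy_gradient(d, s):
    """Hypothetical gradient: s entries of size 1.0 and d - s small entries of size 0.1."""
    return [1.0] * s + [0.1] * (d - s)

# Same optimality gap and same s relevant directions; only the dimension d grows.
for d in (100, 1000, 10000):
    g = toy_gradient(d, s=10)
    print(d, polyak_step(1.0, 0.0, g), sparse_polyak_step(1.0, 0.0, g, s=10))
```

In this toy setting the many small gradient entries inflate the full norm as $d$ grows, so the classical step shrinks, whereas the thresholded denominator and hence the sparse step stay fixed.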

Paper Structure

This paper contains 24 sections, 14 theorems, 97 equations, 3 figures, 2 algorithms.

Key Result

Theorem 1

Let $\{\theta_t\}_{t \geq 1}$ denote the sequence of iterates generated by Algorithm algo:iht. Suppose the objective function $f$ satisfies the RSC and RSS conditions in Assumptions asp:rscvx and asp:rsmooth, respectively. Let $\widehat{\theta}$ be any $s^{*}$-sparse vector such that $f(\widehat{\theta}) = \dots$ Moreover, let $t_0 \geq 0$ be the first iteration for which $\|\theta_{t_0} - \widehat{\theta}\|^2 \dots$

Figures (3)

  • Figure 1: Performance of Polyak's step size (dashed) and Sparse Polyak (solid) on logistic regression problems with increasing $d$ and $n$. The quantities $s$, $s^{*}$, $\bar{\kappa}$, and $\frac{\log(d)}{n}$ remain constant. With Polyak's step size the performance degrades as $d$ increases, whereas Sparse Polyak exhibits rate invariance, i.e., the number of iterations to achieve (near) optimal statistical precision does not change.
  • Figure 2: Left and center: IHT with step size $\frac{2}{3 \bar{L}}$ (blue) vs. Algorithm algo:iht (red) on linear and logistic regression, respectively. Right: effect of the choice of $\widehat{f}$ in Algorithm algo:iht. In all scenarios $\alpha = 5$, $d = 5000$, and $s = 700$.
  • Figure 3: Performance comparison of IHT with optimal constant step size, Sparse Polyak and classical Polyak when performing: (left) linear regression on the Wave Energy Farm data set, and (right) logistic regression on the Molecule Musk data set.

Theorems & Definitions (22)

  • Definition 1: Hard Thresholding Operator
  • Theorem 1
  • Corollary 1
  • Theorem 2
  • Lemma 1
  • Corollary 2
  • Remark 1
  • Lemma 2
  • Corollary 3
  • Remark 2
  • ...and 12 more