A Normal Map-Based Proximal Stochastic Gradient Method: Convergence and Identification Properties

Junwen Qiu; Li Jiang; Andre Milzarek

TL;DR

This work introduces NSGD, a normal map-based proximal stochastic gradient method for nonconvex composite problems that requires no variance reduction. By leveraging Robinson's normal map and a carefully designed merit function, NSGD converges globally (accumulation points of the iterates are stationary points almost surely), admits nonasymptotic complexity bounds that match the known results for PSGD, and identifies active manifolds in finite time almost surely under mild definability assumptions, using a Kurdyka-Łojasiewicz framework. The approach yields stronger identification properties than standard PSGD at a similar computational cost per iteration. The theory is complemented by numerical experiments on nonconvex classification and sparse + low-rank decomposition, where NSGD exhibits improved sparsity, lower rank, and faster convergence. Overall, the normal map perspective offers a variance-reduction-free route to convergence and identification in stochastic nonconvex optimization.

Abstract

The proximal stochastic gradient method (PSGD) is one of the state-of-the-art approaches for stochastic composite-type problems. In contrast to its deterministic counterpart, PSGD has been found to have difficulties with the correct identification of underlying substructures (such as supports, low-rank patterns, or active constraints), and it does not possess a finite-time manifold identification property. Existing solutions rely on convexity assumptions or on the additional use of variance reduction techniques. In this paper, we address these limitations and present a simple variant of PSGD based on Robinson's normal map. The proposed normal map-based proximal stochastic gradient method (NSGD) is shown to converge globally, i.e., accumulation points of the generated iterates correspond to stationary points almost surely. In addition, we establish complexity bounds for NSGD that match the known results for PSGD, and we prove that NSGD can almost surely identify active manifolds in finite time in a general nonconvex setting. Our derivations are built on almost sure iterate convergence guarantees and utilize analysis techniques based on the Kurdyka-Łojasiewicz inequality.
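The difference between the two methods can be sketched in Python on the one-dimensional toy problem from the Figure 1 caption ($f(x) = x$, $\varphi$ the indicator of $[-1,1]$, $g^k = 1 + e^k$, $\alpha_k = 1/k$, $\lambda = 1$). The Norm-SGD update used here, $z^{k+1} = z^k - \alpha_k\,(g^k + \lambda^{-1}(z^k - x^k))$ with $x^k = \mathrm{prox}_{\lambda\varphi}(z^k)$, is an assumption about the form of the method, since this summary does not spell it out. The function names, the seed, and the tail-count diagnostic are illustrative and do not come from the paper.

```python
import random


def project(v, lo=-1.0, hi=1.0):
    """Prox of the indicator of [lo, hi], i.e. the Euclidean projection."""
    return min(max(v, lo), hi)


def run_prox_sgd(steps, seed, x0=100.0):
    """Prox-SGD: x^{k+1} = prox(x^k - alpha_k * g^k), with alpha_k = 1/k."""
    rng = random.Random(seed)
    x = x0
    off_tail = 0  # iterates in the second half of the run with x != -1
    for k in range(1, steps + 1):
        g = 1.0 + rng.gauss(0.0, 1.0)  # f'(x) = 1 plus N(0, 1) noise
        x = project(x - g / k)
        if k > steps // 2 and x > -1.0:
            off_tail += 1
    return x, off_tail


def run_norm_sgd(steps, seed, z0=100.0, lam=1.0):
    """Assumed Norm-SGD step: z <- z - alpha_k * (g + (z - x) / lam), x = prox(z)."""
    rng = random.Random(seed)
    z = z0
    x = project(z)
    off_tail = 0
    for k in range(1, steps + 1):
        g = 1.0 + rng.gauss(0.0, 1.0)  # stochastic gradient evaluated at x = prox(z)
        z = z - (g + (z - x) / lam) / k
        x = project(z)
        if k > steps // 2 and x > -1.0:
            off_tail += 1
    return x, z, off_tail


x_prox, off_prox = run_prox_sgd(5000, seed=0)
x_norm, z_norm, off_norm = run_norm_sgd(5000, seed=0)
print("Prox-SGD tail iterates off the manifold:", off_prox)
print("Norm-SGD tail iterates off the manifold:", off_norm)
```

In this sketch the auxiliary iterate $z^k$ settles near $-2$, strictly inside the region that projects onto $x^* = -1$, so small noisy steps no longer move $x^k$. Prox-SGD sits on the boundary point itself, so it is pushed off whenever the noisy gradient is negative.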

Paper Structure (28 sections, 16 theorems, 80 equations, 4 figures, 1 table)

Introduction
Designing Norm-SGD
Stationarity measures
Additional literature on Prox-SGD
Preliminaries and basic properties
Preparatory lemmas
Bounding the normal map and function values
Merit function and approximate descent
Complexity bound and global convergence
Complexity bound
Global convergence
Iterate convergence and manifold identification
A Kurdyka-Lojasiewicz-type property for H
Iterate convergence
Proof of Theorem 4.2
...and 13 more sections

Key Result

Lemma 2.1

Let A1--A2 and B1--B2 hold and let $\{\boldsymbol{x}^k\}_k$ and $\{\boldsymbol{z}^k\}_k$ be generated by $\mathsf{Norm}\text{-}\mathsf{SGD}$. Let us set ${\sf L}_\lambda := {\sf L}+2\lambda^{-1}$. For all $0\leq m<n$ with $\tau_{m,n} \leq 1/(2{\sf L}_\lambda)$, we have

Figures (4)

Figure 1: Behavior of $\mathsf{Prox}\text{-}\mathsf{SGD}$ and $\mathsf{Norm}\text{-}\mathsf{SGD}$ on ${\min}_{x \in \mathbb{R}}~f(x)+ \varphi(x)$ where $f(x) := x$ and $\varphi(x) := \mathds{1}_{[-1,1]}(x)$. This example originates from [duchi2021asymptotic]. $\mathsf{Prox}\text{-}\mathsf{SGD}$ and $\mathsf{Norm}\text{-}\mathsf{SGD}$ use the stochastic gradients $g^k = f^\prime(x^k) + e^k = 1+e^k$ with iid Gaussian noise $e^k\sim \mathcal{N}(0,1)$ and $\alpha_k := \frac{1}{k}$, $\lambda = 1$. In this scenario, both $\mathsf{Prox}\text{-}\mathsf{SGD}$ and $\mathsf{Norm}\text{-}\mathsf{SGD}$ converge to the global solution $x^* = -1$ a.s. The plots depict the distance $|x^k-x^*|$ (in logarithmic scale) for 10 independent runs with $x^0 = z^0 = 100$. The dash-dotted line, $k\mapsto\frac{3}{k}$, approximates the convergence trend of $\mathsf{Prox}\text{-}\mathsf{SGD}$. Notably, $\mathsf{Prox}\text{-}\mathsf{SGD}$ repeatedly escapes from $x^*$ (i.e., from the active manifold $\mathcal{M}_{x^*} := \{x^*\}$), whereas $\mathsf{Norm}\text{-}\mathsf{SGD}$ remains at $x^*$ after identifying the active constraint. Indeed, in [duchi2021asymptotic], it is shown that the stochastic process $\{\boldsymbol{x}^k\}_k$ generated by $\mathsf{Prox}\text{-}\mathsf{SGD}$ satisfies $\mathbb{P}(\boldsymbol{x}^{k} \geq -1 + \alpha_k) \geq \varepsilon > 0$. Thus, $\{\boldsymbol{x}^k\}_k$ will not stay on $\mathcal{M}_{x^*}$ a.s. even if the optimal solution is correctly identified.
Figure 1: Performance and sparsity information of $\mathsf{Norm}\text{-}\mathsf{SGD}$ and $\mathsf{Prox}\text{-}\mathsf{SGD}$ for the classification problem \ref{eq:binary-clas}. The sparsity is measured via $100\% \cdot |\{i: |x_i^k| \leq 10^{-8}\}|/d$. (Averaged over $10$ runs.)
Figure 2: Relation between different stationarity measures. We consider ${\min}_{x \in \mathbb{R}^2}~f(x)+ \varphi(x)$, where $f(x) := \frac{1}{2}\|x+[2,1]^\top\|^2$ and $\varphi(x) := \nu \|x\|^2 + \|x\|_1$, $\nu>0$. We further set $\lambda = \frac{1}{2}$, $z=0$, and $x=\mathrm{prox}_{\lambda\varphi}(z) = 0$. Left ($\nu = 1$): Simple calculations yield $F^{\lambda}_{\mathrm{nat}}(x) = [\frac{1}{1+\nu},0]^\top$, $F^{\lambda}_{\mathrm{nor}}(z) = [2,1]^\top$, and $\partial \psi(x)= [1,3]\times[0,2]$. Clearly, it holds that $F^{\lambda}_{\mathrm{nor}}(z) \in \partial \psi(x)$ but $F^{\lambda}_{\mathrm{nat}}(x) \notin \partial \psi(x)$. Right: Comparison of $\|F^{\lambda}_{\mathrm{nat}}(x)\|$, $\|F^{\lambda}_{\mathrm{nor}}(z)\|$, and $\|\partial\psi(x)\|_{-}$ for different regularization parameters $\nu$.
Figure 2: Performance of $\mathsf{Norm}\text{-}\mathsf{SGD}$ and $\mathsf{Prox}\text{-}\mathsf{SGD}$ on the sparse $+$ low-rank task \ref{eq:prob-pcp}. Each row depicts the results for different step sizes $\alpha_k = \frac{1}{2}(k+1)^{-\gamma}$, $\gamma \in \{\frac{2}{3},\frac{3}{4},1\}$. In the first column, we plot the rank of $\{X^k\}_k$ (the number of singular values larger than $10^{-6}$; using solid lines) and the sparsity level $100\%\cdot|\{i,j:|Y_{ij}^k| \leq 10^{-6}\}|/(mn)$ of $\{Y^k\}_k$ (using dashed lines). The CPU time per iteration is shown in the middle column; the legend depicts the total running time. The right column illustrates the change in the objective function values. (Averaged over $5$ runs.)
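The values quoted in the stationarity-measure caption above can be checked numerically. The sketch below assumes the standard definitions $F^{\lambda}_{\mathrm{nat}}(x) = \lambda^{-1}(x - \mathrm{prox}_{\lambda\varphi}(x - \lambda\nabla f(x)))$ and $F^{\lambda}_{\mathrm{nor}}(z) = \nabla f(\mathrm{prox}_{\lambda\varphi}(z)) + \lambda^{-1}(z - \mathrm{prox}_{\lambda\varphi}(z))$, which are not written out in this summary but reproduce the numbers in the caption. All function names are hypothetical.

```python
def soft_threshold(v, t):
    """Scalar soft-thresholding: the prox of t * |.|."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def prox(v, lam, nu):
    """Prox of lam * (nu * ||y||^2 + ||y||_1); separable across coordinates."""
    return [soft_threshold(vi, lam) / (1.0 + 2.0 * lam * nu) for vi in v]


def grad_f(x):
    """Gradient of f(x) = 0.5 * ||x + [2, 1]||^2."""
    return [x[0] + 2.0, x[1] + 1.0]


def natural_residual(x, lam, nu):
    """Assumed definition: (x - prox(x - lam * grad f(x))) / lam."""
    g = grad_f(x)
    p = prox([xi - lam * gi for xi, gi in zip(x, g)], lam, nu)
    return [(xi - pi) / lam for xi, pi in zip(x, p)]


def normal_map(z, lam, nu):
    """Assumed definition: grad f(prox(z)) + (z - prox(z)) / lam."""
    x = prox(z, lam, nu)
    g = grad_f(x)
    return [gi + (zi - xi) / lam for gi, zi, xi in zip(g, z, x)]


def in_box(v, box):
    """Membership test for a product of closed intervals."""
    return all(lo <= vi <= hi for vi, (lo, hi) in zip(v, box))


lam, nu = 0.5, 1.0
z = [0.0, 0.0]
x = prox(z, lam, nu)                      # equals [0, 0], as in the caption
subdiff_box = [(1.0, 3.0), (0.0, 2.0)]    # the set [1,3] x [0,2] from the caption
f_nat = natural_residual(x, lam, nu)
f_nor = normal_map(z, lam, nu)
print("F_nat:", f_nat, "in subdifferential:", in_box(f_nat, subdiff_box))
print("F_nor:", f_nor, "in subdifferential:", in_box(f_nor, subdiff_box))
```

For $\nu = 1$ the first component of the natural residual is $\frac{1}{1+\nu} = \frac{1}{2}$, which lies outside $[1,3]$, while the normal map value $[2,1]^\top$ lies inside the box.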

Theorems & Definitions (33)

Lemma 2.1: Iterate bounds
Lemma 2.2
Lemma 2.3
Definition 2.4: Merit function
Proposition 2.5: Approximate descent property
Proof 1
Definition 3.1: Time indices
Lemma 3.2
Proof 2
Theorem 3.3: Complexity bound
...and 23 more
