The Stochastic Proximal Distance Algorithm

Haoyu Jiang, Jason Xu

TL;DR

The paper presents a stochastic proximal distance (SPD) algorithm for constrained optimization, linking proximal-distance reformulations to stochastic and implicit-gradient methods. It lets the penalty parameter grow as $\rho_k = \rho_1 k^{\gamma}$ with $0.5 < \gamma \le 1$, proves almost-sure convergence of the projected iterates, and derives finite-error and rate bounds for both the iterates and the objective under convexity assumptions. The authors validate the method empirically on convex and nonconvex constraint sets (e.g., unit ball, sparsity, low-rank), showing that SPD scales to large datasets and frequently outperforms its batch counterpart and projected SGD in practice. The results clarify the role of $1/\rho_k$ as an effective learning rate in stochastic proximal schemes, give practical guidelines for hyperparameter choices, and illustrate substantial computational benefits for large-scale constrained learning tasks.
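
To make the penalty schedule concrete, the following is a minimal sketch of one way such an iteration could look for least squares under a unit-ball constraint. It assumes that each step applies the proximal map of a single sampled squared loss, anchored at the projection of the current iterate, with penalty $\rho_k = \rho_1 k^{\gamma}$. That update form is a reading of the summary above, not the paper's stated algorithm, and the function names (`project_unit_ball`, `prox_squared_loss`, `spd_least_squares`), the default values, and the toy data are all hypothetical.

```python
import math
import random


def project_unit_ball(theta):
    """Euclidean projection onto the unit ball {theta : ||theta|| <= 1}."""
    norm = math.sqrt(sum(t * t for t in theta))
    if norm <= 1.0:
        return list(theta)
    return [t / norm for t in theta]


def prox_squared_loss(anchor, x, y, rho):
    """Closed-form minimizer over theta of
    0.5 * (x . theta - y)^2 + (rho / 2) * ||theta - anchor||^2."""
    residual = sum(xi * ai for xi, ai in zip(x, anchor)) - y
    scale = residual / (rho + sum(xi * xi for xi in x))
    return [ai - scale * xi for ai, xi in zip(anchor, x)]


def spd_least_squares(X, y, rho1=1.0, gamma=0.75, n_iters=2000, seed=0):
    """Assumed SPD-style loop: sample one observation, project the current
    iterate, then take a proximal step with a growing penalty rho_k."""
    rng = random.Random(seed)
    theta = [0.0] * len(X[0])
    for k in range(1, n_iters + 1):
        rho_k = rho1 * k ** gamma          # penalty schedule, 0.5 < gamma <= 1
        i = rng.randrange(len(X))          # one sampled observation
        anchor = project_unit_ball(theta)  # projection of the current iterate
        theta = prox_squared_loss(anchor, X[i], y[i], rho_k)
    # Report the projected iterate, which is the quantity the summary
    # says converges.
    return project_unit_ball(theta)


if __name__ == "__main__":
    # Toy data: the unconstrained least-squares solution (2, 0) lies
    # outside the unit ball, so the constraint is active.
    X = [[1.0, 0.0], [0.0, 1.0]]
    y = [2.0, 0.0]
    print(spd_least_squares(X, y))
```

In this sketch the step taken toward the sampled loss shrinks like $1/(\rho_k + \|x_i\|^2)$, which is the sense in which $1/\rho_k$ behaves like a learning rate.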

Abstract

Stochastic versions of proximal methods have gained much attention in statistics and machine learning. These algorithms tend to admit simple, scalable forms, and enjoy numerical stability via implicit updates. In this work, we propose and analyze a stochastic version of the recently proposed proximal distance algorithm, a class of iterative optimization methods that recover a desired constrained estimation problem as a penalty parameter $\rho \rightarrow \infty$. By uncovering connections to related stochastic proximal methods and interpreting the penalty parameter as the learning rate, we justify heuristics used in practical manifestations of the proximal distance method, establishing their convergence guarantees for the first time. Moreover, we extend recent theoretical devices to establish finite error bounds and a complete characterization of convergence rate regimes. We validate our analysis via a thorough empirical study, also showing that, unsurprisingly, the proposed method outpaces batch versions on popular learning tasks.

Paper Structure

This paper contains 43 sections, 9 theorems, 87 equations, 8 figures, 4 tables, 1 algorithm.

Key Result

Proposition 1

Under Assumption conv_FC, the constrained solution $\bm{\theta}_* = \mathop{\mathrm{arg\,min}}\limits_{\bm{\theta}\in C}\,\,F(\bm{\theta})$ exists and is unique.

Figures (8)

  • Figure 1: Finite error at different $\gamma$ in different settings under the unit ball constraint (top panel) and sparsity set constraint (bottom panel)
  • Figure 2: Finite error in terms of the parameter at different $\gamma$ in different settings under the unit ball constraint (top panel) and sparsity constraint (bottom panel)
  • Figure 3: (a) Performance on real data. "Prox" refers to the proposed method; "Proj" is projected stochastic gradient descent, with batch sizes $b=200$ and $500$; (b) Runtime as $n,b$ vary
  • Figure 4: Finite error at different $\rho_1$ in different settings under the unit ball constraint
  • Figure 5: Box plots of error in various settings under the unit ball constraint.
  • ...and 3 more figures

Theorems & Definitions (16)

  • Proposition 1
  • Proposition 2
  • Theorem 1
  • Theorem 2
  • Theorem 3
  • proof
  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • ...and 6 more