Single-Timescale Multi-Sequence Stochastic Approximation Without Fixed Point Smoothness: Theories and Applications

Yue Huang, Zhaoxian Wu, Shiqian Ma, Qing Ling

TL;DR

The theoretical findings reveal that, when all involved operators are strongly monotone except for the main one, MSSA converges at a rate of $\mathcal{O}(K^{-\frac{1}{2}})$.

Abstract

Stochastic approximation (SA) that involves multiple coupled sequences, known as multiple-sequence SA (MSSA), finds diverse applications in signal processing and machine learning. However, the existing theoretical understanding of MSSA is limited: the multi-timescale analysis implies a slow convergence rate, whereas the single-timescale analysis relies on a stringent fixed point smoothness assumption. This paper establishes a tighter single-timescale analysis for MSSA, without assuming smoothness of the fixed points. Our theoretical findings reveal that, when all involved operators are strongly monotone, MSSA converges at a rate of $\tilde{\mathcal{O}}(K^{-1})$, where $K$ denotes the total number of iterations. In addition, when all involved operators are strongly monotone except for the main one, MSSA converges at a rate of $\mathcal{O}(K^{-\frac{1}{2}})$. These theoretical findings align with those established for single-sequence SA. Applying these theoretical findings to bilevel optimization and communication-efficient distributed learning offers relaxed assumptions and/or simpler algorithms with performance guarantees, as validated by numerical experiments.
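To make the setting concrete, the following is a minimal sketch of a two-sequence SA run in which both sequences share one step size schedule (a single-timescale choice). It is not the paper's algorithm or experiment: the linear operators, the matrix `A`, the vector `b`, the noise level, and the schedule `2 / (k + 20)` are all assumptions chosen for illustration. The secondary operator has the root $y^\ast(x) = Ax$, and the main operator evaluated at that root is $Ax - b$, which is strongly monotone because `A` is positive definite.

```python
import numpy as np


def run_two_sequence_sa(num_iters=20000, noise_std=0.1, seed=0):
    """Toy two-sequence SA with a shared (single-timescale) step size.

    All problem data below are illustrative assumptions, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    A = np.array([[2.0, 0.5], [0.5, 1.0]])  # symmetric positive definite (assumed)
    b = np.array([1.0, -1.0])               # assumed target vector
    x_star = np.linalg.solve(A, b)          # root of the main operator at y = y*(x)

    x = np.zeros(2)  # main sequence
    y = np.zeros(2)  # secondary sequence, tracks y*(x) = A x

    for k in range(num_iters):
        step = 2.0 / (k + 20)  # the same schedule drives both sequences
        # Noisy operator evaluations (zero-mean Gaussian noise, assumed).
        v = y - b + noise_std * rng.standard_normal(2)      # main operator
        h = y - A @ x + noise_std * rng.standard_normal(2)  # secondary operator
        # Simultaneous update: both use the iterates from step k.
        x, y = x - step * v, y - step * h

    return x, y, x_star, A


if __name__ == "__main__":
    x, y, x_star, A = run_two_sequence_sa()
    print("distance of x to its root:", np.linalg.norm(x - x_star))
    print("tracking error of y:", np.linalg.norm(y - A @ x))
```

In this toy problem both errors shrink together even though the secondary sequence is never run to convergence for a fixed $x$, which is the behavior a single-timescale analysis aims to explain.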

Paper Structure

This paper contains 29 sections, 12 theorems, 170 equations, 4 figures, 3 tables.

Key Result

Lemma 1

Suppose Assumptions asp-sub-sm and asp-lc-sub hold. For any $n$, given any $x,y_{1:n-1}$, there exists a unique fixed point $y_n^\ast(x,y_{1:n-1})$ satisfying In addition, $y_n^\ast(x,y_{1:n-1})$ is $L_{y,n}$-Lipschitz continuous w.r.t. any variable among $x$ and $y_{1:n-1}$, where $L_{y,n}=\frac{\ell_n}{\mu_n}$ for Condition (a) in Assumption asp-lc-sub, and $L_{y,n}=\frac{\ell_{b,n}}{\mu_n}+\fr
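The ratio $\frac{\ell_n}{\mu_n}$ in Lemma 1 can be checked numerically on a toy case. The sketch below assumes a hypothetical linear operator $h(x,y) = Ay + Bx$ whose fixed point is taken to be its root in $y$, so that $y^\ast(x) = -A^{-1}Bx$; here $\mu$ is the smallest eigenvalue of the symmetric positive definite matrix $A$ and $\ell$ is the spectral norm of $B$. These choices are illustrative assumptions and are not the paper's assumptions or notation in full.

```python
import numpy as np


def fixed_point_lipschitz_demo(seed=0, dim_x=3, dim_y=4, trials=200):
    """Compare an empirical Lipschitz ratio of y*(x) with the bound ell / mu.

    The linear operator h(x, y) = A y + B x is a hypothetical example.
    """
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((dim_y, dim_y))
    A = G @ G.T + np.eye(dim_y)              # symmetric, eigenvalues >= 1
    B = rng.standard_normal((dim_y, dim_x))

    mu = np.linalg.eigvalsh(A).min()         # strong monotonicity modulus in y
    ell = np.linalg.norm(B, 2)               # Lipschitz constant in x

    def y_star(x):
        # Root of h(x, .): solve A y = -B x.
        return -np.linalg.solve(A, B @ x)

    worst_ratio = 0.0
    for _ in range(trials):
        x1 = rng.standard_normal(dim_x)
        x2 = rng.standard_normal(dim_x)
        ratio = np.linalg.norm(y_star(x1) - y_star(x2)) / np.linalg.norm(x1 - x2)
        worst_ratio = max(worst_ratio, ratio)

    return worst_ratio, ell / mu


if __name__ == "__main__":
    worst_ratio, bound = fixed_point_lipschitz_demo()
    print("largest observed ratio:", worst_ratio)
    print("bound ell / mu:", bound)
```

For this linear case the bound follows from $\|A^{-1}B\| \le \|A^{-1}\|\,\|B\| = \frac{\ell}{\mu}$, so the observed ratio never exceeds it.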

Figures (4)

  • Figure 1: Analytic ideas are illustrated for two sequences with 3 iterations ($N=1,\, K=3$). Previous works [doan2022nonlinear, shen2022single] trace the distance between $y^k$ and the time-varying fixed point $y^\ast(x^k)$ (left). Our work characterizes the convergence error $y^K-y^\ast(x^K)$ based on an auxiliary sequence \ref{eq:SSSA} (right). Here we use $\hat{y}^k$ to denote the sequence generated by \ref{eq:SSSA}.
  • Figure 2: Validation loss with different classifiers, averaged over 10 runs.
  • Figure 3: $\ell_2$-SVM with different compression rates $p$. Gradient Norm Squared: $\|\frac{1}{N}\sum_{n=1}^N\nabla f_n(x^k)\|^2$. Momentum Bias: $\frac{1}{N}\sum_{n=1}^N\|y_n^k-\nabla f_n(x^k)\|^2$.
  • Figure 4: Training loss with different compression rates $p$. ST and TT represent single-timescale and two-timescale step sizes, respectively.
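The two quantities named in the Figure 3 caption can be computed as in the following sketch. It assumes the per-worker gradients $\nabla f_n(x^k)$ and momentum variables $y_n^k$ are stacked row-wise into arrays of shape `(N, d)`; the function names and array layout are hypothetical and not taken from the paper's code.

```python
import numpy as np


def gradient_norm_squared(grads):
    """Squared norm of the averaged gradient: ||(1/N) sum_n grad_n||^2.

    grads: array of shape (N, d), one local gradient per row (assumed layout).
    """
    mean_grad = np.mean(grads, axis=0)
    return float(np.dot(mean_grad, mean_grad))


def momentum_bias(momenta, grads):
    """Average squared gap: (1/N) sum_n ||y_n - grad_n||^2.

    momenta: array of shape (N, d), one momentum variable per row (assumed layout).
    """
    diffs = momenta - grads
    return float(np.mean(np.sum(diffs ** 2, axis=1)))


if __name__ == "__main__":
    grads = np.array([[1.0, 0.0], [3.0, 0.0]])
    momenta = grads + np.array([[1.0, 0.0], [0.0, 2.0]])
    print(gradient_norm_squared(grads))   # averaged gradient is [2, 0]
    print(momentum_bias(momenta, grads))  # per-worker squared gaps are 1 and 4
```

Note that the first metric averages before taking the norm, while the second takes the squared norm per worker before averaging, matching the order of operations in the caption.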

Theorems & Definitions (17)

  • Lemma 1: Existence, uniqueness and Lipschitz continuity of fixed points
  • Remark 1
  • Remark 2
  • Lemma 2: Convergence of secondary sequences
  • Lemma 3
  • Theorem 1
  • Lemma 4
  • Theorem 2
  • Lemma 5: Verifying assumptions of MSSA
  • Corollary 1
  • ...and 7 more