On the Convergence of Stochastic Gradient Descent with Perturbed Forward-Backward Passes

Boao Kong, Hengrui Zhang, Kun Yuan

TL;DR

This work characterizes how forward and backward perturbations propagate and amplify within a single gradient step, derives convergence guarantees for both general non-convex objectives and functions satisfying the Polyak--Łojasiewicz condition, and identifies conditions under which perturbations do not deteriorate the asymptotic convergence order.

Abstract

We study stochastic gradient descent (SGD) for composite optimization problems with $N$ sequential operators subject to perturbations in both the forward and backward passes. Unlike classical analyses that treat gradient noise as additive and localized, perturbations to intermediate outputs and gradients cascade through the computational graph, compounding geometrically with the number of operators. We present the first comprehensive theoretical analysis of this setting. Specifically, we characterize how forward and backward perturbations propagate and amplify within a single gradient step, derive convergence guarantees for both general non-convex objectives and functions satisfying the Polyak--Łojasiewicz condition, and identify conditions under which perturbations do not deteriorate the asymptotic convergence order. As a byproduct, our analysis furnishes a theoretical explanation for the gradient spiking phenomenon widely observed in deep learning, precisely characterizing the conditions under which training recovers from spikes or diverges. Experiments on logistic regression with convex and non-convex regularization validate our theories, illustrating the predicted spike behavior and the asymmetric sensitivity to forward versus backward perturbations.
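
To make the setting concrete, here is a minimal sketch (not the authors' implementation) of one perturbed forward-backward pass through $N$ sequential operators, followed by plain gradient steps. It assumes tanh-linear operators, a squared loss on a single fixed sample, and additive Gaussian perturbations with standard deviations `sigma_f` (forward) and `sigma_b` (backward). These modelling choices and all function names are illustrative assumptions; only the symbols $\sigma_f$, $\sigma_b$, and the step size $\gamma$ come from the paper's figure captions.

```python
import numpy as np


def clean_loss(weights, x, y):
    """Unperturbed loss 0.5 * ||h_N - y||^2 with h_i = tanh(W_i h_{i-1})."""
    h = x
    for W in weights:
        h = np.tanh(W @ h)
    return 0.5 * float(np.sum((h - y) ** 2))


def perturbed_grad(weights, x, y, sigma_f, sigma_b, rng):
    """Gradients w.r.t. each W_i under perturbed forward and backward passes.

    Assumption: perturbations are additive Gaussian noise (hypothetical model).
    With sigma_f = sigma_b = 0 this reduces to exact backpropagation.
    """
    # Forward pass: the output of every operator is perturbed before it is
    # handed to the next one, so early errors feed all later operators.
    hs, zs = [x], []
    for W in weights:
        z = W @ hs[-1]
        h = np.tanh(z) + sigma_f * rng.standard_normal(z.shape)
        zs.append(z)
        hs.append(h)

    # Backward pass: the gradient passed to the previous operator is perturbed,
    # so late errors feed all earlier operators.
    g = hs[-1] - y  # gradient of the loss w.r.t. the (perturbed) final output
    grads = [None] * len(weights)
    for i in reversed(range(len(weights))):
        delta = g * (1.0 - np.tanh(zs[i]) ** 2)  # chain rule through tanh
        grads[i] = np.outer(delta, hs[i])        # gradient w.r.t. W_i
        g = weights[i].T @ delta + sigma_b * rng.standard_normal(hs[i].shape)
    return grads


def run(N=4, d=5, steps=200, gamma=0.05, sigma_f=0.0, sigma_b=0.0, seed=0):
    """Run gradient steps with perturbed passes; return the final clean loss."""
    rng = np.random.default_rng(seed)
    weights = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(N)]
    x = rng.standard_normal(d)
    y = 0.5 * np.tanh(rng.standard_normal(d))
    for _ in range(steps):
        grads = perturbed_grad(weights, x, y, sigma_f, sigma_b, rng)
        weights = [W - gamma * G for W, G in zip(weights, grads)]
    return clean_loss(weights, x, y)


if __name__ == "__main__":
    for sf, sb in [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)]:
        print(f"sigma_f={sf}, sigma_b={sb}: final loss {run(sigma_f=sf, sigma_b=sb):.4f}")
```

Because each forward perturbation enters every downstream operator and each backward perturbation enters every upstream one, the error in the returned gradient is not a single additive term. This is the cascading effect the abstract contrasts with classical additive-noise analyses.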

Paper Structure

This paper contains 33 sections, 11 theorems, 80 equations, 9 figures, and 1 table.

Key Result

Lemma 1

Under the smoothness assumption on the components, there exists a constant $C_v^2 \geq 1$ such that $\Vert v_i^{(t)}\Vert^2 \leq C_v^2$ for all $t = 1, 2, \ldots, T$ and $i = 1, 2, \ldots, N$.

Figures (9)

  • Figure 1: An illustration of SGD algorithm with perturbed forward and backward passes.
  • Figure 2: Comparison of gradient spike patterns leading to contrasting outcomes. The gradient norm (left) and loss (right) trajectories illustrate two scenarios: a large-magnitude spike (orange) that permits rapid recovery and convergence, versus a more moderate spike (blue) that triggers persistent deviation and eventual non-convergence.
  • Figure 3: Convergence under forward and backward computation errors for different step sizes $\gamma$ on the logistic regression task with non-convex regularization. (Left: $\sigma_f=2.0$, $\sigma_b=0.0$. Right: $\sigma_f=0.0$, $\sigma_b=2.0$.)
  • Figure 4: The relationship between the stable gradient norm and the step size $\gamma$ with forward and backward computation error for the logistic regression task with non-convex regularization. (Left: Varied $\sigma_f$ with $\sigma_b=0$. Right: Varied $\sigma_b$ with $\sigma_f=0$.)
  • Figure 5: The relationship between the stable gradient norm and the step size $\gamma$ with forward and backward computation perturbations for the logistic regression task with strongly convex (hence PL) regularization. (Left: Varied $\sigma_f$ with $\sigma_b=0$. Right: Varied $\sigma_b$ with $\sigma_f=0$.)
  • ...and 4 more figures

Theorems & Definitions (27)

  • Lemma 1
  • proof
  • Theorem 1
  • proof
  • Lemma 2
  • proof
  • Theorem 2
  • proof
  • Theorem 3
  • proof
  • ...and 17 more