On the Rate of Convergence of GD in Non-linear Neural Networks: An Adversarial Robustness Perspective

Guy Smorodinsky, Sveta Gimpleson, Itay Safran

TL;DR

It is proved that even under strong simplifying assumptions, while GD successfully converges to an optimal robustness margin, this convergence occurs at a prohibitively slow rate, scaling strictly as $\Theta(1/\ln(t))$.
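To get a feel for how slow a $\Theta(1/\ln(t))$ rate is, the following minimal sketch inverts the bound: if the gap to the optimal margin behaves like $c/\ln(t)$, then reaching a gap of $\varepsilon$ requires roughly $t \ge e^{c/\varepsilon}$ iterations. The constant `c` and the helper name `log10_steps_for_gap` are hypothetical and chosen only for illustration; the source does not state the constants hidden in the $\Theta(\cdot)$.

```python
import math


def log10_steps_for_gap(eps: float, c: float = 1.0) -> float:
    """Return log10 of the smallest t with c / ln(t) <= eps, i.e. t = exp(c / eps).

    `c` is a hypothetical constant standing in for the one hidden in Theta(1/ln t).
    The result is reported in log10 because t itself overflows quickly.
    """
    return (c / eps) / math.log(10.0)


for eps in (0.1, 0.01, 0.001):
    print(f"gap <= {eps}: t >= 10^{log10_steps_for_gap(eps):.1f}")
```

Each tenfold reduction of the target gap multiplies the exponent of the required iteration count by ten, which is the sense in which the rate is prohibitive.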

Abstract

We study the convergence dynamics of Gradient Descent (GD) in a minimal binary classification setting, consisting of a two-neuron ReLU network and two training instances. We prove that even under these strong simplifying assumptions, while GD successfully converges to an optimal robustness margin, effectively maximizing the distance between the decision boundary and the training points, this convergence occurs at a prohibitively slow rate, scaling strictly as $\Theta(1/\ln(t))$. To the best of our knowledge, this establishes the first explicit lower bound on the convergence rate of the robustness margin in a non-linear model. Through empirical simulations, we further demonstrate that this inherent failure mode is pervasive, exhibiting the exact same tight convergence rate across multiple natural network initializations. Our theoretical guarantees are derived via a rigorous analysis of the GD trajectories across the distinct activation patterns of the model. Specifically, we develop tight control over the system's dynamics to bound the trajectory of the decision boundary, overcoming the primary technical challenge introduced by the non-linear nature of the architecture.
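The setting described above can be illustrated with a small simulation. This is only a sketch under assumptions that are not stated in the abstract: a scalar-input network $\Phi(\boldsymbol{\theta};x) = \mathrm{ReLU}(w_1 x + b_1) - \mathrm{ReLU}(w_2 x + b_2)$ with $\boldsymbol{\theta} = (w_1, b_1, w_2, b_2)$, training points $(1, +1)$ and $(-1, -1)$, the exponential loss, and a hand-picked initialization and step size. It is not the authors' experimental code.

```python
import math

# Assumed training set: (x, y) pairs. Not taken from the source.
DATA = [(1.0, 1.0), (-1.0, -1.0)]


def relu(z):
    return z if z > 0.0 else 0.0


def phi(theta, x):
    # Assumed architecture: difference of two ReLU neurons on a scalar input.
    w1, b1, w2, b2 = theta
    return relu(w1 * x + b1) - relu(w2 * x + b2)


def loss_and_grad(theta):
    """Exponential loss sum_i exp(-y_i * phi(x_i)) and its gradient in theta."""
    w1, b1, w2, b2 = theta
    loss, grad = 0.0, [0.0, 0.0, 0.0, 0.0]
    for x, y in DATA:
        a1 = 1.0 if w1 * x + b1 > 0.0 else 0.0  # activation indicator, neuron 1
        a2 = 1.0 if w2 * x + b2 > 0.0 else 0.0  # activation indicator, neuron 2
        ell = math.exp(-y * phi(theta, x))
        loss += ell
        dphi = (a1 * x, a1, -a2 * x, -a2)  # d phi / d (w1, b1, w2, b2)
        for k in range(4):
            grad[k] += -y * ell * dphi[k]
    return loss, grad


def boundary(theta):
    # Zero of phi on the interval where both neurons are active.
    w1, b1, w2, b2 = theta
    return (b2 - b1) / (w1 - w2)


def run_gd(theta0, lr=0.1, steps=100_000):
    theta = list(theta0)
    checkpoints = {10 ** k for k in range(6)}  # t = 1, 10, ..., 100000
    history = []
    for t in range(1, steps + 1):
        loss, grad = loss_and_grad(theta)  # loss before the t-th update
        theta = [p - lr * g for p, g in zip(theta, grad)]
        if t in checkpoints:
            history.append((t, loss, boundary(theta)))
    return history


# Hypothetical initialization that already separates the two points.
history = run_gd((1.0, 0.2, -1.0, 0.0))
for t, loss, x_star in history:
    print(f"t={t:>6}  loss={loss:.2e}  boundary={x_star:+.5f}")
```

With these assumed points, the maximum-margin boundary sits at $x = 0$. The loss is expected to fall quickly across the logged checkpoints while the boundary approaches $0$ far more slowly, which is the qualitative behavior the abstract describes.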

Paper Structure (33 sections, 23 theorems, 160 equations, 3 figures)


Key Result

Theorem 2.2

Let $\Phi(\boldsymbol{\theta};\cdot)$ be a depth-$2$ ReLU neural network parameterized by $\boldsymbol{\theta}$. Consider minimizing either the exponential or the logistic loss over a binary classification dataset $\{(x_i,y_i)\}_{i=1}^n$ using gradient flow (GF). Assume that there exists a time $t_0$ such that […] Moreover, $\mathcal{L}(\boldsymbol{\theta}(t)) \to 0$ and $\|\boldsymbol{\theta}(t)\| \to \infty$ as $t \to \infty$.

Figures (3)

  • Figure 1: The limiting network $\Phi(\boldsymbol{\theta}^\star; x)$ for $\boldsymbol{\theta}^\star = (0.5, 0.5, -0.5, 0.5)$. The network achieves an optimal robustness margin $\gamma^\star = 1$, perfectly separating the training samples $(x_1, y_1)$ and $(x_2, y_2)$ indicated by the red and blue markers, respectively.
  • Figure 2: Simplified optimization dynamics after the loss decreases below $0.5$. The training trajectory transitions between different states as depicted by the figure. The intermediate state can bifurcate depending on the initial orientations of the neurons. Moreover, in the final state, a degenerate interval may arise in which no neuron is active; this case requires a separate technical analysis to establish that after a finite number of iterations, the degenerate interval vanishes.
  • Figure 3: Empirical distribution of the lower-bound guarantees obtained in the experiment. Among 10,000 GD initializations, 2,368 converged to a small training loss. Our simulation conditions all satisfy the monotonicity criterion established in Lemma \ref{lem:monotone_limit}, which ensures that the absolute value of the numerator in Equation (\ref{eq:x_star_body}) converges monotonically to $\tfrac{1}{2}|w_1^{(0)}+w_2^{(0)}|$. This implies the lower bound $\min\!\left\{|b_2^{(t)}-b_1^{(t)}|,\;\tfrac{1}{2} |w_1^{(0)}+w_2^{(0)}|\right\}$. The red curve is the theoretical density (scaled to match histogram counts) of $\tfrac{1}{2}|w_1^{(0)}+w_2^{(0)}|$, given in Equation (\ref{eq:f_B}). The close match suggests that this is indeed the distribution of the numerator, not just asymptotically, but one that is also attained in practice.
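
The limiting network in Figure 1 can be sanity-checked numerically. The sketch below assumes the parameter ordering $\boldsymbol{\theta} = (w_1, b_1, w_2, b_2)$, the form $\Phi(\boldsymbol{\theta};x) = \mathrm{ReLU}(w_1 x + b_1) - \mathrm{ReLU}(w_2 x + b_2)$, and training samples at $x = \pm 1$. None of these is spelled out in the captions; they are one reading that happens to reproduce the stated margin $\gamma^\star = 1$.

```python
def relu(z: float) -> float:
    return max(z, 0.0)


def phi(theta, x: float) -> float:
    # Assumed ordering (w1, b1, w2, b2) and assumed two-neuron form.
    w1, b1, w2, b2 = theta
    return relu(w1 * x + b1) - relu(w2 * x + b2)


theta_star = (0.5, 0.5, -0.5, 0.5)     # value quoted in the Figure 1 caption
samples = [(1.0, 1.0), (-1.0, -1.0)]   # assumed (x, y) training pairs

# Network output at the samples and at the midpoint between them.
for x, y in samples:
    print(f"x={x:+.1f}  y={y:+.1f}  phi={phi(theta_star, x):+.1f}")
print(f"phi at x=0: {phi(theta_star, 0.0):+.1f}")

# Distance from the decision boundary (x = 0 under these assumptions)
# to the nearest sample.
margin = min(abs(x - 0.0) for x, _ in samples)
print(f"margin: {margin}")
```

Under this reading the network equals $x$ on $[-1, 1]$, so it labels both samples correctly and changes sign exactly halfway between them.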

Theorems & Definitions (44)

  • Definition 2.1: Directional convergence
  • Theorem 2.2: lyu2019gradient, ji2020directional
  • Theorem 3.1
  • Theorem 5.1
  • Theorem 5.2
  • Corollary 5.3
  • Lemma B.1
  • proof
  • Lemma B.2
  • proof
  • ...and 34 more