Gradient Descent Maximizes the Margin of Homogeneous Neural Networks

Kaifeng Lyu, Jian Li

TL;DR

This work asks why gradient-based training of homogeneous neural networks biases solutions toward large margins. It shows that, once the training loss falls below a threshold, a smoothed version of the normalized margin increases monotonically, and that the training dynamics converge in direction to a KKT point of a max-margin optimization problem, extending prior results for linear models to nonlinear homogeneous networks. It also gives convergence rates for the loss and for weight growth, and reports experiments suggesting robustness benefits. The framework covers exponential, logistic, and cross-entropy losses and links to kernel SVM interpretations, giving a unified view of margin-based implicit regularization and suggesting that training longer can improve robustness.

Abstract

In this paper, we study the implicit regularization of the gradient descent algorithm in homogeneous neural networks, including fully-connected and convolutional neural networks with ReLU or LeakyReLU activations. In particular, we study the gradient descent or gradient flow (i.e., gradient descent with infinitesimal step size) optimizing the logistic loss or cross-entropy loss of any homogeneous model (possibly non-smooth), and show that if the training loss decreases below a certain threshold, then we can define a smoothed version of the normalized margin which increases over time. We also formulate a natural constrained optimization problem related to margin maximization, and prove that both the normalized margin and its smoothed version converge to the objective value at a KKT point of the optimization problem. Our results generalize the previous results for logistic regression with one-layer or multi-layer linear networks, and provide more quantitative convergence results with weaker assumptions than previous results for homogeneous smooth neural networks. We conduct several experiments to justify our theoretical finding on MNIST and CIFAR-10 datasets. Finally, as margin is closely related to robustness, we discuss potential benefits of training longer for improving the robustness of the model.
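To make the two margin quantities concrete, here is a minimal sketch in plain Python. Everything in it is an illustrative assumption rather than the authors' code: the bias-free two-layer ReLU network (homogeneous of order $L = 2$), the four toy data points, the function names, and the exponential-loss form of the smoothed margin, $\log(1/\mathcal{L}) / \|\theta\|_2^L$ with $\mathcal{L} = \sum_n e^{-q_n}$. Under these assumptions the smoothed margin lies within $\log(N) / \|\theta\|_2^L$ of the normalized margin $\min_n q_n / \|\theta\|_2^L$, which is one way to read the phrase "additive approximation".

```python
import math

ORDER = 2  # homogeneity order L of the toy network below (assumption)

def forward(W, v, x):
    """Bias-free two-layer ReLU net: f(x) = sum_j v[j] * relu(<W[j], x>)."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W]
    return sum(vj * h for vj, h in zip(v, hidden))

def param_norm(W, v):
    """Euclidean norm of all parameters theta = (W, v)."""
    return math.sqrt(sum(w * w for row in W for w in row) + sum(vj * vj for vj in v))

def sample_margins(W, v, data):
    """q_n = y_n * f(x_n) for binary labels y_n in {-1, +1}."""
    return [y * forward(W, v, x) for x, y in data]

def normalized_margin(W, v, data):
    """min_n q_n / ||theta||^L."""
    return min(sample_margins(W, v, data)) / param_norm(W, v) ** ORDER

def smoothed_margin(W, v, data):
    """log(1 / loss) / ||theta||^L with loss = sum_n exp(-q_n) (assumed form)."""
    q = sample_margins(W, v, data)
    q_min = min(q)
    # Stable evaluation of -log(sum_n exp(-q_n)) by factoring out exp(-q_min).
    log_inv_loss = q_min - math.log(sum(math.exp(q_min - qn) for qn in q))
    return log_inv_loss / param_norm(W, v) ** ORDER

def scale(W, v, c):
    """Multiply every parameter by c > 0."""
    return [[c * w for w in row] for row in W], [c * vj for vj in v]

# Hypothetical toy data and weights, chosen so every point is classified correctly.
data = [((2.0, 0.5), 1), ((0.5, 2.0), -1), ((1.0, 0.0), 1), ((0.0, 3.0), -1)]
W, v = [[1.0, 0.0], [0.0, 1.0]], [1.0, -1.0]

for c in (1.0, 3.0, 10.0):
    Wc, vc = scale(W, v, c)
    hard = normalized_margin(Wc, vc, data)
    soft = smoothed_margin(Wc, vc, data)
    slack = math.log(len(data)) / param_norm(Wc, vc) ** ORDER
    print(f"scale={c:>4}: normalized={hard:.4f} smoothed={soft:.4f} gap bound={slack:.4f}")
```

Rescaling the weights here only stands in for growth of the parameter norm; it is not a training run. It shows that the normalized margin is invariant to positive rescaling, while the gap between it and the smoothed margin shrinks as the norm grows.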

Paper Structure

This paper contains 61 sections, 51 theorems, 171 equations, 8 figures, 1 table.

Key Result

Theorem 4.1

Under assumptions (A1) - (A4), there exists an $O(\| {\bm{\theta}} \|_2^{-L})$-additive approximation function $\tilde{\gamma}({\bm{\theta}})$ for the normalized margin such that the following statements are true for gradient flow:

Figures (8)

  • Figure 1: (a) Training CNNs with and without bias on MNIST, using SGD with learning rate $0.01$. The training loss (left) decreases over time, and the normalized margin (right) keeps increasing after the model is fitted, but the growth rate is slow ($\approx 1.8 \times 10^{-4}$ after $10000$ epochs). (b) Training CNNs with and without bias on MNIST, using SGD with the loss-based learning rate scheduler. The training loss (left) decreases exponentially over time ($< 10^{-800}$ after $9000$ epochs), and the normalized margin (right) increases rapidly after the model is fitted ($\approx 1.2 \times 10^{-3}$ after $10000$ epochs, $10 \times$ larger than that of SGD with learning rate $0.01$). Experimental details are in Appendix \ref{sec:exper}.
  • Figure 2: A plot for the Mexican Hat function $f(u,v)$.
  • Figure 3: Training CNNs with and without bias on MNIST, using SGD with learning rate $0.01$. The training accuracy (left) increases to $100\%$ after about $100$ epochs, and the normalized margin with the original definition (right) keeps increasing after the model is fitted.
  • Figure 4: Training CNNs with and without bias on MNIST, using SGD with the loss-based learning rate scheduler. The training accuracy (left) increases to $100\%$ after about $20$ epochs, and the normalized margin with the original definition (middle) increases rapidly after the model is fitted. The right figure shows the change of the relative learning rate $\alpha(t)$ (see \ref{eq:eta-param} for its definition) during training.
  • Figure 5: Training VGGNet with and without bias on CIFAR-10, using SGD with learning rate $0.1$.
  • ...and 3 more figures

Theorems & Definitions (102)

  • Theorem 4.1: Corollary of Theorem \ref{thm:general-loss-margin-inc}
  • Theorem 4.2: Corollary of Theorem \ref{thm:gd-margin-inc}
  • Theorem 4.3: Corollary of Theorem \ref{thm:main-loss-tail} and \ref{thm:gd-tight-rate}
  • Theorem 4.4: Corollary of Theorem \ref{thm:main-limit-dir-converge} and \ref{thm:gd-exp-loss-kkt}
  • Corollary 4.5: Corollary of Theorem \ref{thm:exp-loss-kkt}
  • Lemma 5.1: Corollary of Lemma \ref{lam:key-lemma-margin-general}
  • Proof: Proof Sketch of Lemma \ref{lam:key-lemma-margin}
  • Remark A.1
  • Remark A.2
  • Proof: Proof for Remark \ref{remark:log-b3}
  • ...and 92 more