Improving Generalization of Deep Neural Networks by Leveraging Margin Distribution

Shen-Huan Lyu, Lu Wang, Zhi-Hua Zhou

TL;DR

The paper reframes generalization in deep neural networks through margin distribution instead of the traditional minimum margin, introducing mean margin $r$, margin standard deviation $\theta$, and margin ratio $\lambda=\theta/r$. It proves a PAC-Bayesian generalization bound based on the entire margin distribution, showing the bound tightens as $\lambda$ decreases, and proposes Margin Distribution Networks (mdNet) with a convex loss $\ell_{r,\theta,\eta}$ to optimize the margin band around $r$. Empirical results on MNIST, CIFAR-10, and ImageNet demonstrate mdNet yields better generalization, more separable representations, and faster convergence, especially under limited data. The work provides both theoretical and practical tools for margin-distribution-aware learning, with implications for regularization and model capacity control in deep learning.
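The three margin statistics named above can be computed directly from a model's output scores. The sketch below is illustrative only: it assumes the standard multiclass margin (true-class score minus the largest other-class score), which may differ from the normalized margin the paper uses, and the function name `margin_statistics` and the toy logits are hypothetical.

```python
import numpy as np

def margin_statistics(logits, labels):
    """Return (mean margin r, margin std theta, margin ratio lambda = theta / r).

    Assumes the common multiclass margin: score of the true class minus the
    highest score among the other classes. This definition is an assumption,
    not taken from the paper.
    """
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels)
    idx = np.arange(len(labels))
    true_scores = logits[idx, labels]
    # Mask out the true class so the max picks the best competing class.
    masked = logits.copy()
    masked[idx, labels] = -np.inf
    margins = true_scores - masked.max(axis=1)
    r = margins.mean()
    theta = margins.std()  # population standard deviation
    return r, theta, theta / r

# Toy scores for four examples and three classes (made-up values).
logits = np.array([[3.0, 1.0, 0.0],
                   [0.5, 2.5, 1.5],
                   [1.0, 0.0, 4.0],
                   [2.0, 1.0, 0.0]])
labels = np.array([0, 1, 2, 0])
r, theta, lam = margin_statistics(logits, labels)
print(f"r={r:.4f}, theta={theta:.4f}, lambda={lam:.4f}")
```

The ratio is only meaningful when the mean margin $r$ is positive; a smaller $\lambda$ corresponds to margins that are large on average and tightly concentrated, which is the regime in which the summary says the bound tightens.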

Abstract

Recent research has used margin theory to analyze the generalization performance of deep neural networks (DNNs). Existing results are almost all based on the spectrally-normalized minimum margin. However, optimizing the minimum margin ignores much of the information in the entire margin distribution, which is crucial to generalization performance. In this paper, we prove a generalization upper bound dominated by the statistics of the entire margin distribution. Compared with minimum-margin bounds, our bound highlights an important measure for controlling complexity: the ratio of the margin standard deviation to the expected margin. We apply a convex margin distribution loss function to deep neural networks to validate our theoretical results by optimizing the margin ratio. Experiments and visualizations confirm the effectiveness of our approach and the correlation between the generalization gap and the margin ratio.

Paper Structure

This paper contains 22 sections, 7 theorems, 41 equations, 7 figures, 1 table.

Key Result

Theorem 4.6

(Bartlett et al., 2017; Neyshabur et al., 2018) For any $d,\rho>0$ and $\|\boldsymbol{x}\|_2\leq B$, let $f_{\boldsymbol{w}}:\mathcal{X} \rightarrow \mathbb{R}^k$ be a $d$-layer feed-forward network with ReLU activation. Then, for any $\delta > 0$, with probability $\geq 1-\delta$ over a training set of size $m$, where $L_0(\cdot)$ is the 0-1 loss, $\widehat{L}_{\gamma}(\cdot)=\Pr_{S} \left[\gamma_{h}(\boldsymb

Figures (7)

  • Figure 1: Illustration of the margin distribution analysis and loss functions.
  • Figure 2: Comparing our bound and [1] to empirical generalization error during training. All bounds are rescaled to be within the same range as the generalization error together.
  • Figure 3: Illustration of the relationship between margin distribution and allowable perturbation.
  • Figure 4: The quality of feature representations generated by different models on the MNIST, CIFAR-10 and ImageNet datasets.
  • Figure 5: Test error and margin ratio across epochs on mdNet models for MNIST, CIFAR-10 and ImageNet datasets.
  • ...and 2 more figures

Theorems & Definitions (9)

  • Theorem 4.6
  • Theorem 4.7
  • Definition 4.8
  • Lemma 4.9
  • Lemma 4.10
  • Theorem 4.11
  • Lemma 5.1
  • Lemma 5.2
  • Definition 6.1