A path-norm toolkit for modern networks: consequences, promises and challenges

Antoine Gonon, Nicolas Brisebarre, Elisa Riccietti, Rémi Gribonval

TL;DR

This paper develops a versatile path-norm toolkit for general DAG ReLU networks that include biases, skip connections, and pooling, addressing the limitations of previous path-norm definitions. It introduces a generalized path-lifting $\Phi^G(\bm\theta)$ and path-activations $\mathbf{A}^G(\bm\theta,x)$, defining $L^q$ path-norms and mixed path-norms that yield end-to-end Lipschitz bounds and tighter comparisons to products of operator norms. A new generalization bound is derived for cross-entropy loss on arbitrary DAG ReLU architectures, incorporating depth, pooling variety, and output dimensions; contraction lemmas and a peeling argument underpin the bound, which can be tightened via margin-based analyses for top-1 accuracy. Empirical results on ImageNet with ResNets reveal a gap between theory and practice for dense models, while sparsity can substantially reduce the bound, suggesting practical avenues to close the gap. Overall, the work provides the first comprehensive framework for path-norm based generalization on modern networks and highlights concrete directions to bring theory closer to observed performance in real-world settings.

Abstract

This work introduces the first toolkit around path-norms that fully encompasses general DAG ReLU networks with biases, skip connections and any operation based on the extraction of order statistics: max pooling, GroupSort etc. This toolkit notably allows us to establish generalization bounds for modern neural networks that are not only the most widely applicable path-norm based ones, but also recover or beat the sharpest known bounds of this type. These extended path-norms further enjoy the usual benefits of path-norms: ease of computation, invariance under the symmetries of the network, and improved sharpness on layered fully-connected networks compared to the product of operator norms, another complexity measure most commonly used. The versatility of the toolkit and its ease of implementation allow us to challenge the concrete promises of path-norm-based generalization bounds, by numerically evaluating the sharpest known bounds for ResNets on ImageNet.
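
The "ease of computation" mentioned above can be illustrated on the simplest case. The sketch below is not the paper's toolkit: it assumes a layered, fully-connected, bias-free ReLU network with no pooling, for which the $L^q$ path-norm (the $\ell^q$ norm of the vector of products of weights along each input-to-output path) is known to be computable with one forward pass on the weights $|w|^q$ applied to an all-ones input. The function names and the toy weights are hypothetical.

```python
from itertools import product


def path_norm_forward(weights, q=1):
    """L^q path-norm via a single forward pass (layered, bias-free case only).

    weights[l][i][j] is the weight from neuron j of layer l to neuron i of layer l+1.
    """
    v = [1.0] * len(weights[0][0])  # all-ones input
    for W in weights:
        # Linear pass with entrywise |w|^q; no ReLU is needed because
        # every intermediate value is already non-negative.
        v = [sum(abs(w) ** q * x for w, x in zip(row, v)) for row in W]
    return sum(v) ** (1.0 / q)


def path_norm_bruteforce(weights, q=1):
    """Same quantity, obtained by enumerating every input-to-output path."""
    sizes = [len(weights[0][0])] + [len(W) for W in weights]
    total = 0.0
    for path in product(*(range(s) for s in sizes)):
        p = 1.0
        for layer, W in enumerate(weights):
            p *= W[path[layer + 1]][path[layer]]
        total += abs(p) ** q
    return total ** (1.0 / q)


# Hypothetical toy network: 2 inputs -> 3 hidden neurons -> 2 outputs.
weights = [
    [[1.0, -2.0], [0.5, 0.0], [3.0, 1.0]],
    [[1.0, -1.0, 2.0], [0.0, 4.0, -0.5]],
]

for q in (1, 2, 4):
    print(q, path_norm_forward(weights, q), path_norm_bruteforce(weights, q))
```

The brute-force version is exponential in depth and is there only as a cross-check. Biases, skip connections and max-pooling neurons, which the paper's toolkit covers, are outside the scope of this sketch.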

Paper Structure (21 sections, 18 theorems, 151 equations, 5 figures, 5 tables, 1 algorithm)


Key Result

Theorem 3.1

Consider a ReLU neural network architecture $G$ (def:NN) with null biases, with input/output dimensions ${d_{\textrm{in}}}$/${d_{\textrm{out}}}$. Denote by $D$ its depth (the maximal length of a path from an input to an output) and $P:=|\{k\in{\mathbb N}_{>0}, \exists u\in N_{k\textrm{-}\mathtt{pool}}\}|$ … Consider $n+1$ iid random variables ${\mathbf Z}_i=({\mathbf X}_i,{\mathbf Y}_i)\sim \mu$, $0 \leqslant \dots$

Figures (5)

  • Figure 1: Example of a network where one must replace the max-pooling neuron to compute the path-norm with a single forward pass as in \ref{eq:ComputePathNorm}.
  • Figure 2: A network whose path-norm is zero while the product of operator norms scales as $M^2$.
  • Figure 3: Distribution of the margins on the training set of ImageNet, with the pretrained ResNets available on PyTorch.
  • Figure 4: $L^q$ path-norm ($q=1,2,4$), test top-1 accuracy, training top-1 accuracy, and the top-1 generalization error (difference between test top-1 and train top-1) during the training of a ResNet18 on ImageNet. The pruning iteration is indicated in legend, with $0$ corresponding to the dense network. The color also indicates the degree of sparsity: from dense (black) to extremely sparse (yellow).
  • Figure 5: $L^1$ path-norm and empirical generalization errors, for both the top-1 accuracy and the cross-entropy, during the training of a ResNet18 on a subset of the training images of ImageNet. The legend indicates the size of the subset considered, e.g. $1/m$ corresponds to $1/m$ of $99\%$ of ImageNet, leaving the other $1\%$ out for validation. The color also indicates the size of the subset: from small (black) to large (yellow).
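
The phenomenon described in the caption of Figure 2 can be reproduced on a small example. The network below is a hypothetical construction, not necessarily the one drawn in the figure: a two-layer bias-free network in which no hidden neuron has both a nonzero incoming and a nonzero outgoing weight, so every path product vanishes. The operator norm used here is the $\ell^\infty$ operator norm (maximum absolute row sum), chosen only to keep the sketch dependency-free; for these diagonal matrices the spectral norm is also $M$.

```python
def l1_path_norm(weights):
    """L^1 path-norm of a layered, bias-free network via one forward pass on |W|."""
    v = [1.0] * len(weights[0][0])
    for W in weights:
        v = [sum(abs(w) * x for w, x in zip(row, v)) for row in W]
    return sum(v)


def op_norm_inf(W):
    """l-infinity operator norm of a matrix: maximum absolute row sum."""
    return max(sum(abs(w) for w in row) for row in W)


def disconnected_network(M):
    """Hypothetical two-layer network whose paths all carry a zero weight."""
    W1 = [[M, 0.0], [0.0, 0.0]]  # only hidden neuron 0 receives a nonzero weight
    W2 = [[0.0, 0.0], [0.0, M]]  # only hidden neuron 1 sends a nonzero weight
    return [W1, W2]


for M in (1.0, 10.0, 100.0):
    net = disconnected_network(M)
    norm_product = 1.0
    for W in net:
        norm_product *= op_norm_inf(W)
    print(M, l1_path_norm(net), norm_product)
```

In this construction the path-norm stays at zero for every $M$, while the product of the per-layer norms grows as $M^2$, which is the kind of gap the caption refers to.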

Theorems & Definitions (48)

  • Definition 2.1
  • Definition 2.2
  • Definition 3.1
  • Theorem 3.1
  • Proof: Sketch of proof for Theorem 3.1
  • Remark 3.1: Improved bound with assumptions on $*$-max-pooling neurons
  • Theorem 3.2: Bound on the probability of misclassification
  • Definition A.1: Paths and depth in a DAG
  • Definition A.2: Sub-graph ending at a given neuron
  • Definition A.3: Path-lifting and path-activations
  • ...and 38 more