Over-parameterised Shallow Neural Networks with Asymmetrical Node Scaling: Global Convergence Guarantees and Feature Learning

Francois Caron, Fadhel Ayed, Paul Jung, Hoil Lee, Juho Lee, Hongseok Yang

TL;DR

The paper studies global convergence and feature learning in over-parameterised shallow neural networks with asymmetrical node scaling. It shows that, for sufficiently large width, gradient flow and gradient descent converge to a global minimum when the asymmetry parameter satisfies $\gamma>0$, and that feature learning occurs if and only if $\gamma<1$; in the regime $\gamma<1$ the limiting Neural Tangent Gram (NTG) is random. The work analyses the NTG, its mean and its limiting forms, and gives proof sketches that hinge on decomposing the NTG and tracking its evolution over time. Experiments on simulated and real data corroborate the theory and show that asymmetrical scaling helps with pruning and transfer learning. Overall, the findings indicate that non-uniform node scaling enables feature learning and practical gains that standard NTK scaling does not provide in shallow networks.

Abstract

We consider gradient-based optimisation of wide, shallow neural networks, where the output of each hidden node is scaled by a positive parameter. The scaling parameters are non-identical, differing from the classical Neural Tangent Kernel (NTK) parameterisation. We prove that for large such neural networks, with high probability, gradient flow and gradient descent converge to a global minimum and can learn features in some sense, unlike in the NTK parameterisation. We perform experiments illustrating our theoretical results and discuss the benefits of such scaling in terms of prunability and transfer learning.
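
The setup in the abstract can be illustrated with a minimal NumPy sketch; it is not the authors' code. Several details are assumptions made here for illustration only: the form of the node scales (a mix of the uniform value $1/m$ and a normalised power law, weighted by $\gamma$ and $1-\gamma$), the exponent parameter `alpha`, the ReLU activation, the fixed $\pm 1$ output signs, and the $1/\sqrt{d}$ input normalisation. All function names are hypothetical.

```python
import numpy as np


def node_scales(m, gamma, alpha=0.5):
    """Assumed scales: gamma/m (uniform part) + (1-gamma) * normalised power law."""
    j = np.arange(1, m + 1, dtype=float)
    power = j ** (-1.0 / alpha)
    power /= power.sum()
    return gamma / m + (1.0 - gamma) * power  # positive and sums to 1


def init_params(m, d, rng):
    W = rng.standard_normal((m, d))        # hidden weights, trained
    a = rng.choice([-1.0, 1.0], size=m)    # output signs, held fixed (assumption)
    return W, a


def forward(X, W, a, lam):
    pre = X @ W.T / np.sqrt(X.shape[1])    # (n, m) pre-activations
    return np.maximum(pre, 0.0) @ (np.sqrt(lam) * a)


def risk(X, y, W, a, lam):
    return 0.5 * np.sum((forward(X, W, a, lam) - y) ** 2)


def tangent_gram(X, W, a, lam):
    """Gram matrix of the gradients of the n outputs with respect to W."""
    active = (X @ W.T > 0).astype(float)   # ReLU derivative, shape (n, m)
    # a_j**2 == 1, so entry (i, k) is sum_j lam_j 1_ij 1_kj * <x_i, x_k> / d
    return (active * lam) @ active.T * (X @ X.T) / X.shape[1]


def gd_step(X, y, W, a, lam, lr):
    """One full-batch gradient descent step on the squared-error risk."""
    active = (X @ W.T > 0).astype(float)
    resid = forward(X, W, a, lam) - y
    scale = np.sqrt(lam) * a
    grad = ((active * resid[:, None]) * scale).T @ X / np.sqrt(X.shape[1])
    return W - lr * grad


rng = np.random.default_rng(0)
X, y = rng.standard_normal((8, 4)), rng.standard_normal(8)
lam = node_scales(50, gamma=0.5)
W, a = init_params(50, 4, rng)
W0 = W.copy()
for _ in range(200):
    W = gd_step(X, y, W, a, lam, lr=0.02)

print("training risk:", risk(X, y, W, a, lam))
print("min NTG eigenvalue:", np.linalg.eigvalsh(tangent_gram(X, W, a, lam)).min())
print("largest weight movement:", np.linalg.norm(W - W0, axis=1).max())
```

Under these assumptions, setting `gamma=1.0` recovers uniform scales $1/m$, which corresponds to the symmetric NTK-style parameterisation. The printed quantities mirror the diagnostics reported in the paper's figures: training risk, weight movement $\Vert \mathbf{w}_{tj} - \mathbf{w}_{0j}\Vert$, and the minimum eigenvalue of the NTG.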

Paper Structure (72 sections, 27 theorems, 265 equations, 23 figures)

This paper contains 72 sections, 27 theorems, 265 equations, 23 figures.

Key Result

Proposition 4.1

When Assumptions (assump:data), (assump:activation), and (assump:init_V) hold, we have $\kappa_n>0$.

Figures (23)

  • Figure 1: Results on simulated data. From left to right, 1) training risks, 2) differences in weight norms $\Vert \mathbf{w}_{tj} - \mathbf{w}_{0j}\Vert$ with the $j$'s being those neurons which have maximal differences at the end of the training, 3) differences in NTG matrices, and 4) minimum eigenvalues of NTG matrices.
  • Figure 2: Results on simulated data from a single ReLU unit. Evolution of the training error (left) and test error (right) as a function of the training iteration.
  • Figure 3: A subset of results for the regression experiments. From left to right, 1) training risks for dataset concrete, 2) differences in weight norms $\Vert \mathbf{w}_{tj} - \mathbf{w}_{0j}\Vert$ with the $j$'s being the neurons having the maximal difference at the end of the training for dataset energy, 3) differences in NTG matrices for dataset airfoil, and 4) test risks of transferred models for dataset plant.
  • Figure 4: A subset of results for MNIST dataset. From left to right, 1) training risks, 2) differences in weight norms, 3) test accuracies of pruned models, and 4) test accuracies of transferred models.
  • Figure 5: Results for CIFAR-100. From left to right, 1) test risk through training, 2) differences in weight norms $\Vert \mathbf{w}_{tj} - \mathbf{w}_{0j}\Vert$ with the $j$'s being the neurons having the maximal difference at the end of training, 3) test risks of pruned models, and 4) test accuracies of pruned models.
  • ...and 18 more figures

Theorems & Definitions (45)

  • Proposition 4.1: Du2019 and Du2019a
  • Remark 4.2
  • Proposition 4.3
  • Theorem 5.1
  • Remark 5.2
  • Theorem 6.1
  • Definition 7.1: Feature learning
  • Remark 7.2
  • Definition 7.3: Non-uniform feature learning
  • Theorem 7.4
  • ...and 35 more