Over-parameterised Shallow Neural Networks with Asymmetrical Node Scaling: Global Convergence Guarantees and Feature Learning
Francois Caron, Fadhel Ayed, Paul Jung, Hoil Lee, Juho Lee, Hongseok Yang
TL;DR
The paper studies global convergence and feature learning in over-parameterised shallow neural networks with asymmetrical node scaling. It shows that, for sufficiently large width, gradient flow and gradient descent converge to a global minimum when the asymmetry parameter satisfies $\gamma>0$, and that feature learning occurs if and only if $\gamma<1$; in that regime the limit of the Neural Tangent Gram (NTG) is random. The work introduces and analyses the NTG, its mean and its limiting forms, and sketches proofs that hinge on decomposing the NTG and tracking its evolution over time. Experiments on simulated and real data corroborate the theory and reveal pruning and transfer-learning benefits of asymmetrical scaling. Overall, the findings show that non-uniform node scaling can enable feature learning and practical gains beyond what standard NTK scaling achieves in shallow networks.
Abstract
We consider gradient-based optimisation of wide, shallow neural networks, where the output of each hidden node is scaled by a positive parameter. The scaling parameters are non-identical, differing from the classical Neural Tangent Kernel (NTK) parameterisation. We prove that for large such neural networks, with high probability, gradient flow and gradient descent converge to a global minimum and can learn features in some sense, unlike in the NTK parameterisation. We perform experiments illustrating our theoretical results and discuss the benefits of such scaling in terms of prunability and transfer learning.
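To make the setup concrete, the sketch below builds a small shallow ReLU network in which each hidden node's output is multiplied by the square root of a positive scaling. The specific form of the scalings (a mixture, weighted by $\gamma$, of the uniform value $1/m$ and normalised power-law weights), the exponent `alpha`, the $1/\sqrt{d}$ input normalisation, and the function names `node_scalings` and `shallow_net` are all assumptions made for illustration; the summary above does not spell them out. With $\gamma=1$ all scalings equal $1/m$, which corresponds to the symmetric NTK-style case.

```python
import math
import random


def node_scalings(m, gamma, alpha=0.5):
    """Hypothetical scalings: gamma * (1/m) + (1 - gamma) * normalised power-law weights.

    The power-law form j ** (-1 / alpha) is an assumption for illustration only.
    """
    raw = [(j + 1) ** (-1.0 / alpha) for j in range(m)]
    total = sum(raw)
    return [gamma / m + (1.0 - gamma) * r / total for r in raw]


def shallow_net(x, weights, signs, scalings):
    """Output of a one-hidden-layer ReLU network with per-node output scaling."""
    d = len(x)
    out = 0.0
    for w, a, lam in zip(weights, signs, scalings):
        # Pre-activation, normalised by sqrt(d) (an assumed convention).
        pre = sum(wi * xi for wi, xi in zip(w, x)) / math.sqrt(d)
        # Each node's ReLU output is scaled by sqrt(lam) and a fixed sign.
        out += math.sqrt(lam) * a * max(pre, 0.0)
    return out


rng = random.Random(0)
m, d = 8, 3
weights = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]
signs = [rng.choice([-1.0, 1.0]) for _ in range(m)]
x = [0.5, -1.0, 2.0]

ntk_like = node_scalings(m, gamma=1.0)   # symmetric: every node gets 1/m
asym = node_scalings(m, gamma=0.5)       # asymmetric: early nodes get more weight

y_sym = shallow_net(x, weights, signs, ntk_like)
y_asym = shallow_net(x, weights, signs, asym)
```

In this sketch the scalings sum to one for every $\gamma$, so changing $\gamma$ only redistributes weight across nodes: as $\gamma$ decreases, a few nodes dominate the output, which is the kind of asymmetry the abstract refers to.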
