When narrower is better: the narrow width limit of Bayesian parallel branching neural networks
Zechen Zhang, Haim Sompolinsky
TL;DR
This work challenges the conventional wisdom that wider networks generalize better by analyzing the narrow width limit of Bayesian parallel branching networks (BPB-NNs), focusing on BPB-GCN and its extensions to residual-MLP. By formulating a Bayesian regression framework and performing a kernel renormalization analysis, it derives exact generalization expressions in the asymptotic regime with finite $\frac{P}{N}$ and shows that bias can decay and saturate at narrow widths due to branch-wise kernel changes, while branch norms exhibit equipartition with teacher branches in a student–teacher setting. The results reveal a symmetry-breaking mechanism that differentiates branches and enables robust learning, with empirical validation on Cora and applications to residual-MLP, suggesting that narrow-width regimes can rival or surpass wide-width performance in bias-limited scenarios. The findings introduce a generalized equipartition theorem for branching architectures and indicate that narrow width can act as a natural regularizer, with implications for designing stable, interpretable, and data-reflective neural networks across graph-based and more general branching structures.
Abstract
The infinite width limit of random neural networks is known to yield the Neural Network Gaussian Process (NNGP) (Lee et al. (2018)), which is characterized by task-independent kernels. It is widely accepted that larger network widths contribute to improved generalization (Park et al. (2019)). This work challenges that notion by investigating the narrow width limit of the Bayesian Parallel Branching Neural Network (BPB-NN), an architecture that resembles neural networks with residual blocks. We demonstrate that when the width of a BPB-NN is significantly smaller than the number of training examples, each branch exhibits more robust learning due to a symmetry breaking of branches in kernel renormalization. Surprisingly, in bias-limited scenarios the performance of a BPB-NN in the narrow width limit is generally superior or comparable to that achieved in the wide width limit. Furthermore, the readout norms of each branch in the narrow width limit are mostly independent of the architectural hyperparameters and instead generally reflect the nature of the data. We demonstrate this phenomenon primarily in branching graph neural networks, where each branch represents a different order of convolution on the graph; we also extend the results to more general architectures such as the residual-MLP and demonstrate that the narrow width effect is a general feature of branching networks. Our results characterize a newly defined narrow-width regime for parallel branching networks in general.
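To make the branching structure concrete, the following is a minimal sketch of a forward pass through a parallel branching graph network in which branch $l$ sees the node features after $l$ rounds of graph convolution. It is an illustration under stated assumptions, not the authors' implementation: the linear branches, the symmetric adjacency normalization, the $1/\sqrt{\cdot}$ scaling factors, and all names (`normalized_adjacency`, `branching_gcn_forward`, `weights`, `readouts`) are hypothetical choices made for this example.

```python
import numpy as np

def normalized_adjacency(adj):
    """Symmetrically normalize A + I (a common GCN convention, assumed here)."""
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def branching_gcn_forward(adj, x, weights, readouts):
    """Sum of parallel linear branches; branch l operates on A^l X.

    weights[l]  : (n_in, width) hidden weights of branch l (hypothetical name)
    readouts[l] : (width,) readout vector of branch l (hypothetical name)
    Returns the summed node outputs and the list of per-branch outputs.
    """
    a_norm = normalized_adjacency(adj)
    num_branches = len(weights)
    n_in = x.shape[1]
    propagated = x  # A^0 X for the first branch
    branch_outputs = []
    for w_l, a_l in zip(weights, readouts):
        width = w_l.shape[1]
        hidden = propagated @ w_l / np.sqrt(n_in)           # (nodes, width)
        branch_outputs.append(hidden @ a_l / np.sqrt(width * num_branches))
        propagated = a_norm @ propagated                    # next convolution order
    return np.sum(branch_outputs, axis=0), branch_outputs

# Toy example: a 4-node path graph, 3 input features, 3 branches of width 2.
rng = np.random.default_rng(0)
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
x = rng.standard_normal((4, 3))
weights = [rng.standard_normal((3, 2)) for _ in range(3)]
readouts = [rng.standard_normal(2) for _ in range(3)]
f, branches = branching_gcn_forward(adj, x, weights, readouts)
```

In this sketch the width (2) is deliberately small relative to a realistic number of training nodes, which is the regime the abstract refers to as narrow; the per-branch outputs are returned separately so that quantities such as branch readout norms could be inspected.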
