When narrower is better: the narrow width limit of Bayesian parallel branching neural networks

Zechen Zhang, Haim Sompolinsky

TL;DR

This work challenges the conventional wisdom that wider networks generalize better by analyzing the narrow width limit of Bayesian parallel branching networks (BPB-NNs), focusing on BPB-GCN and its extensions to residual-MLP. By formulating a Bayesian regression framework and performing a kernel renormalization analysis, it derives exact generalization expressions in the asymptotic regime with finite $\frac{P}{N}$ and shows that bias can decay and saturate at narrow widths due to branch-wise kernel changes, while branch norms exhibit equipartition with teacher branches in a student–teacher setting. The results reveal a symmetry-breaking mechanism that differentiates branches and enables robust learning, with empirical validation on Cora and applications to residual-MLP, suggesting that narrow-width regimes can rival or surpass wide-width performance in bias-limited scenarios. The findings introduce a generalized equipartition theorem for branching architectures and indicate that narrow width can act as a natural regularizer, with implications for designing stable, interpretable, and data-reflective neural networks across graph-based and more general branching structures.

Abstract

The infinite width limit of random neural networks is known to yield the Neural Network Gaussian Process (NNGP) (Lee et al. (2018)), characterized by task-independent kernels. It is widely accepted that larger network widths contribute to improved generalization (Park et al. (2019)). This work challenges that notion by investigating the narrow width limit of the Bayesian Parallel Branching Neural Network (BPB-NN), an architecture that resembles neural networks with residual blocks. We demonstrate that when the width of a BPB-NN is significantly smaller than the number of training examples, each branch exhibits more robust learning due to a symmetry breaking of branches in kernel renormalization. Surprisingly, in bias-limited scenarios the performance of a BPB-NN in the narrow width limit is generally superior or comparable to that achieved in the wide width limit. Furthermore, the readout norms of each branch in the narrow width limit are mostly independent of the architectural hyperparameters but generally reflect the nature of the data. We demonstrate this phenomenon primarily in branching graph neural networks, where each branch represents a different order of convolution on the graph; we also extend the results to more general architectures such as the residual-MLP and demonstrate that the narrow width effect is a general feature of branching networks. Our results characterize a newly defined narrow-width regime for parallel branching networks in general.
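
To make the branching structure concrete, the following is a minimal NumPy sketch of a forward pass in which branch $l$ applies $l$ graph convolutions and the output $f$ is the sum of the branch-level readouts $f_l$. Several details are assumptions made only for illustration and are not taken from the text above: the branches are linear, the $1/\sqrt{N_0}$ and $1/\sqrt{NL}$ scalings, the Gaussian weight draws with standard deviation `sigma_w`, the toy ring graph, and every function and variable name.

```python
import numpy as np

def bpb_gcn_forward(adj, x, hidden_weights, readouts):
    """Return (f, [f_0, ..., f_{L-1}]) for a linear parallel branching GCN.

    Branch l sees adj^l @ x, uses its own (unshared) weights, and contributes
    a scalar readout per node. All scalings are illustrative assumptions.
    """
    n0 = x.shape[1]
    num_branches = len(hidden_weights)
    propagated = x  # adj^0 @ x for branch 0
    branch_outputs = []
    for w_l, a_l in zip(hidden_weights, readouts):
        width = w_l.shape[1]
        h_l = propagated @ w_l / np.sqrt(n0)             # hidden layer, shape (nodes, N)
        f_l = h_l @ a_l / np.sqrt(width * num_branches)  # branch readout, shape (nodes,)
        branch_outputs.append(f_l)
        propagated = adj @ propagated                    # next convolution order
    f = np.sum(branch_outputs, axis=0)
    return f, branch_outputs

# Toy setup (hypothetical sizes): 5 nodes on a ring with self-loops.
rng = np.random.default_rng(0)
num_nodes, n0, width, num_branches = 5, 3, 4, 2
eye = np.eye(num_nodes)
ring = eye + np.roll(eye, 1, axis=1) + np.roll(eye, -1, axis=1)
adj = ring / ring.sum(axis=1, keepdims=True)  # row-normalized adjacency

sigma_w = 1.0
x = rng.normal(size=(num_nodes, n0))
hidden_weights = [sigma_w * rng.normal(size=(n0, width)) for _ in range(num_branches)]
readouts = [sigma_w * rng.normal(size=width) for _ in range(num_branches)]

f, branches = bpb_gcn_forward(adj, x, hidden_weights, readouts)

# Per-branch statistic of the form ||a_l||^2 / N * sigma_w^2.
readout_norms = [float(a @ a) / width * sigma_w**2 for a in readouts]
```

Because each branch keeps its own weights, `readout_norms` holds one value per branch. This is the kind of branch-wise quantity whose dependence on width the abstract describes.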

Paper Structure (39 sections, 70 equations, 16 figures)


Figures (16)

  • Figure 1: Overview of the main takeaway: BPB-GCN learns robust representations for each branch at narrow width. (a) The parallel branching GCN architecture, with 2 branches. The independent branches have non-sharing weights and produce the final output $f$ as a sum of branch-level readouts $f_l$. (b) Student and teacher readout norms squared for wide and narrow student BPB-GCN networks. The student network with width $N$ is trained with the teacher network's output. Histograms correspond to the samples from Hamiltonian Monte Carlo simulations and solid lines correspond to the order parameters calculated theoretically. $\sigma_t = \sigma_w = 1$. At $N=4$, the HMC samples of branch readout norms squared (orange and red histograms) for the student network $\frac{\|a_l\|^2}{N} \sigma_w^2$ concentrate at their respective theoretical values $u_l\sigma_w^2$ and overlap with the teacher's readout norms squared $\frac{\|A_l\|^2}{N} \sigma_t^2$ (orange and red dashed lines) for corresponding branches. At $N=1024$ the samples for the student network (blue and green histograms) concentrate at their respective theoretical values but remain far from the teacher's values, instead approaching the GP limit $\sigma_w^4$ (blue dashed line).
  • Figure 2: Statistical average of student readout norms squared as a function of network width from theory and HMC sampling, for the student-teacher tasks described in Section \ref{ssec:symmetry}. (a): $\langle \|a_l\|^2 \rangle\sigma_w^2/N$ as a function of network width $N$ for a fixed $\sigma_w$. The branch norms break the GP symmetry as the network approaches the narrow width limit. (b)(c): Branch 0 and branch 1 readout norm squared, respectively, for a range of regularization values $\sigma_w$. The student branch norms with different regularization strengths all converge to the same teacher readout norm values at narrow width.
  • Figure 3: Student network squared bias and variance for individual branches as a function of network width $N$ and regularization strength $\sigma_w$. The mean and variance of the branch $l$ readout $f_l^{\mu}$ for node $\mu$ are calculated in Appendix \ref{appx:generalization}, and the bias and variance for branch $l$ can be inferred analogously to Eq. \ref{eq:biasvar}. Generalization values are normalized over the average true readout labels.
  • Figure 4: Student network generalization performance as a function of network width $N$ and regularization strength $\sigma_w$. Generalization is normalized over the average true readout labels.
  • Figure 5: Cora generalization performance vs. network width $N$ and branch number $L$, for various regularization strengths $\sigma_w$. The accuracy is computed by turning the mean predictor from HMC samples into a class label using its sign.
  • ...and 11 more figures
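
The accuracy rule in the Figure 5 caption (average the predictor over HMC samples, then take the sign) can be sketched as follows. The $\pm 1$ label encoding, the tie-breaking at exactly zero, the array layout, and the toy numbers are assumptions made for illustration and are not results from the paper.

```python
import numpy as np

def accuracy_from_samples(predictor_samples, labels):
    """Accuracy of the sign of the sample-averaged predictor.

    predictor_samples: shape (num_samples, num_nodes), one row per posterior sample.
    labels: shape (num_nodes,), entries in {-1, +1} (assumed encoding).
    """
    mean_predictor = predictor_samples.mean(axis=0)
    predicted = np.where(mean_predictor >= 0, 1, -1)  # a zero mean maps to +1 (assumption)
    return float(np.mean(predicted == labels))

# Toy values standing in for HMC samples of the predictor on four nodes.
samples = np.array([
    [0.8, -0.3,  0.1, -1.2],
    [1.1, -0.5, -0.4, -0.9],
    [0.9,  0.2, -0.3, -1.0],
])
labels = np.array([1, -1, 1, -1])
acc = accuracy_from_samples(samples, labels)  # 0.75: the third node is misclassified
```

Averaging before taking the sign means a single outlying sample, such as the third row's second entry, does not flip the predicted label.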