Solution space and storage capacity of fully connected two-layer neural networks with generic activation functions

Sota Nishiyama, Masayuki Ohzeki

TL;DR

This work uses the replica method to analyze the storage capacity and solution-space structure of fully connected two-layer neural networks with generic activation functions. It shows that the storage capacity per parameter remains finite in the infinite-width limit and that the hidden weights are negatively correlated, leading to a division of labor. Increasing the dataset size drives a phase transition at which the permutation symmetry of the weights breaks. The activation function shapes the transition out of the permutation-symmetric (PS) phase: the replica-symmetric (RS) analysis predicts a continuous permutation symmetry breaking (PSB) transition for ReLU and quadratic activations, but a discontinuous one for erf, which is preceded by a spinodal point. Numerical experiments with gradient descent corroborate the qualitative predictions across activations and reveal a gap between algorithmic learnability and the theoretical capacity, attributed to nonconvexity and symmetry breaking.

Abstract

The storage capacity of a binary classification model is the maximum number of random input-output pairs per parameter that the model can learn. It is one of the indicators of the expressive power of machine learning models and is important for comparing the performance of various models. In this study, we analyze the structure of the solution space and the storage capacity of fully connected two-layer neural networks with general activation functions using the replica method from statistical physics. Our results demonstrate that the storage capacity per parameter remains finite even with infinite width and that the weights of the network exhibit negative correlations, leading to a 'division of labor'. In addition, we find that increasing the dataset size triggers a phase transition at a certain transition point where the permutation symmetry of weights is broken, resulting in the solution space splitting into disjoint regions. We identify the dependence of this transition point and the storage capacity on the choice of activation function. These findings contribute to understanding the influence of activation functions and the number of parameters on the structure of the solution space, potentially offering insights for selecting appropriate architectures based on specific objectives.
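To make the notion of "random input-output pairs per parameter" concrete, the following is a minimal sketch of how a random dataset at a given load $\alpha$ could be generated. It assumes that only the $N \times K$ hidden weights count as parameters (so $P = \alpha N K$) and that inputs are i.i.d. standard Gaussian; both are assumptions made for illustration, and the function name `random_dataset` is hypothetical.

```python
import numpy as np


def random_dataset(alpha, N, K, seed=0):
    """Draw P = round(alpha * N * K) random input-label pairs.

    Assumptions (not taken from the paper): the parameter count is N * K,
    inputs are i.i.d. standard Gaussian, and labels are uniform in {-1, +1}.
    """
    rng = np.random.default_rng(seed)
    P = int(round(alpha * N * K))          # number of patterns at load alpha
    X = rng.standard_normal((P, N))        # random inputs, one row per pattern
    y = rng.choice([-1, 1], size=P)        # random binary labels
    return X, y


if __name__ == "__main__":
    X, y = random_dataset(alpha=2.5, N=10, K=4)
    print(X.shape, y.shape)
```

In this picture, the storage capacity is the largest $\alpha$ at which some setting of the weights still classifies every pair correctly.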

Paper Structure (23 sections, 82 equations, 5 figures, 1 table)

Figures (5)

  • Figure 1: Architectures of a tree-like committee machine (TCM, left) and a fully connected committee machine (FCM, right).
  • Figure 2: (Color online) The overlap between replicas $q$ at $\kappa=0$ for activations ReLU, erf, and quadratic $x^2$. We solved the saddle-point equations \ref{eq:sp_rs_qhat}--\ref{eq:sp_rs_dbar} iteratively starting from an appropriate set of initial values. For ReLU and quadratic, the transition from the PS phase with $q=0$ to the PSB phase with $q > 0$ is continuous. The transition point $\alpha_\text{PS}$ is approximately $0.897$ for ReLU and $0.785$ for quadratic. On the other hand, in the case of erf activation, we observe a discontinuous transition to the PSB phase. At $\alpha_\text{spin}\approx 3.949$, a local minimum of $f_\text{RS}$ with $q > 0$ appears (dashed line). Above $\alpha_\text{PS}\approx 4.142$, the $q>0$ solution gives the global minimum of $f_\text{RS}$, marking a discontinuous transition to the PSB phase (solid line).
  • Figure 3: (Color online) Storage capacities of FCMs with ReLU, erf, and quadratic activations. Solid lines represent the RS storage capacities and dashed lines represent the 1-RSB storage capacities. For small $\kappa$, there is a large gap between RS and 1-RSB storage capacities, showing a strong replica symmetry-breaking effect.
  • Figure 4: (Color online) Results for ReLU. We plot the value of the loss at the end of training (left), the number of epochs to the end of training (middle), and the overlap of the hidden weights (right). The blue dot and the red square on the axis of the leftmost figure indicate the theoretical values of $\alpha_\text{PS}$ and $\alpha_\text{1-RSB}$, respectively. We observe a sudden increase in the loss at around $\alpha=2.5$ and estimate it to be the experimental storage capacity of FCM with ReLU activations. The number of training epochs until termination increases rapidly as $\alpha$ grows. The overlap between weights $c$ reaches $-1/(K-1)$ for sufficiently large $\alpha$, consistent with the replica analysis.
  • Figure 5: (Color online) Results for erf. The loss behaves differently for different input-hidden ratios $N/K$. The blue dot and the red square on the axis of the leftmost figure indicate the theoretical values of $\alpha_\text{spin}$ and $\alpha_\text{1-RSB}$, respectively. For $N/K=50$, the experimental storage capacity seems to be around 3.8 and this is well explained by our theory as stated in the main text. Since the replica analysis assumes infinite $N/K$, it fails to describe the behavior for small $N/K$. As with ReLU, the number of epochs until termination increases rapidly as $\alpha$ increases, and consistent with the theory, the order parameter $c$ converges to $-1/(K-1)$.
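
The value $-1/(K-1)$ quoted in the captions of Figures 4 and 5 is the most negative average overlap that $K$ vectors can have: for unit vectors $u_1,\dots,u_K$, $\|\sum_k u_k\|^2 = K + K(K-1)\,\bar c \ge 0$, so the mean pairwise overlap satisfies $\bar c \ge -1/(K-1)$, with equality when the vectors sum to zero. The sketch below checks this numerically. It assumes the overlap is the cosine between hidden weight vectors, which may differ from the paper's normalization; the helper names `mean_pairwise_overlap` and `simplex_weights` are hypothetical.

```python
import numpy as np


def mean_pairwise_overlap(W):
    """Average cosine overlap between distinct rows of W (shape K x N)."""
    K = W.shape[0]
    U = W / np.linalg.norm(W, axis=1, keepdims=True)  # unit-normalize each row
    G = U @ U.T                                       # Gram matrix of overlaps
    off_diagonal_sum = G.sum() - np.trace(G)          # drop the self-overlaps
    return off_diagonal_sum / (K * (K - 1))


def simplex_weights(K, N):
    """K vectors in R^N (N >= K) with equal norms that sum to zero."""
    if N < K:
        raise ValueError("this construction needs N >= K")
    W = np.zeros((K, N))
    W[:, :K] = np.eye(K) - 1.0 / K                    # basis vectors minus their centroid
    return W


if __name__ == "__main__":
    K, N = 5, 20
    rng = np.random.default_rng(0)
    print("bound:  ", -1.0 / (K - 1))
    print("simplex:", mean_pairwise_overlap(simplex_weights(K, N)))
    print("random: ", mean_pairwise_overlap(rng.standard_normal((K, N))))
```

The zero-sum configuration attains the bound, while independent random weights have an average overlap near zero. Applying the same overlap function to trained hidden weights is one way to compare an experiment against the $-1/(K-1)$ prediction.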