Phase transitions in the mini-batch size for sparse and dense two-layer neural networks

Raffaele Marino, Federico Ricci-Tersenghi

TL;DR

The paper addresses how the mini-batch size $m$ governs learning dynamics and generalization in two-layer neural networks under a teacher–student framework with sparse teachers. It introduces four models and two training algorithms (greedy Metropolis updates for discrete weights and SGD for continuous weights) to reveal a phase-transition-like dependence on $m$, characterized by a critical value $m_c$ beyond which learning or perfect generalization becomes possible. The key finding is that $m_c$ scales roughly linearly with the per-layer connectivity $d$ (and with $N$ in some setups), with the transition being first-order or second-order depending on topology and output structure. This work suggests that the mini-batch size acts as a thermodynamic-like control parameter linking information-per-parameter to algorithmic feasibility, offering practical guidance for batch-size selection and insights into learning dynamics in sparse vs dense architectures.

Abstract

Training artificial neural networks on mini-batches of data is now very common. Despite this broad usage, theories that explain quantitatively how large or small the optimal mini-batch size should be are missing. This work presents a systematic attempt at understanding the role of the mini-batch size in training two-layer neural networks. Working in the teacher-student scenario, with a sparse teacher, and focusing on tasks of different complexity, we quantify the effects of changing the mini-batch size $m$. We find that the generalization performance of the student often depends strongly on $m$ and may undergo sharp phase transitions at a critical value $m_c$, such that for $m<m_c$ the training process fails, while for $m>m_c$ the student learns the teacher perfectly or generalizes very well. Phase transitions are induced by collective phenomena first discovered in statistical mechanics and later observed in many fields of science. Observing a phase transition by varying the mini-batch size across different architectures raises several questions about the role of this hyperparameter in the neural network learning process.
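To make the teacher-student setting with a sparse teacher concrete, the following is a minimal sketch of how labeled mini-batches of size $m$ could be generated. It is not the authors' code: the function names, the $\pm 1$ weights and inputs, the sign activation, and the square $N \times N$ layers are all assumptions chosen for illustration. Only the constraint of $d$ non-zero entries per row and column is taken from the figure captions.

```python
import random


def sparse_teacher_matrix(n, d, rng):
    """Build an n x n matrix with exactly d non-zero entries per row and column.

    Assumption: non-zero weights are +1 or -1. The sparsity pattern is a
    circulant band whose rows and columns are randomly permuted, which
    keeps the row and column counts equal to d.
    """
    rows = list(range(n))
    cols = list(range(n))
    rng.shuffle(rows)
    rng.shuffle(cols)
    w = [[0] * n for _ in range(n)]
    for i in range(n):
        for k in range(d):
            w[rows[i]][cols[(i + k) % n]] = rng.choice((-1, 1))
    return w


def sign(v):
    # Assumed hidden-layer activation; ties are mapped to +1.
    return 1 if v >= 0 else -1


def teacher_output(w_in, w_out, x):
    """Two-layer teacher: linear read-out of sign-activated hidden units."""
    hidden = [sign(sum(w * xj for w, xj in zip(row, x))) for row in w_in]
    return [sum(w * hj for w, hj in zip(row, hidden)) for row in w_out]


def sample_minibatch(w_in, w_out, m, rng):
    """Draw m random +/-1 inputs and label each one with the teacher."""
    n = len(w_in)
    batch = []
    for _ in range(m):
        x = [rng.choice((-1, 1)) for _ in range(n)]
        batch.append((x, teacher_output(w_in, w_out, x)))
    return batch
```

A student would then be trained on successive calls to `sample_minibatch`, with `m` playing the role of the control parameter that is varied in the experiments.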

Paper Structure (11 sections, 8 equations, 8 figures)

Figures (8)

  • Figure 1: Validation loss normalized by the number of non-zero elements, averaged over $100$ samples, as a function of the mini-batch size $m$, for different values of $d$ and $N$. Error bars are the standard deviation of the mean.
  • Figure 2: Left: Hamming distance averaged over $100$ samples as a function of the mini-batch size $m$, for different values of $d$, the number of non-zero entries in each row and column of each matrix $\mathbf{W}^*_{\text{in/out}}$. Error bars are the standard deviation of the mean. Right: Fraction of Hamming distance trajectories that reach an asymptotic value smaller than $0.25$, as a function of $m-m_c(d)$. Plotted this way, all the curves collapse onto a single step function.
  • Figure 3: The critical mini-batch size $m_c$ as a function of $d$. The values of $m_c(d)$ are well fitted by a linear function of $d$.
  • Figure 4: Averaged validation loss normalized by the number of non-zero elements as a function of the mini-batch size $m$, for different values of $N$ and $d_s$, averaged over $100$ samples (error bars are the standard deviation of the mean). In the left panel $d_t=2$ is fixed, while in the right panel $d_t=4$ is fixed.
  • Figure 5: TP and TN rates measure the fraction of correctly inferred non-zero and zero weights in the matrix $\mathbf{W}_\text{in}$. They are plotted as a function of the mini-batch size $m$ for several values of the network connectivity: $d_t=2, d_s=3$ (top left), $d_t=2, d_s=4$ (top right), $d_t=4, d_s=6$ (bottom left), $d_t=4, d_s=8$ (bottom right).
  • ...and 3 more figures
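
The quantities named in the captions above (Hamming distance, fraction of trajectories below $0.25$, TP and TN rates, and the linear fit of $m_c(d)$) can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, the Hamming distance is assumed to be normalized by the total number of entries, and the TP and TN rates are assumed to compare only the support (zero versus non-zero) of student and teacher weights.

```python
def hamming_distance(student, teacher):
    """Fraction of matrix entries where student and teacher differ.

    Assumption: normalization by the total number of entries.
    """
    flat_s = [v for row in student for v in row]
    flat_t = [v for row in teacher for v in row]
    return sum(s != t for s, t in zip(flat_s, flat_t)) / len(flat_t)


def tp_tn_rates(student, teacher):
    """TP rate: fraction of teacher non-zeros that are non-zero in the student.
    TN rate: fraction of teacher zeros that are zero in the student.
    """
    tp = tn = pos = neg = 0
    for s_row, t_row in zip(student, teacher):
        for s, t in zip(s_row, t_row):
            if t != 0:
                pos += 1
                tp += s != 0
            else:
                neg += 1
                tn += s == 0
    return tp / pos, tn / neg


def success_fraction(final_distances, threshold=0.25):
    """Fraction of runs whose asymptotic Hamming distance is below threshold."""
    return sum(d < threshold for d in final_distances) / len(final_distances)


def linear_fit(xs, ys):
    """Ordinary least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x
```

Given measured pairs of $d$ and $m_c(d)$, `linear_fit` returns the slope and intercept of the kind of linear relation described for Figure 3; no values from the paper are reproduced here.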