Unified Description of Learning Dynamics in the Soft Committee Machine from Finite to Ultra-Wide Regimes

Assem Afanah, Bernd Rosenow

TL;DR

This work analyzes the soft committee machine (SCM) with ReLU activation in a student–teacher setting using an annealed statistical-mechanics framework. By introducing aggregated order parameters $(\tilde{Q}, \tilde{R}, \tilde{r})$, the authors derive a unified description that remains valid from the conventional regime $K \ll N$ to the ultra-wide regime $K \ge N$, with the dataset size encoded in $\alpha$ and the teacher density in $\gamma = M/N$. A central result is a second-order phase transition at $\alpha_c \approx 2\pi$ for $\gamma \ll 1$, while finite $\gamma$ erases the sharp transition and yields a smooth decrease of the generalization error $\varepsilon_g$, which in the high-data limit scales as $\varepsilon_g \propto 1/\alpha$ independent of $K$ and $\gamma$. The framework integrates known results for ReLU SCMs, demonstrates universal high-data behavior, and suggests extensions to other activations and quenched analyses, highlighting how network dimensions influence learning dynamics in shallow networks.

Abstract

We study the learning dynamics of the soft committee machine (SCM) with Rectified Linear Unit (ReLU) activation using a statistical-mechanics approach within the annealed approximation. The SCM consists of a student network with $N$ input units and $K$ hidden units trained to reproduce the output of a teacher network with $M$ hidden units. We introduce a reduced set of macroscopic order parameters that yields a unified description valid from the conventional regime $K \ll N$ to the ultra-wide limit $K \ge N$. The control parameter $\alpha$, proportional to the ratio of training samples to adjustable weights, serves as an effective measure of dataset size. For small $\gamma = M/N$, we recover a continuous phase transition at $\alpha_{c} \approx 2\pi$ from an unspecialized, permutation-symmetric state to a specialized state in which student units align with the teacher. For finite $\gamma$, the transition disappears and the generalization error decreases smoothly with dataset size, reaching a low plateau when $\gamma = 1$. In the asymptotic limit $\alpha \to \infty$, the error scales as $\varepsilon_{g} \propto 1/\alpha$, independent of $\gamma$ and $K$. The results highlight the central role of network dimensions in SCM learning and provide a framework extendable to other activations and quenched analyses.

Paper Structure

This paper contains 14 sections, 72 equations, 6 figures.

Figures (6)

  • Figure 1: Schematic diagrams of the student and teacher soft committee machines. Both networks receive an $N$-dimensional input and contain $M$ (teacher) or $K$ (student) hidden units. The corresponding input-hidden weight vectors are denoted by ${\bm{B}}_{j}$ for the teacher and ${\bm{J}}_{i}$ for the student; the input-hidden weight vectors are normalized to one. For a given input ${\bm{\xi}} \in \mathbb{R}^{N}$, the outputs of the teacher, $\tau({\bm{\xi}})$, and of the student, $\sigma({\bm{\xi}})$, are proportional to the sum of hidden-unit activations under a Rectified Linear Unit (ReLU) activation, $g(x) = x \Theta(x)$, where $\Theta(x)$ is the Heaviside step function.
  • Figure 2: Learning curves obtained by numerically minimizing the free energy, Eq. (\ref{Eq:free1}), for $(N = 10^{12}, M = 10^{6})$ and various ratios $M/K$. The unrealizable ($M/K=2$), realizable ($M/K=1$), over-realizable ($M/K=0.5$), and ultra-wide ($K \geq N$) regimes all display a phase transition from an unspecialized to a specialized phase near $\alpha_{c} \approx 2\pi$. The qualitative form of the learning curves remains similar across these regimes; only the height of the symmetric plateau decreases with increasing $K/M$. Once $K = N$, the plateau height saturates and no further decrease is observed.
  • Figure 3: Evolution of the learning curves as $M/N$ varies in the realizable case $K=M$ with $N=10^{12}$. For $M/N \ll 1$ (e.g., $M/N = 10^{-11}$ or $10^{-6}$), a phase transition occurs at $\alpha_{c} \approx 2\pi$. For large networks with $K(M) \gg 1$ hidden units (blue curve, $K = 10^{6}$), a well-defined symmetric plateau develops. When $10^{-3} \lesssim M/N \lesssim 1$, the generalization error decreases smoothly with $\alpha$, and no sharp transition is observed (shown for $M/N = 0.1$ and $0.5$). At $M/N = 1$, the generalization error immediately reaches a low, $\alpha$-independent plateau.
  • Figure 4: (a) Generalization error $\varepsilon_{g}(\tilde{Q},\tilde{R},\tilde{r})$ vs. dataset size $\alpha$ for the realizable case $(K=M)$ with $(\gamma=10^{-11},K=10)$, obtained from minimizing Eq. (\ref{Eq:free_KeM}). We compare $\varepsilon_{g}(\tilde{Q},\tilde{R},\tilde{r})$ to $\varepsilon_{g}(C,R,S)$, reproduced from the analysis by Oostwal of the generalization behavior of a ReLU-based SCM. The two formalisms agree in the unspecialized phase ($\alpha<\alpha_{c}$) and near the phase boundary $\alpha_{c}\approx2\pi$, but differ deeper in the specialized phase ($\alpha>\alpha_{c}$) due to our expansion in Eq. (\ref{Eq:Eg_expnd}). (b) Evolution of $\tilde{R}$ with $\alpha$: it grows smoothly in the unspecialized phase and then rapidly approaches $1$ beyond $\alpha_{c}$ (inset). (c) $\tilde{Q}$ decreases at small $\alpha$, then rises to a peak at $\alpha_{c}$, signaling the phase transition. (d) For $\alpha<\alpha_{c}$, $\tilde{r}\sim\mathcal{O}(1/K)$ (consistent with committee-symmetric $R_{ij}$); for $\alpha>\alpha_{c}$, specialization begins and $\tilde{r}\approx 1-2\pi/\alpha$.
  • Figure 5: (a) Generalization error for the realizable case $(K=M)$ with $(\gamma=10^{-6},K=10^{6})$, obtained by minimizing Eq. (\ref{Eq:free_KeM}) and compared against results from Oostwal in the $K\to\infty$ limit. Excellent agreement is observed for $\alpha<\alpha_{c}\approx2\pi$, while deviations appear deeper in the specialized phase ($\alpha>\alpha_{c}$) due to our expansion in order parameters. (b) $\tilde{R}$ remains $1$ for all $\alpha>0$. (c) $\tilde{Q}\approx 1$ in the unspecialized phase and near the transition, with a small correction $\mathcal{O}(1/K)$ (inset). (d) $\tilde{r}\sim\mathcal{O}(1/K)$ for $\alpha<\alpha_{c}$, then grows to $\tilde{r}\approx 1-2\pi/\alpha$ beyond $\alpha_{c}$.
  • ...and 1 more figure
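
The student–teacher setup described in the Figure 1 caption can be illustrated with a small numerical sketch. This is not the authors' free-energy minimization; it only evaluates the two networks and estimates a generalization error by Monte Carlo sampling. Several details are assumptions not stated in the captions: the $1/\sqrt{K}$ (resp. $1/\sqrt{M}$) output prefactor, i.i.d. standard-normal inputs, and the error defined as half the mean squared output difference. The function names are hypothetical.

```python
import numpy as np


def relu(x):
    """ReLU activation g(x) = x * Theta(x)."""
    return np.maximum(x, 0.0)


def random_unit_rows(rng, n_hidden, n_inputs):
    """Draw n_hidden random weight vectors in R^n_inputs, each normalized to one."""
    w = rng.standard_normal((n_hidden, n_inputs))
    return w / np.linalg.norm(w, axis=1, keepdims=True)


def scm_output(weights, xi):
    """Soft committee machine output for a batch of inputs.

    weights: (H, N) input-to-hidden weight vectors (rows normalized to one).
    xi:      (P, N) batch of inputs.
    Returns a length-P array: the sum of hidden ReLU activations,
    scaled by 1/sqrt(H) (ASSUMED prefactor; the caption only says 'proportional').
    """
    n_hidden = weights.shape[0]
    fields = xi @ weights.T  # (P, H) local fields of the hidden units
    return relu(fields).sum(axis=1) / np.sqrt(n_hidden)


def generalization_error(J, B, n_samples=20000, seed=0):
    """Monte Carlo estimate of eps_g = <(sigma - tau)^2> / 2 (ASSUMED definition)
    over i.i.d. standard-normal inputs (ASSUMED input distribution)."""
    rng = np.random.default_rng(seed)
    xi = rng.standard_normal((n_samples, J.shape[1]))
    sigma = scm_output(J, xi)  # student output
    tau = scm_output(B, xi)    # teacher output
    return 0.5 * np.mean((sigma - tau) ** 2)


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    N, M, K = 100, 4, 4                 # small illustrative sizes, realizable case K = M
    B = random_unit_rows(rng, M, N)     # teacher weights B_j
    J = random_unit_rows(rng, K, N)     # untrained student weights J_i
    print("random student :", generalization_error(J, B))
    print("perfect student:", generalization_error(B, B))
```

A student that copies the teacher weights gives zero error by construction, while a random student gives a positive value. The sizes here are far smaller than those in the captions ($N = 10^{12}$), which are only reachable through the order-parameter description.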