Unified Description of Learning Dynamics in the Soft Committee Machine from Finite to Ultra-Wide Regimes
Assem Afanah, Bernd Rosenow
TL;DR
This work analyzes the soft committee machine (SCM) with ReLU activation in a student–teacher setting using an annealed statistical-mechanics framework. By introducing aggregated order parameters $(\tilde{Q}, \tilde{R}, \tilde{r})$, the authors derive a unified description that remains valid from the conventional regime $K \ll N$ to the ultra-wide regime $K \ge N$, with the dataset size encoded in $\alpha$ and the teacher density in $\gamma = M/N$. A central result is a second-order phase transition at $\alpha_c \approx 2\pi$ for $\gamma \ll 1$. At finite $\gamma$ the sharp transition disappears and the generalization error $\varepsilon_g$ decreases smoothly with dataset size; in the high-data limit it scales as $\varepsilon_g \propto 1/\alpha$, independent of $K$ and $\gamma$. The framework integrates known results for ReLU SCMs, demonstrates universal high-data behavior, and suggests extensions to other activations and quenched analyses, highlighting how network dimensions influence learning dynamics in shallow networks.
Abstract
We study the learning dynamics of the soft committee machine (SCM) with Rectified Linear Unit (ReLU) activation using a statistical-mechanics approach within the annealed approximation. The SCM consists of a student network with $N$ input units and $K$ hidden units trained to reproduce the output of a teacher network with $M$ hidden units. We introduce a reduced set of macroscopic order parameters that yields a unified description valid from the conventional regime $K \ll N$ to the ultra-wide limit $K \ge N$. The control parameter $\alpha$, proportional to the ratio of training samples to adjustable weights, serves as an effective measure of dataset size. For small $\gamma = M/N$, we recover a continuous phase transition at $\alpha_{c} \approx 2\pi$ from an unspecialized, permutation-symmetric state to a specialized state in which student units align with the teacher. For finite $\gamma$, the transition disappears and the generalization error decreases smoothly with dataset size, reaching a low plateau when $\gamma = 1$. In the asymptotic limit $\alpha \to \infty$, the error scales as $\varepsilon_{g} \propto 1/\alpha$, independent of $\gamma$ and $K$. The results highlight the central role of network dimensions in SCM learning and provide a framework extendable to other activations and quenched analyses.
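To make the student–teacher setup concrete, the following is a minimal numerical sketch, not the authors' method. It builds a ReLU SCM with $N$ inputs, a student with $K$ hidden units, and a teacher with $M$ hidden units, then estimates the generalization error by Monte Carlo. Several choices are assumptions that the abstract does not specify: the $1/\sqrt{N}$ pre-activation scaling, the unnormalized sum over hidden units, the Gaussian inputs and weights, the factor $1/2$ in the error, and a proportionality constant of 1 in $\alpha = P/(NK)$. All function names are hypothetical.

```python
import numpy as np

def scm_output(W, X):
    """Soft committee machine: sum of ReLU hidden units with fixed unit output weights.

    W: (H, N) hidden-layer weights, X: (P, N) inputs.
    The 1/sqrt(N) pre-activation scaling is an assumed convention.
    """
    N = X.shape[1]
    pre_activation = X @ W.T / np.sqrt(N)           # shape (P, H)
    return np.maximum(pre_activation, 0.0).sum(axis=1)

def generalization_error(W_student, W_teacher, n_test=20000, seed=0):
    """Monte Carlo estimate of 0.5 * E[(student - teacher)^2] over Gaussian inputs (assumed)."""
    rng = np.random.default_rng(seed)
    N = W_teacher.shape[1]
    X = rng.standard_normal((n_test, N))
    diff = scm_output(W_student, X) - scm_output(W_teacher, X)
    return 0.5 * np.mean(diff ** 2)

rng = np.random.default_rng(1)
N, K, M = 50, 8, 4                                  # illustrative sizes, K << N regime
W_teacher = rng.standard_normal((M, N))
W_student = rng.standard_normal((K, N))             # untrained student

P = 2000                                            # number of training samples
alpha = P / (N * K)                                 # samples per adjustable weight (constant assumed to be 1)
gamma = M / N                                       # teacher density

eps_random = generalization_error(W_student, W_teacher)
eps_perfect = generalization_error(W_teacher, W_teacher)
print(f"alpha={alpha}, gamma={gamma}, eps_g(untrained)={eps_random:.3f}, eps_g(student=teacher)={eps_perfect}")
```

The sketch only evaluates the error for fixed weights. It does not reproduce the annealed calculation or the transition at $\alpha_c \approx 2\pi$, both of which require the order-parameter analysis described in the abstract.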
