Continuous Specialization Transition in the Soft Committee Machine with ReLU Activation

Assem Afanah, Bernd Rosenow

Abstract

We analyze the soft committee machine with Rectified Linear Unit (ReLU) activation by means of the replica method. In a realizable teacher-student setting, we compute the quenched free energy within a replica-symmetric ansatz and obtain the typical generalization behavior from the saddle-point equations for the macroscopic order parameters. The system exhibits a transition from an unspecialized symmetric phase to a specialized phase in which the permutation symmetry among hidden units is broken. We determine the critical training-set size as a function of the inverse training temperature and derive analytic expressions both near the transition and in the asymptotic large-sample regime. Unlike the corresponding model with sigmoidal activations, which undergoes a first-order transition, the ReLU soft committee machine shows a continuous specialization transition. These results show that the activation function plays a decisive role in the phase structure and generalization behavior of multilayer networks.

Paper Structure (10 sections, 65 equations, 4 figures)

Figures (4)

  • Figure 1: Schematic diagrams of the student and teacher soft committee machines (SCMs). Both networks have an $N$-dimensional input layer and $K$ hidden units. The student weight vectors from the input to the hidden layer are denoted by $\bm{J}_{i}$ and the teacher weight vectors by $\bm{B}_{j}$; the weights from the hidden layer to the output unit are fixed to one. For a given input $\bm{\xi} \in \mathbb{R}^{N}$, the output of the SCM is proportional to the sum of the hidden-layer activations under a Rectified Linear Unit (ReLU) activation function, $g(x) = x \Theta(x)$, where $\Theta(x)$ is the Heaviside step function.
  • Figure 2: Generalization error as a function of $\alpha$ for several values of the inverse training temperature $\beta$. For each $\beta$, the system undergoes a continuous transition at $\alpha_{c}(\beta)$ from an unspecialized symmetric phase with $\Delta=\delta=0$, which gives a plateau at $\varepsilon_{g}=\frac{1}{4}-\frac{1}{2\pi}$, to a specialized phase with $\Delta,\delta>0$. The plateau becomes shorter as $\beta$ increases. The dashed curve shows the limit $\beta\rightarrow\infty$, for which the critical value approaches $\alpha_{c}\approx0.57$. The inset shows the high-temperature limit $\beta\rightarrow0$, plotted as a function of the scaled variable $\alpha\beta$. In this limit the transition occurs at $(\alpha\beta)_{c}\approx2\pi$, in agreement with the annealed approximation.
  • Figure 3: Order parameters $\Delta=R-S$ and $\delta=q-p$ as functions of $\alpha$ for $\alpha>\alpha_{c}$. Both increase monotonically with $\alpha$ and therefore measure the degree of specialization in the network. Asymptotically, $\Delta,\delta\rightarrow1$, corresponding to perfect alignment between student and teacher and hence $\varepsilon_{g}\rightarrow0$.
  • Figure 4: Order parameters $(\Delta,\delta)$ close to the transition point $\alpha_{c}$ for $\beta=5$. In panels (a) and (b), the analytic results (red dashed curves) agree closely with the numerical solutions (solid blue curves) near $\alpha_{c}$ and deviate only farther away from the transition. The log-log plots in panels (c) and (d) show the expected scaling, $\Delta\propto(\alpha-\alpha_{c})^{1/2}$ and $\delta\propto(\alpha-\alpha_{c})$, consistent with Eq. (\ref{Eq:sol_anlyt}).
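
The architecture described in the Figure 1 caption can be illustrated with a minimal NumPy sketch. The caption only states that the output is proportional to the sum of the ReLU hidden activations, so the prefactors used here ($1/\sqrt{N}$ on the local fields and $1/\sqrt{K}$ on the output), the Gaussian inputs, and the definition of the generalization error as half the mean squared student-teacher difference are assumptions made for illustration, not the paper's stated conventions. The function names are hypothetical.

```python
import numpy as np

def relu(x):
    # g(x) = x * Theta(x), written as an elementwise maximum with zero
    return np.maximum(x, 0.0)

def scm_output(weights, xi):
    # weights: (K, N) array, one row per hidden unit; xi: input of length N.
    # Hidden-to-output weights are fixed to one, so the output is a plain sum.
    # ASSUMPTION: 1/sqrt(N) field scaling and 1/sqrt(K) output scaling.
    K, N = weights.shape
    fields = weights @ xi / np.sqrt(N)
    return relu(fields).sum() / np.sqrt(K)

def generalization_error(student, teacher, n_samples=2000, seed=0):
    # Monte Carlo estimate over i.i.d. standard Gaussian inputs.
    # ASSUMPTION: error defined as half the mean squared output difference.
    rng = np.random.default_rng(seed)
    N = student.shape[1]
    xis = rng.standard_normal((n_samples, N))
    diffs = [scm_output(student, xi) - scm_output(teacher, xi) for xi in xis]
    return 0.5 * np.mean(np.square(diffs))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    teacher = rng.standard_normal((3, 50))   # K = 3 hidden units, N = 50 inputs
    student = rng.standard_normal((3, 50))   # an untrained, random student
    print("random student:", generalization_error(student, teacher))
    print("perfect student:", generalization_error(teacher, teacher))
```

A student whose weight vectors coincide with the teacher's reproduces the teacher output exactly, which corresponds to the $\varepsilon_{g}\rightarrow0$ limit mentioned in the Figure 3 caption. The sketch does not model training or the replica calculation itself.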