The committee machine: Computational to statistical gaps in learning a two-layers neural network

Benjamin Aubin, Antoine Maillard, Jean Barbier, Florent Krzakala, Nicolas Macris, Lenka Zdeborová

TL;DR

This work provides a rigorous foundation for replica-based predictions in the two-layer committee machine by deriving a replica-symmetric free entropy via adaptive interpolation and linking it to Bayes-optimal generalization. It introduces an AMP algorithm with state evolution that achieves the Bayes-optimal performance over a broad parameter range, and reveals a substantial computational gap in regimes where information-theoretic generalization is possible but polynomial algorithms fail. The analysis uncovers a specialization phase transition as the number of hidden units grows, with distinct behavior for Gaussian vs binary weights and for large K, indicating a rich landscape of algorithmic hardness in multi-layer networks. Overall, the paper connects statistical physics insights to provable asymptotics and practical inference algorithms, highlighting both potential and limits of efficient learning in two-layer architectures and outlining directions for extending to deeper models.

Abstract

Heuristic tools from statistical physics have been used in the past to locate the phase transitions and compute the optimal learning and generalization errors in the teacher-student scenario for multi-layer neural networks. In this contribution, we provide a rigorous justification of these approaches for a two-layer neural network model called the committee machine. We also introduce a version of the approximate message passing (AMP) algorithm for the committee machine that performs optimal learning in polynomial time for a large set of parameters. We find that there are regimes in which a low generalization error is information-theoretically achievable while the AMP algorithm fails to deliver it, strongly suggesting that no efficient algorithm exists for those cases, and unveiling a large computational gap.

Paper Structure

This paper contains 63 sections, 14 theorems, 191 equations, 5 figures, 2 algorithms.

Key Result

Theorem 3.1

Suppose hyp:1, hyp:2 and hyp:3, and Assumption assumption (footnote: since the publication of this work, the adaptive interpolation method used in this paper has been improved for finite-rank models and can now circumvent this artificial hypothesis; see barbier2020information and reeves2020information). Then for

Figures (5)

  • Figure 1: The committee machine is one of the simplest models in the considered model class \ref{model}, and the one on which we focus to illustrate our results. It is a two-layer neural network with sign activation functions $f^{(1)},f^{(2)}=\text{sign}$ and second-layer weights $W^{(2)}$ fixed to unity. It is represented here for $K=2$.
  • Figure 2: Generalization error and order parameter for a committee machine with two hidden neurons ($K=2$) with Gaussian weights (left), binary/Rademacher weights (right). These are shown as a function of the ratio $\alpha=m/n$ between the number of samples $m$ and the dimensionality $n$. Lines are obtained from the state evolution (SE) equations (dominating solution is shown in full line), data-points from the AMP algorithm averaged over 10 instances of the problem of size $n=10^4$. $q_{00}$ and $q_{01}$ denote diagonal and off-diagonal overlaps, and their values are given by the labels on the far-right of the figure.
  • Figure 3: (Left) Bayes-optimal and AMP generalization errors and (right) diagonal and off-diagonal overlaps $q_{00}$ and $q_{01}$ for a committee machine with a large number of hidden neurons $K$ and Gaussian weights, as a function of the rescaled parameter $\widetilde{\alpha}=\alpha/K$. Curves shown correspond to the value $K = 10$. Solutions corresponding to global and local minima of the replica free entropy are respectively represented with full and dashed lines. The dotted line marks the spinodal at $\widetilde{\alpha}^G_{\rm spinodal}\simeq 7.17$, i.e. the appearance of a local minimum in the replica free entropy, associated with a solution with specialized hidden units. The dash-dotted line shows the first-order specialization transition at $\widetilde{\alpha}^G_{\rm spec} \simeq 7.65$, at which the specialized fixed point becomes the global minimum. For $\widetilde{\alpha} < \widetilde{\alpha}^G_{\rm spec}$, AMP reaches the Bayes-optimal generalization error and overlaps, corresponding to a non-specialized solution with $q_{00} = q_{01}$. However, for $\widetilde{\alpha} > \widetilde{\alpha}^G_{\rm spec}$, the AMP algorithm does not follow the optimal specialized solution and is stuck in the non-specialized solution plateau, represented with dashed lines (in particular $q_{00}^{\mathrm{AMP}} = q_{01}^{\mathrm{AMP}} \simeq 1/K$ at large $\widetilde{\alpha}$). This reveals a large computational gap (yellow area). We finally emphasize that the initial descent of the generalization error of the non-specialized solution to a plateau occurs for finite $\alpha$ as $K \to \infty$ (i.e. for $\widetilde{\alpha}$ going to $0$). On the other hand, the $K \to \infty$ limit of the transition points $(\widetilde{\alpha}^G_{\rm spec},\widetilde{\alpha}^G_{\rm spinodal})$, as well as the generalization error values for all finite $\widetilde{\alpha}$, are found to be very well approximated by their values for $K = 10$.
  • Figure 4: Factor graph representation of the committee machine (for $n=4$ and $m=3$). The variable (circle) $W_i \in \mathbb{R}^{K}$ needs to satisfy a prior constraint (square) $P_0$ and a constraint accounting for the fully connected layer, that correlates all the variables together.
  • Figure 5: Similar plot to Fig. 2, but for the parity machine with two hidden neurons. Values of the order parameter and the optimal generalization error with Gaussian weights (left) and binary/Rademacher weights (right). SE and AMP overlaps are shown as full lines and points, respectively.
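
The teacher-student setup behind Figures 1 and 2 can be illustrated with a minimal sketch. The captions only state that both activations are sign functions, that the second-layer weights are fixed to unity, and that $\alpha = m/n$; the i.i.d. Gaussian inputs, the $1/\sqrt{n}$ scaling of the pre-activations, and the function names `sample_committee_data` and `overlap` are assumptions made here for illustration, not the authors' code.

```python
import numpy as np


def sample_committee_data(n, alpha, K=3, binary_weights=False, seed=0):
    """Draw one teacher and a labelled data set for a committee machine.

    Assumptions (not stated in the captions above): inputs are i.i.d.
    standard Gaussian and pre-activations are scaled by 1/sqrt(n).
    """
    rng = np.random.default_rng(seed)
    m = int(round(alpha * n))  # number of samples, from alpha = m / n

    # Teacher weights: one column per hidden unit.
    if binary_weights:
        W = rng.choice([-1.0, 1.0], size=(n, K))  # binary/Rademacher prior
    else:
        W = rng.standard_normal((n, K))           # Gaussian prior

    X = rng.standard_normal((m, n))

    # First layer: sign activation on each hidden unit.
    hidden = np.sign(X @ W / np.sqrt(n))          # shape (m, K)

    # Second layer: weights fixed to unity, sign of the summed votes.
    # For even K the votes can tie and np.sign returns 0; how ties are
    # treated in the paper is not specified in this summary.
    Y = np.sign(hidden.sum(axis=1))
    return X, W, Y


def overlap(W_a, W_b):
    """K x K overlap matrix between two weight matrices of shape (n, K)."""
    return W_a.T @ W_b / W_a.shape[0]


if __name__ == "__main__":
    X, W, Y = sample_committee_data(n=1000, alpha=2.0, K=3)
    print(X.shape, W.shape, Y.shape)
    print(np.round(overlap(W, W), 2))  # close to the identity for Gaussian W
```

With an estimate `W_hat` of the teacher weights, `overlap(W_hat, W)` gives a matrix whose diagonal and off-diagonal entries play the role of $q_{00}$ and $q_{01}$ in the figures; an odd `K` is used as the default only to avoid tied votes in the label.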

Theorems & Definitions (26)

  • Theorem 3.1: Replica formula
  • Remark 3.2: Relaxing the hypotheses
  • Lemma 5.1: Perturbation of the free entropy
  • proof
  • Proposition 5.2: Free entropy variation
  • Proposition 5.3: Overlap concentration
  • Proposition 5.4: Fundamental sum rule
  • proof
  • Lemma 5.5
  • proof: Proof of Lemma \ref{lemma:positivity_trace_jac}
  • ...and 16 more