The effect of priors on Learning with Restricted Boltzmann Machines

Gianluca Manzan, Daniele Tantari

TL;DR

This work analyzes learning in Restricted Boltzmann Machines under a teacher–student framework with unit priors that interpolate between Gaussian and binary distributions. Using a replica-symmetric (RS) free-energy analysis, it derives the full phase diagram and identifies a triple point that fixes the minimal dataset size $\alpha_c$ needed for learning by generalization; this bound is set by the properties of the teacher, and hence of the data. The study shows that Gaussian priors on the student's hidden units make it easier to enter the signal retrieval phase, while other priors can induce memorization-like retrieval or spin-glass behavior, especially in mismatched settings. Together with Monte Carlo simulations, the results offer practical guidance on architectural choices that maximize generalization under limited data, and point to possible extensions to structured data regimes.

Abstract

Restricted Boltzmann Machines (RBMs) are generative models designed to learn from data with a rich underlying structure. In this work, we explore a teacher-student setting where a student RBM learns from examples generated by a teacher RBM, with a focus on the effect of the unit priors on learning efficiency. We consider a parametric class of priors that interpolate between continuous (Gaussian) and binary variables. This approach models various possible choices of visible units, hidden units, and weights for both the teacher and student RBMs. By analyzing the phase diagram of the posterior distribution in both the Bayes optimal and mismatched regimes, we demonstrate the existence of a triple point that defines the critical dataset size necessary for learning through generalization. The critical size is strongly influenced by the properties of the teacher, and thus the data, but is unaffected by the properties of the student RBM. Nevertheless, a prudent choice of student priors can facilitate training by expanding the so-called signal retrieval region, where the machine generalizes effectively.
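As an illustration of what a prior "interpolating between Gaussian and binary variables" can look like, the sketch below draws samples from one possible one-parameter family. The specific form $x = \sqrt{\Omega}\,g + \sqrt{1-\Omega}\,\epsilon$, with $g$ standard Gaussian and $\epsilon=\pm 1$, is an assumption made here for illustration and is not stated in the abstract; it is chosen only because it gives binary units at $\Omega=0$, Gaussian units at $\Omega=1$, and unit variance in between. The function name `sample_interpolating_prior` is hypothetical.

```python
import math
import random


def sample_interpolating_prior(omega, n, rng=None):
    """Draw n samples from an assumed Gaussian/binary interpolating prior.

    Assumed form (not taken from the paper's text):
        x = sqrt(omega) * g + sqrt(1 - omega) * eps,
    with g ~ N(0, 1) and eps uniform on {-1, +1}.
    omega = 0 gives purely binary units, omega = 1 purely Gaussian units,
    and every omega in [0, 1] gives zero mean and unit variance.
    """
    if not 0.0 <= omega <= 1.0:
        raise ValueError("omega must lie in [0, 1]")
    rng = rng or random.Random()
    gauss_weight = math.sqrt(omega)
    binary_weight = math.sqrt(1.0 - omega)
    return [
        gauss_weight * rng.gauss(0.0, 1.0) + binary_weight * rng.choice((-1.0, 1.0))
        for _ in range(n)
    ]


# Example usage: one draw of five units for each of three interpolation values.
if __name__ == "__main__":
    rng = random.Random(0)
    for omega in (0.0, 0.44, 1.0):
        print(omega, sample_interpolating_prior(omega, 5, rng))
```

Under this assumed parametrization a single number per unit type is enough to move continuously from a binary to a Gaussian unit while keeping the variance fixed, which is what makes it convenient to compare different teacher and student choices on the same phase diagram.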

Paper Structure

This paper contains 13 sections, 92 equations, 10 figures.

Figures (10)

  • Figure 1: Inverse teacher-student problem. On the left, the T-RBM generates the data, following the interactions of the planted signal $\hat{\bm{\xi}}$. On the right, a representation of the S-RBM, which tries to align its own weight vector $\bm{\xi}$ towards $\hat{\bm{\xi}}$ using information extracted from the dataset $\bm{\mathcal{S}}$.
  • Figure 2: Retrieval phase transition lines in the Bayes-optimal scenario. Below the transition curve it is possible to recover the planted signal $\bm{\hat{\xi}}$. Left: P-F transition line for $\Omega_{\tau}\in\{0,0.44,1\}$. For each choice of the $\tau$ prior, the perfect retrieval (PF) region starts below the star-marked line, i.e. $T<\Omega_\tau$. Right: P-F lines (without the perfect retrieval region) for all possible values of $\Omega_\tau$. Above each colored line (taken individually) the student is in a paramagnetic phase; below it, in a ferromagnetic phase.
  • Figure 3: Coordinates of the triple point as a function of the prior parameters. Left: Critical temperature for different choices of the hidden unit prior of both the S-RBM and the T-RBM. $T_c$ is an increasing function of $\Omega_\tau$, but decreases with $\Omega_{\hat{\tau}}$. Right: Critical size as a function of the relevant teacher variables. It is a decreasing function of both $\Omega_{\hat{\tau}}$ and $\hat{T}=1/\hat{\beta}$.
  • Figure 4: Phase diagrams of different S-RBM configurations, each with a different choice of hidden unit prior $\Omega_{\tau}$. All students have a binary prior for $\xi$, i.e., $\Omega_\xi=0$. The teacher generates binary data ($\Omega_s=0$) at $\hat{\beta}=0.8$, and its architecture is fixed by the choice $\Omega_{\hat{\xi}}=\Omega_{\hat{\tau}}=0$. The black star marks the critical point $(\alpha_c, T_c)$, while the vertical dashed line shows that the position $\alpha_c \simeq 1.55$ is the same in all panels. The eR phase emerges when the $\tau$ prior has a Gaussian tail.
  • Figure 5: Phase diagrams showing the effect of the student's hidden unit prior when the teacher generates the dataset ($\Omega_s = 0$, $\hat{\beta}=0.8$) using a Hopfield model. The student pattern is chosen with $\Omega_\xi=0$. The critical size needed to enter the inference regime is unaltered by the choice of $\Omega_\tau$, with value $\alpha_c \simeq 0.06$, while the temperature $T_c$ increases.
  • ...and 5 more figures