
Fundamental operating regimes, hyper-parameter fine-tuning and glassiness: towards an interpretable replica-theory for trained restricted Boltzmann machines

Alberto Fachechi, Elena Agliari, Miriam Aquaro, Anthony Coolen, Menno Mulder

TL;DR

This work develops a statistical-mechanics framework for a Restricted Boltzmann Machine (RBM) with binary visible units and Gaussian hidden units, trained on noisy realizations of a single ground pattern. It uses the replica trick under the replica-symmetric (RS) ansatz to derive self-consistent equations for the order parameters. The analysis identifies operating regimes controlled by the regularization strength β_ε, the temperature T_2, and the dataset entropy ρ_0 ≥ 0, revealing a retrieval-dominated RS phase and a subregion where replica-symmetry breaking (RSB) is expected and numerically evident. Numerical experiments corroborate the RS predictions in the generative regime; where RS fails, generated samples show aging, violations of the fluctuation-dissipation theorem (FDT), and multi-cluster structure, all signatures of non-equilibrium glassy dynamics. The results provide an interpretable map between hyperparameters and operating regimes, offer guidance for tuning hyperparameters to achieve stable sampling, and point to extensions of the theory beyond RS to cover high-temperature, high-load, or more complex datasets. Together, these insights clarify when RBMs behave as reliable generators and how their dynamics relate to the underlying spin-glass-like landscape.

Abstract

We consider restricted Boltzmann machines with a binary visible layer and a Gaussian hidden layer trained on an unlabelled dataset composed of noisy realizations of a single ground pattern. We develop a statistical mechanics framework to describe the network's generative capabilities, by exploiting the replica trick and assuming self-averaging of the underlying order parameters (i.e., replica symmetry). In particular, we outline the effective control parameters (e.g., the relative number of weights to be trained, the regularization parameter), whose tuning can yield qualitatively different operative regimes. Further, we provide analytical and numerical evidence for the existence of a sub-region in the space of the hyperparameters where replica-symmetry breaking occurs.
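
To make the setup concrete, the following is a minimal sketch, not the authors' code, of the two ingredients named in the abstract: a dataset of noisy realizations of one ground pattern, and block-Gibbs sampling in an RBM with ±1 visible units and Gaussian hidden units. Several details are assumptions made only for illustration: the noise model (each entry agrees with the ground pattern with probability (1+r)/2, where r is a hypothetical dataset-quality parameter), the energy function written in the docstring, the inverse temperature `beta`, and the weight matrix `W`, which here is a hand-built stand-in rather than the result of training.

```python
import numpy as np

def make_dataset(xi, M, r, rng):
    """Return M noisy copies of the ground pattern xi (entries +/-1).

    Assumed noise model: each entry keeps the sign of xi with
    probability (1 + r) / 2, so the mean overlap with xi is about r.
    """
    N = xi.shape[0]
    chi = np.where(rng.random((M, N)) < (1 + r) / 2, 1, -1)
    return chi * xi  # shape (M, N)

def gibbs_step(sigma, W, beta, rng):
    """One block-Gibbs sweep for a binary-visible, Gaussian-hidden RBM.

    Assumed energy (illustrative, not taken from the paper):
        E(sigma, z) = -sum_{i,mu} sigma_i W[i, mu] z_mu / sqrt(N)
                      + sum_mu z_mu**2 / 2
    sampled from the Boltzmann weight exp(-beta * E).
    """
    N = sigma.shape[0]
    # z | sigma is Gaussian: mean W^T sigma / sqrt(N), variance 1 / beta.
    mean_z = W.T @ sigma / np.sqrt(N)
    z = mean_z + rng.standard_normal(mean_z.shape) / np.sqrt(beta)
    # sigma | z factorizes: P(sigma_i = +1) = (1 + tanh(beta * h_i)) / 2.
    h = W @ z / np.sqrt(N)
    p_up = 0.5 * (1.0 + np.tanh(beta * h))
    new_sigma = np.where(rng.random(N) < p_up, 1, -1)
    return new_sigma, z

rng = np.random.default_rng(0)
N, M, r = 200, 500, 0.6          # visible size, dataset size, quality (all hypothetical)
alpha = 0.1                      # relative width of the hidden layer
P = int(alpha * N)               # number of hidden units

xi = rng.choice([-1, 1], size=N)             # ground pattern
data = make_dataset(xi, M, r, rng)
data_overlap = float(np.mean(data @ xi) / N)  # empirical overlap with xi

# Stand-in weights (NOT trained): first hidden unit aligned with the
# dataset mean, the remaining ones small random values.
W = 0.1 * rng.standard_normal((N, P))
W[:, 0] = data.mean(axis=0)

sigma = data[0].copy()
for _ in range(50):
    sigma, z = gibbs_step(sigma, W, beta=10.0, rng=rng)
m_tilde = float(sigma @ xi / N)   # overlap of the generated sample with xi
print(f"dataset overlap: {data_overlap:.3f}, sample overlap: {m_tilde:.3f}")
```

Under these assumptions, `m_tilde` plays the role of the magnetization $\tilde{m}$ shown in the figures below, measured on a single sampled configuration rather than computed from the self-consistency equations.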

Paper Structure

This paper contains 25 sections, 104 equations, 12 figures, and 1 table.

Figures (12)

  • Figure 2: Numerical solutions of the self-consistency equations for $\rho_0=0$. The three plots show details of the numerical solutions for the order parameter $\tilde{m}$ of the self-consistency equations \ref{eq:n_final_0}-\ref{eq:q_final_0}. The left plot shows $\tilde{m}$ as a function of $\alpha$ for various values of $T_2$ at $\beta_\epsilon=30$. The center plot shows the magnetization $\tilde{m}$ as a function of $T_2$ for various values of $\alpha$ at $\beta_\epsilon=30$, with square markers highlighting the best generating performance. The right plot shows the magnetization at $T_2=0.1$ as a function of $\alpha$ for various values of the regularization parameter $\beta_\epsilon$.
  • Figure 3: Numerical solutions of the self-consistency equations for $\rho_0\neq0$. The three plots show details of the numerical solutions for the order parameter $\tilde{m}$ of the self-consistency equations \ref{eq:n_final_0}-\ref{eq:q_final_0}. The left plot shows $\tilde{m}$ as a function of $\alpha$ at $T_2=0.05$ and $\beta_\epsilon=30$ for various values of $\rho_0$. The dashed lines correspond to the upper bound given by $\tilde{m}=r$. The center plot shows the magnetization $\tilde{m}$ as a function of $T_2$ for various values of $\alpha$ at $\rho_0=0.1$ and $\beta_\epsilon=30$. The right plot shows the magnetization as a function of $\rho_0$ for various values of the parametrization load $\alpha$ at $T_2=0.1$ and $\beta_\epsilon=30$.
  • Figure 4: Response diagram of the RBM. The panels show the response of the RBM in terms of the final overlap $\tilde{m}$ of the visible layer with the pattern $\boldsymbol{\xi}$ in the $(\alpha,T_2)$ plane for various values of $\beta_\epsilon$ and $\rho_0$. The color map illustrates the values taken by the magnetization $\tilde{m}$ (the legend is reported on the right), and the red curve marks the one-dimensional subspace on which the maximum magnetization $\tilde{m}$ is achieved. The panels are organized as follows: from left to right, we increase the dataset entropy $\rho_0$ ($0,0.1,0.2$), while from top to bottom we increase the regularization parameter $\beta_\epsilon$ ($5,10,30$). We stress that Eq. \ref{eq:rs_break} gives the temperature at which the RS retrieval solution breaks down: $T_2\approx 0.37$ for $\rho=0.1$ and $\beta_\epsilon=30$, $T_2= 0.6$ for $\rho=0.2$ and $\beta_\epsilon=10$, and $T_2= 0.2$ for $\rho=0.2$ and $\beta_\epsilon=30$, in perfect agreement with the numerical solution of the self-consistency equations summarized in the color maps.
  • Figure 5: Comparison between retrieval solutions below and above the regularization threshold. The two plots show the behavior of the retrieval solution (in terms of the feature magnetization $\tilde{m}$ as a function of the relative width of the hidden layer $\alpha$) below ($\beta_\epsilon =0.5$) and above ($\beta _\epsilon =2$) the regularization threshold at $\rho_0 = 0$ and for various values of the temperature $T_2 <1$.
  • Figure 6: Training performance of the binary-Gaussian RBMs. The figure shows the evolution of the likelihood function during the training procedure for various values of the external parameters. First row: we fix $\beta_\epsilon=30$ and $\rho_0=0.2$ and vary $\alpha=0.2,0.7,2.2$ (resp. left, middle and right plots) and $T_2=0.02,0.2,0.3$ (resp. blue, orange and green curves). Second row: we fix $T_2=0.3$ and $\alpha=0.25$ and vary $\rho_0=0,0.1,0.2$ (resp. left, middle and right plots) and $\beta_\epsilon=2,5,30$ (resp. blue, orange and green curves). Each curve is the average over 100 different realizations of the training procedure at fixed external parameters, and the associated filled region is the inter-quartile range at each training epoch.
  • ...and 7 more figures