Fast training and sampling of Restricted Boltzmann Machines

Nicolas Béreux, Aurélien Decelle, Cyril Furtlehner, Lorenzo Rosset, Beatriz Seoane

TL;DR

This work tackles the core bottlenecks of training and sampling equilibrium Restricted Boltzmann Machines on highly structured, multimodal datasets. It introduces a trajectory-based annealing framework (Tr-AIS) for online log-likelihood estimation and a sampling scheme (Parallel Trajectory Tempering, PTT) that exchanges configurations across models along the training path to overcome slow mixing. A low-rank RBM pretraining approach maps principal data directions into the coupling matrix via a convex optimization (RCM), mitigating early training slowdowns and improving model quality on structured data. Across diverse datasets, the proposed methods yield faster convergence, more reliable equilibrium sampling, and better log-likelihood estimates, with pretraining especially beneficial for highly clustered data.

Abstract

Restricted Boltzmann Machines (RBMs) are powerful tools for modeling complex systems and extracting insights from data, but their training is hindered by the slow mixing of Markov Chain Monte Carlo (MCMC) processes, especially with highly structured datasets. In this study, we build on recent theoretical advances in RBM training and focus on the stepwise encoding of data patterns into singular vectors of the coupling matrix, significantly reducing the cost of generating new samples and evaluating the quality of the model, as well as the training cost in highly clustered datasets. The learning process is analogous to the thermodynamic continuous phase transitions observed in ferromagnetic models, where new modes in the probability measure emerge in a continuous manner. We leverage the continuous transitions in the training process to define a smooth annealing trajectory that enables reliable and computationally efficient log-likelihood estimates. This approach enables online assessment during training and introduces a novel sampling strategy called Parallel Trajectory Tempering (PTT) that outperforms previously optimized MCMC methods. To mitigate the critical slowdown effect in the early stages of training, we propose a pre-training phase. In this phase, the principal components are encoded into a low-rank RBM through a convex optimization process, facilitating efficient static Monte Carlo sampling and accurate computation of the partition function. Our results demonstrate that this pre-training strategy allows RBMs to efficiently handle highly structured datasets where conventional methods fail. Additionally, our log-likelihood estimation outperforms computationally intensive approaches in controlled scenarios, while the PTT algorithm significantly accelerates MCMC processes compared to conventional methods.
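
The trajectory-based log-likelihood estimate described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a binary {0,1} RBM stored as a hypothetical tuple `(W, a, b)`, an offline list of saved models whose first entry has zero couplings (so its partition function is known in closed form), and one block-Gibbs step per model. The helper names `free_energy`, `gibbs_step`, `exact_log_z`, and `trajectory_ais` are illustrative. The annealing path runs through the sequence of models rather than through a temperature ladder, and the result is compared with an exact enumeration over hidden states, which is feasible only for a small hidden layer.

```python
import numpy as np

def free_energy(v, W, a, b):
    """Visible free energy F(v) of a binary RBM, with p(v) = exp(-F(v)) / Z."""
    return -(v @ a) - np.logaddexp(0.0, b + v @ W).sum(axis=-1)

def gibbs_step(v, W, a, b, rng):
    """One block-Gibbs update v -> h -> v'."""
    p_h = 1.0 / (1.0 + np.exp(-(b + v @ W)))
    h = (rng.random(p_h.shape) < p_h).astype(float)
    p_v = 1.0 / (1.0 + np.exp(-(a + h @ W.T)))
    return (rng.random(p_v.shape) < p_v).astype(float)

def exact_log_z(W, a, b):
    """Exact log Z by enumerating all 2**n_hid hidden states (small n_hid only)."""
    n_hid = b.size
    h = ((np.arange(2 ** n_hid)[:, None] >> np.arange(n_hid)) & 1).astype(float)
    # The visible units are summed out analytically for each hidden state.
    log_terms = h @ b + np.logaddexp(0.0, a + h @ W.T).sum(axis=-1)
    m = log_terms.max()
    return m + np.log(np.exp(log_terms - m).sum())

def trajectory_ais(models, n_chains, rng):
    """Estimate log Z of models[-1], annealing through the list of models.

    Assumption of this sketch: models[0] has W == 0, so it can be sampled
    exactly and its log Z is a sum of softplus terms.
    """
    W0, a0, b0 = models[0]
    log_z0 = np.logaddexp(0.0, a0).sum() + np.logaddexp(0.0, b0).sum()
    p0 = 1.0 / (1.0 + np.exp(-a0))
    v = (rng.random((n_chains, a0.size)) < p0).astype(float)
    log_w = np.zeros(n_chains)
    for prev, nxt in zip(models[:-1], models[1:]):
        # Importance weight for moving from one model to the next.
        log_w += free_energy(v, *prev) - free_energy(v, *nxt)
        # Transition that leaves the next model's distribution invariant.
        v = gibbs_step(v, *nxt, rng)
    m = log_w.max()
    return log_z0 + m + np.log(np.exp(log_w - m).mean())

rng = np.random.default_rng(0)
n_vis, n_hid = 6, 3
W_final = 0.5 * rng.standard_normal((n_vis, n_hid))
a = 0.1 * rng.standard_normal(n_vis)
b = 0.1 * rng.standard_normal(n_hid)
# Stand-in for saved checkpoints: couplings grow smoothly from zero.
models = [(s * W_final, a, b) for s in np.linspace(0.0, 1.0, 21)]
print("Trajectory AIS log Z:", trajectory_ais(models, 5000, rng))
print("Exact log Z:         ", exact_log_z(W_final, a, b))
```

The log-likelihood of a data batch `v_data` then follows as `-free_energy(v_data, W, a, b).mean() - log_z`. In a real run the list `models` would hold checkpoints saved during training; the linear scaling of `W_final` used here is only a placeholder for such a trajectory.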

Paper Structure

This paper contains 32 sections, 52 equations, 20 figures, 5 tables, 2 algorithms.

Figures (20)

  • Figure 1: Datasets. Panels A-E display 5 distinct datasets projected onto their first two PCA components. In some instances, the dots are color-coded to indicate different labels. Panel A shows the MNIST01 dataset, comprising the images of the digits 0 and 1 from the complete MNIST collection, along with a few sample images. Panel B shows the "Mickey" dataset, an artificial dataset whose PCA representation forms the shape of Mickey Mouse's face. Panel C shows the Human Genome Dataset (HGD), which consists of binary vectors representing mutations or non-mutations for individuals compared to a reference genome across selected genes. Panel D shows the Ising dataset, consisting of equilibrium configurations of the 2D ferromagnetic Ising model at an inverse temperature of $\beta=0.44$. Panel E shows the CelebA dataset in black and white, resized to 32x32 pixels. For more details on these datasets, please refer to the SI.
  • Figure 2: Comparison of LL estimation error (relative to the exact value) across different methods: AIS (A), AIS with a reference distribution fixed to the independent site distribution matching the dataset’s empirical center (B), and Tr-AIS (ours, C). For Tr-AIS, we evaluate three approaches: online during training, offline using saved models, and with PTT. The RBM was pretrained and then trained with 20 hidden nodes (allowing exact LL computation) for $10^4$ gradient steps on the HGD dataset. PTT selects a subset of saved models ensuring a 0.25 acceptance rate between consecutive models. Lines show the mean LL over 10 independent runs, with shaded areas representing one standard deviation.
  • Figure 3: Comparison of the performance of different MCMC sampling methods on RBMs trained with the pretraining+PCD procedure on the MNIST01 (row 1), HGD (row 2), and Ising 2D (row 3) datasets. In columns A and B, we show the trajectories of two independent Markov chains (red and orange) iterated for $10^4$ MCMC steps using either PTT or AGS, projected onto the first two principal components of the dataset. The position of the chains is plotted every 10 MCMC steps. The black contour represents the density profile of the dataset. In column C, we show the average number of jumps between the two regions separated by the dashed grey line in the PCA plots for different MCMC methods: AGS, Parallel Tempering (PT) [hukushima1996exchange], with or without a reference configuration (PTref) [krause_algorithms_2020], Stacked Tempering [roussel2023accelerated], and the Parallel Trajectory Tempering (PTT) proposed in this work. The average is calculated over a population of 1000 chains, and the shaded band around each line indicates the error of the mean.
  • Figure 4: On the left: Panel A compares the marginal distribution $p(m)$ along the first two principal components for the pretrained+PCD RBM on the HGD dataset at inverse temperatures $\beta\!=\!1$ and $\beta\!=\!0.9$. Panel B presents sampling results at $\beta\!=\!1$ using PT, PTT, and $10^6$ AGS steps, initialized from configurations generated by both algorithms. Right panels: Panels C and D compare equilibrium samples from RBMs trained with PCD alone vs. PCD initialized on low-rank RBMs for MNIST01, Mickey, HGD, Ising2D, and CelebA. Scatter plots and histograms of projections onto the first two principal components show data (black) vs. generated samples (red). Panel E tracks log-likelihood evolution (train: solid, test: dashed) using online Tr-AIS. Results, averaged over 10 low-rank RBM initializations, show minimal variance, as indicated by narrow shaded regions.
  • Figure 5: Scheme of PTT. We initialize the chains of the models by starting from a configuration $\bm x_0^{(0)}$ and passing it through the machines along the training trajectory, each time performing $\tilde{k}$ MCMC steps. For pre-train+PCD, $\bm x_0^{(0)}$ is a sample from the RCM; otherwise it is a uniform random initialization. The sampling consists of alternating one MCMC step for each model with a swap attempt between adjacent machines. For pre-train+PCD, at each step we sample a new independent configuration for $\mathrm{RBM}_0$ using the RCM.
  • ...and 15 more figures
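
The PTT scheme described in the Figure 5 caption can be sketched in a few lines. This is a hedged illustration rather than the authors' code: it assumes binary {0,1} RBMs stored as hypothetical tuples `(W, a, b)`, a uniform random initial configuration (the PCD-only case), and a Metropolis swap rule written with the visible free energy, which is one valid choice among several. The helper names `init_chains` and `ptt_step` are illustrative, and the RCM resampling of $\mathrm{RBM}_0$ used with pre-training is omitted.

```python
import numpy as np

def free_energy(v, W, a, b):
    """Visible free energy F(v) of a binary RBM, with p(v) = exp(-F(v)) / Z."""
    return -(v @ a) - np.logaddexp(0.0, b + v @ W).sum(axis=-1)

def gibbs_step(v, W, a, b, rng):
    """One block-Gibbs update v -> h -> v'."""
    p_h = 1.0 / (1.0 + np.exp(-(b + v @ W)))
    h = (rng.random(p_h.shape) < p_h).astype(float)
    p_v = 1.0 / (1.0 + np.exp(-(a + h @ W.T)))
    return (rng.random(p_v.shape) < p_v).astype(float)

def init_chains(models, n_chains, k_tilde, rng):
    """Pass a random configuration through the models, k_tilde steps each."""
    n_vis = models[0][1].size
    v = (rng.random((n_chains, n_vis)) < 0.5).astype(float)
    chains = []
    for model in models:
        for _ in range(k_tilde):
            v = gibbs_step(v, *model, rng)
        chains.append(v.copy())
    return chains

def ptt_step(chains, models, rng):
    """One MCMC step per model, then swap attempts between adjacent models."""
    chains = [gibbs_step(v, *m, rng) for v, m in zip(chains, models)]
    acc = np.zeros(len(models) - 1)
    for t in range(len(models) - 1):
        lo, hi = chains[t], chains[t + 1]
        # Metropolis log-ratio for exchanging the configurations of models
        # t and t+1: dF(lo) - dF(hi), with dF = F_t - F_{t+1}.
        d_lo = free_energy(lo, *models[t]) - free_energy(lo, *models[t + 1])
        d_hi = free_energy(hi, *models[t]) - free_energy(hi, *models[t + 1])
        accept = np.log(rng.random(lo.shape[0])) < d_lo - d_hi
        chains[t] = np.where(accept[:, None], hi, lo)
        chains[t + 1] = np.where(accept[:, None], lo, hi)
        acc[t] = accept.mean()
    return chains, acc

rng = np.random.default_rng(0)
W_final = rng.standard_normal((8, 4))
a, b = np.zeros(8), np.zeros(4)
# Stand-in for models saved along training: couplings grow from zero.
models = [(s * W_final, a, b) for s in np.linspace(0.0, 1.0, 6)]
chains = init_chains(models, n_chains=100, k_tilde=10, rng=rng)
for _ in range(50):
    chains, acc = ptt_step(chains, models, rng)
print("swap acceptance per adjacent pair:", acc)
```

Samples of the final model are read from `chains[-1]`. The per-pair acceptance rates in `acc` are the quantity one would monitor when choosing which saved models to keep, for instance to target the 0.25 acceptance rate mentioned in the Figure 2 caption.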