Cascade of phase transitions in the training of Energy-based models

Dimitrios Bachtis; Giulio Biroli; Aurélien Decelle; Beatriz Seoane

TL;DR

This work analyzes how energy-based models, specifically Restricted Boltzmann Machines (RBMs), learn data distributions, revealing a cascade of second-order phase transitions in the weight spectrum as training progresses. Using an analytically tractable Binary-Gauss RBM (BG-RBM) framework and validation on real data (HGD, MNIST, CelebA), the authors show that learning proceeds through successive refinements of the principal data modes, with weight directions first aligning to PCA components and later encoding finer structures. A mean-field finite-size scaling hypothesis is proposed and tested, indicating universal critical behavior (e.g., $\gamma=1$) at the first transition, and the authors observe divergent MCMC mixing times and hysteresis at the transitions. The results offer a mechanistic view of feature encoding in generative models and have practical implications for training and sampling efficiency, with potential extensions to deeper EBMs and diffusion-like models.

Abstract

In this paper, we investigate the feature encoding process in a prototypical energy-based generative model, the Restricted Boltzmann Machine (RBM). We start with an analytical investigation using simplified architectures and data structures, and end with a numerical analysis of real trainings on real datasets. Our study tracks the evolution of the model's weight matrix through its singular value decomposition, revealing a series of phase transitions associated with a progressive learning of the principal modes of the empirical probability distribution. The model first learns the center of mass of the modes and then progressively resolves all modes through a cascade of phase transitions. We first describe this process in a controlled setup that allows us to study the training dynamics analytically. We then validate our theoretical results by training the Bernoulli-Bernoulli RBM on real datasets. By using datasets of increasing dimension, we show that learning indeed leads to sharp phase transitions in the high-dimensional limit. Moreover, we propose and test a mean-field finite-size scaling hypothesis. This shows that the first phase transition is in the same universality class as the one we studied analytically, which is reminiscent of the mean-field paramagnetic-to-ferromagnetic phase transition.
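The spectral tracking described in the abstract can be illustrated with a minimal NumPy sketch. It assumes a weight matrix of shape `(N_v, N_h)` and a data matrix of shape `(n_samples, N_v)`; the function names (`weight_spectrum`, `pca_directions`, `mode_overlaps`) are hypothetical and are not taken from the paper's code.

```python
import numpy as np

def weight_spectrum(W):
    """Singular values and visible-space singular vectors of W (shape N_v x N_h)."""
    U, s, _ = np.linalg.svd(W, full_matrices=False)
    return s, U  # columns of U are unit vectors in visible space

def pca_directions(X, k):
    """First k principal directions of the data X (shape n_samples x N_v)."""
    Xc = X - X.mean(axis=0)                     # center the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:k].T                             # shape N_v x k, unit columns

def mode_overlaps(W, X, k=2):
    """Top-k singular values of W and |u_alpha . eta_alpha| for alpha < k."""
    s, U = weight_spectrum(W)
    eta = pca_directions(X, k)
    overlaps = np.abs(np.sum(U[:, :k] * eta, axis=0))
    return s[:k], overlaps
```

Calling `mode_overlaps` on weight snapshots saved at successive epochs would give curves of the kind plotted against training time in Figure 2, panels B and C: singular values that grow one after another, and overlaps that approach 1 as each singular vector aligns with its PCA direction.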

Paper Structure (21 sections, 50 equations, 7 figures, 1 table)

Introduction
Related work
Definition of the model
Theory of learning dynamics for simplified high-dimensional models of data
Learning two features through a phase transition
Learning multiple features through a cascade of phase transitions
Numerical Analysis
Conclusions
Acknowledgments
Binary-Gauss RBM
Binary-Binary RBM
Learning with correlated patterns
The datasets and the rescaling
Details on the training and the numerical analysis
Training
...and 6 more sections

Figures (7)

Figure 1: Left: Learning behavior of the BG-RBM with one hidden node, using data from the Mattis model at different inverse temperatures $\beta$, system sizes $N_\mathrm{v}$ and learning rates $\epsilon$. The argument of the exponential curves is set to $m^2 \epsilon N_\mathrm{v}$, where $\epsilon$ is the learning rate. Inset: (top) behavior of the susceptibility $\chi$; (bottom) magnetization $h^*$ of the learning RBM. The vertical line marks the point at which the susceptibility diverges, indicating the onset of spontaneous magnetization. Right: Learning curves for RBMs learning two correlated patterns. The dashed curves represent the weights of the two hidden nodes projected onto $\bm{\xi}^1+\bm{\xi}^2$, while the dashed-dotted curves are projected onto $\bm{\xi}^1-\bm{\xi}^2$. Inset: Exponential growth during the two phases: top shows growth in the direction $\bm{\xi}^1+\bm{\xi}^2$ at a rate $r^2(1+\kappa)/2$, and bottom shows growth in the direction $\bm{\xi}^1-\bm{\xi}^2$ at a rate $p^2(1-\kappa)/2$. The arguments of the exponentials are not adjusted.
Figure 2: Human genome dataset. Progressive encoding of the main directions of the dataset when training an RBM with the human genome dataset \cite{10002015global}. In A, we show the dataset projected along its first two principal components $\bm \eta_\alpha$ with $\alpha=1,2$, where $m_\alpha^\text{PCA}=\bm{\eta}_\alpha\cdot \bm x^{(d)}/\sqrt{N_\mathrm{v}}$ and $\bm x^{(d)}$ refers to the different entries in the dataset, i.e., a human individual. Points are colored according to each individual's continental origin. In B, we show the evolution of the singular values $w_\alpha$ of the RBM weight matrix $\bm{W}$ as a function of the number of training epochs, and in C, we show the scalar product of the corresponding singular vectors $\bm u_\alpha$ with the corresponding PCA component $\bm \eta_\alpha$. In D, we show the magnetization of the samples generated by the model at different epochs, projected along the first two eigenvectors of $\bm{W}$, which shows that the specialization of the model occurs through the progressive encoding of the main modes of the data in $\bm{W}$.
Figure 3: Training with the MNIST dataset. In A, we show the evolution of the singular values of the RBM's coupling matrix $\bm{W}$ as a function of the training time. In B, we show the evolution of the susceptibilities associated with the magnetizations along the right singular vectors of $\bm{W}$, $m_\alpha=\left\langle \bm u_\alpha\cdot \bm v\right\rangle/N_\mathrm{v}$. In both panels, we consider the standard $N_\mathrm{v}=28^2$ MNIST dataset; different colors refer to different modes. In C, we show the susceptibility associated with the overlaps $q$ and $\bar{q}$ between visible and hidden variables. In D, we show the susceptibility of the first mode as a function of the first singular value $w_1$, obtained with trainings on MNIST data scaled to different system sizes above and below $L=28$. The numerical curves are compared with the theoretical expectation using the Mattis model in Eq. \ref{eq:Mattischi} using $w_{1,\mathrm{c}}=4.45$. The same data are shown in E, scaled using the mean-field finite-size scaling ansatz of Eq. \ref{eq:FSSchi}. In F, we show the first 10 modes' susceptibilities $\chi_{m_\alpha}$ as a function of their corresponding singular value $w_\alpha$ and compare them with the theoretical curve in D. In G, we show the MCMC relaxation time of the machines trained with different $N_\mathrm{v}$ datasets as a function of $w_1$, together with the theoretical expectation for local moves in dashed lines.
Figure 4: Training with the CELEBA and HGD datasets. In A, we plot the hidden susceptibility for different system sizes in the CELEBA dataset, with dashed lines indicating the expected divergence at $w_{1,c} = 4$. In B, we show the mean-field finite-size scaling (FSS) associated with the first transition, using mean-field exponents. In C and D, we present the visible susceptibility for the first phase transition in the HGD dataset, using $w_{1,c} = 5.25$ for scaling. In E, typical hysteresis in the low-temperature phase is illustrated for CELEBA (128$\times$128), similar to the mean-field Ising model in external fields.
Figure 5: Left: learning behavior of the Binary-Binary RBM, using data from the Mattis model. The different curves correspond to systems of size $N_\mathrm{v}=900$ at inverse temperature $\beta=1.4$ with learning rate $\epsilon=0.03,0.04,0.05$ and $N_\mathrm{h} =400,700,1000$ respectively. The arguments of the exponential curves are not adjusted but set to $m^2 \epsilon / \alpha$. Right: we illustrate the RBM's dynamics in the binary-binary case with $\beta=1.4$, $N_\mathrm{v}=900$ and $N_\mathrm{h}=400$. First, the eigenvector $\bm{u}^{\alpha=1}$ aligns itself with the pattern $\bm{\xi}$. Then, the eigenvalue $w_{\alpha=1}$ grows exponentially until reaching saturation; when it crosses the value $1$, the system develops a spontaneous magnetization.
...and 2 more figures
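
The projected magnetizations that appear in the captions above can be sketched in a few lines. The sketch follows the normalization written in the Figure 2 caption, $m=\bm{\eta}\cdot\bm{x}/\sqrt{N_\mathrm{v}}$ with a unit-norm direction. The susceptibility $\chi = N_\mathrm{v}\left(\langle m^2\rangle-\langle m\rangle^2\right)$ is the standard fluctuation-based definition and is an assumption here, since the captions do not spell it out; the function names are hypothetical.

```python
import numpy as np

def projected_magnetization(samples, u):
    """m = u . x / sqrt(N_v) for each row x of `samples`; u is a unit vector."""
    n_v = samples.shape[1]
    return samples @ u / np.sqrt(n_v)

def susceptibility(samples, u):
    """Assumed definition: chi = N_v * (<m^2> - <m>^2) over the samples."""
    n_v = samples.shape[1]
    m = projected_magnetization(samples, u)
    return n_v * m.var()
```

With this normalization, samples made of independent unbiased $\pm 1$ spins give $\chi \approx 1$, while samples that all coincide with the pattern $\sqrt{N_\mathrm{v}}\,\bm{u}$ give $m=1$ and $\chi=0$.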
