Fast and Functional Structured Data Generators Rooted in Out-of-Equilibrium Physics

Alessandra Carbone, Aurélien Decelle, Lorenzo Rosset, Beatriz Seoane

TL;DR

This work tackles the challenge of generating high-quality, label-specific data from highly structured datasets using energy-based models, where conventional MCMC training suffers from slow mixing. The authors introduce out-of-equilibrium training (F&F) for Restricted Boltzmann Machines, coupling two gradient terms to enable rapid, diffusion-like generation and accurate label prediction with only a small number of MCMC steps ($k=10$). Across MNIST, human-genome variations (HGD), GH30 enzyme sequences, SAM riboswitch RNAs, and Classical Music Piano Composer data, F&F-10 RBMs achieve high classification accuracy and diverse, label-consistent generation, outperforming the standard PCD-100 approach, which exhibits instability or long thermalization times on several datasets. They validate the biological relevance of generated protein sequences via structural confidence metrics (pLDDT from ESMFold) and demonstrate the method’s potential to extend to more complex energy functions, offering a practical, diffusion-like alternative for fast, reliable EBMs on structured data.

Abstract

In this study, we address the challenge of using energy-based models to produce high-quality, label-specific data in complex structured datasets, such as population genetics, RNA, or protein sequence data. Traditional training methods encounter difficulties due to inefficient Markov chain Monte Carlo mixing, which reduces the diversity of synthetic data and increases generation times. To address these issues, we use a novel training algorithm that exploits non-equilibrium effects. This approach, applied to the Restricted Boltzmann Machine, improves the model's ability to correctly classify samples and to generate high-quality synthetic data in only a few sampling steps. The effectiveness of this method is demonstrated by its successful application to four different types of data: handwritten digits, mutations of human genomes classified by continental origin, functionally characterized sequences of an enzyme protein family, and homologous RNA sequences from specific taxonomies.

Paper Structure (16 sections, 5 equations, 14 figures, 2 tables)

Figures (14)

  • Figure 1: A) Our F&F-10 model performs accurate conditional generation on a variety of datasets, shown from left to right: (MNIST) handwritten digits classified by number, (HGD) mutations in the human genome classified by continental origin, (GH30) sequences from a homologous enzyme protein family characterized by different biological functions, (SAM) a homologous family of RNA sequences classified by taxonomy, and (CMPC) a collection of piano pieces by various composers. The generated samples are obtained by sampling the model equilibrium distribution for only 10 MCMC steps from a random initialization. The real and generated data are projected along the first two principal components of the PCA of the dataset. Large dots indicate real data, while smaller contoured dots represent generated samples. Each color corresponds to a particular label. The synthetic dataset mirrors the structure of the real dataset, so that each category has exactly the same number of entries. The histograms in the outer panels show the distributions of the dataset (black outline) and of the generated samples (violet-shaded area), projected along each of the main directions shown in the central scatter plot. B), C) Comparison of the accuracies of the B) F&F and C) PCD RBMs in predicting the labels of samples in the test set as a function of the training time. The inference is done by starting from an initial random label and then performing $10^3$ MCMC steps. The purple box in the corner of the insets indicates the maximum accuracy achieved ($a_{\mathrm{max}}$), corresponding to the big purple dot.
  • Figure 2: A) Scheme of the semi-supervised RBM. B) Sketch of the sampling procedures used to calculate the two gradients during training. Left: label prediction. The visible layer is clamped to the data, while the labels are initialized randomly. The hidden layer and labels are sampled alternately using block-Gibbs sampling (green) and, after $k$ MCMC steps, the model must provide the correct labels. Right: conditional sampling. The labels are fixed and the visible layer is initialized randomly. The model must generate a sample corresponding to the label in $k$ MCMC steps.
  • Figure 3: Difficulties in generation with RBMs trained with the PCD-100 protocol. In both panels, we show from left to right the projection onto the first two main directions of the dataset of samples generated conditioned on a given label after $t_\mathrm{G}=10,\ 100,\ 1000,\ 10^4$ and $10^5$ sampling steps, respectively. As in Fig. 1, each point represents a sample, the labels are shown in different colors, and the synthetic data are highlighted by an outer black ring. In the lateral margins, we show the histograms of the projections along each of the two directions: the dataset is shown in black, and the colors refer to the samples generated at different sampling times $t_\mathrm{G}$. In A) we show the results for the GH30 dataset, where PCD training was unstable. Even up to $t_\mathrm{G}=10^5$, the sampling suffers from strong mode collapse. In B) we show the results for the SAM dataset, where PCD-100 training leads to a good generative model. In this case, good-quality samples that reproduce the statistics of the dataset are generated only after $10^5$ MCMC steps.
  • Figure 4: Comparison of the scores on the generated data between PCD-100 and F&F-10 RBMs as a function of the generation time ($t_{\mathrm{G}}$) for A) GH30 and B) SAM. All the scores are computed by comparing the test set with a generated dataset that has the same number of samples in each category. The samples of each category of the dataset are compared with the corresponding samples of the synthetic data, and the curves shown in the figure represent the average scores across the different categories. The different colors of the curves represent different training times ($t_{\mathrm{age}}$), expressed in terms of gradient updates. Notice that for the PCD-RBM the generation time ranges up to $10^5$ MCMC updates, while for the F&F-RBM it only reaches $10^2$ MCMC updates. The generated samples shown in Figs. 1 and 3 correspond to the darkest blue curves at the indicated generation time $t_{\mathrm{G}}$.
  • Figure 5: For each of the datasets considered, we show the evolution of three different quality scores as a function of the generation time, $t_{\mathrm{G}}$, for each label separately. The first row shows the error on the eigenvalue spectra, the second row shows the error on the entropy, and the third row shows the Adversarial Accuracy Indicator. For the GH30 and SAM datasets, we used the training set to generate the error curves because the test set contained too few samples to compare certain categories. The definition of the scores can be found in the SI 3.
  • ...and 9 more figures
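
The two sampling procedures described in the caption of Figure 2 can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes binary visible and hidden units, one-hot labels, and a standard label-augmented RBM parameterization, whereas the sequence datasets in the paper would need categorical visible units. The names `W`, `D`, `a`, `b`, `c`, `predict_labels`, and `generate_conditioned` are hypothetical and chosen only for this example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Assumed shapes: W (n_vis, n_hid), D (n_labels, n_hid),
# a (n_vis,), b (n_hid,), c (n_labels,).

def sample_hidden(rng, v, l, W, D, b):
    # Hidden units given visible units and labels.
    p = sigmoid(b + v @ W + l @ D)
    return (rng.random(p.shape) < p).astype(float)

def sample_visible(rng, h, W, a):
    # Binary visible units given hidden units (assumption: binary data).
    p = sigmoid(a + h @ W.T)
    return (rng.random(p.shape) < p).astype(float)

def sample_labels(rng, h, D, c):
    # One-hot labels drawn from a softmax over label units.
    p = softmax(c + h @ D.T)
    u = rng.random((p.shape[0], 1))
    idx = np.minimum((u > p.cumsum(axis=1)).sum(axis=1), p.shape[1] - 1)
    return np.eye(p.shape[1])[idx]

def predict_labels(rng, v, params, k=10):
    # Label prediction: visible layer clamped to the data, labels start random,
    # hidden layer and labels are sampled alternately for k steps.
    W, D, a, b, c = params
    n_labels = D.shape[0]
    l = np.eye(n_labels)[rng.integers(n_labels, size=v.shape[0])]
    for _ in range(k):
        h = sample_hidden(rng, v, l, W, D, b)
        l = sample_labels(rng, h, D, c)
    return l

def generate_conditioned(rng, l, params, k=10):
    # Conditional sampling: labels fixed, visible layer starts random,
    # hidden and visible layers are sampled alternately for k steps.
    W, D, a, b, c = params
    v = rng.integers(2, size=(l.shape[0], W.shape[0])).astype(float)
    for _ in range(k):
        h = sample_hidden(rng, v, l, W, D, b)
        v = sample_visible(rng, h, W, a)
    return v
```

With trained parameters, `predict_labels(rng, v_data, params, k=10)` returns one-hot label samples for the clamped data, and `generate_conditioned(rng, one_hot_labels, params, k=10)` returns one visible configuration per requested label. With untrained parameters the outputs are only random draws, so the sketch shows the sampling pattern rather than the quality reported in the paper.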