Fast and Functional Structured Data Generators Rooted in Out-of-Equilibrium Physics
Alessandra Carbone, Aurélien Decelle, Lorenzo Rosset, Beatriz Seoane
TL;DR
This work tackles the challenge of generating high-quality, label-specific data from highly structured datasets with energy-based models (EBMs), where conventional training suffers from slow Markov chain Monte Carlo (MCMC) mixing. The authors introduce an out-of-equilibrium training scheme, F&F, for Restricted Boltzmann Machines (RBMs). It couples two gradient terms so that the model both generates samples rapidly, in a diffusion-like fashion, and predicts labels accurately using only a small number of MCMC steps ($k=10$). Across MNIST, human-genome variations (HGD), GH30 enzyme sequences, SAM riboswitch RNAs, and Classical Music Piano Composer data, F&F-10 RBMs achieve high classification accuracy and diverse, label-consistent generation. They outperform the standard PCD-100 approach, which exhibits instability or long thermalization times on several datasets. The authors validate the biological relevance of generated protein sequences with structural confidence metrics (pLDDT from ESMFold) and show the method’s potential to extend to more complex energy functions, offering a practical, diffusion-like alternative for fast, reliable EBM generation on structured data.
Abstract
In this study, we address the challenge of using energy-based models to produce high-quality, label-specific data in complex structured datasets, such as population genetics, RNA, or protein sequence data. Traditional training methods encounter difficulties due to inefficient Markov chain Monte Carlo mixing, which affects the diversity of synthetic data and increases generation times. To address these issues, we use a novel training algorithm that exploits non-equilibrium effects. This approach, applied to the Restricted Boltzmann Machine, improves the model's ability to correctly classify samples and generate high-quality synthetic data in only a few sampling steps. The effectiveness of this method is demonstrated by its successful application to four different types of data: handwritten digits, mutations of human genomes classified by continental origin, functionally characterized sequences of an enzyme protein family, and homologous RNA sequences from specific taxonomies.
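
To make "only a few sampling steps" concrete, the following is a minimal sketch of label-conditioned generation with a binary Restricted Boltzmann Machine, using $k=10$ alternating (block) Gibbs steps started from random visible states. It is an illustration under stated assumptions, not the authors' implementation: the function name `gibbs_generate`, the parameter names `W`, `D`, `a`, `b`, and the choice of clamping a one-hot label that couples to the hidden units through `D` are all hypothetical, and the parameters in the demo are random rather than trained. The out-of-equilibrium training procedure itself is not shown.

```python
import numpy as np


def sigmoid(x):
    """Logistic function, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-x))


def gibbs_generate(W, D, a, b, label, k=10, n_chains=4, rng=None):
    """Run k block-Gibbs steps of a binary RBM with the label held fixed.

    W: (n_vis, n_hid) visible-hidden weights (hypothetical name)
    D: (n_labels, n_hid) label-hidden weights (hypothetical name)
    a: (n_vis,) visible biases, b: (n_hid,) hidden biases
    label: integer class index, clamped during sampling (an assumption)
    """
    rng = np.random.default_rng() if rng is None else rng
    n_vis, _ = W.shape

    # Start every chain from a uniformly random binary configuration.
    v = rng.integers(0, 2, size=(n_chains, n_vis)).astype(float)

    # The label is clamped, so its contribution to the hidden input is constant.
    label_field = D[label]

    for _ in range(k):
        # Sample all hidden units given the visible units and the label.
        p_h = sigmoid(v @ W + label_field + b)
        h = (rng.random(p_h.shape) < p_h).astype(float)
        # Sample all visible units given the hidden units.
        p_v = sigmoid(h @ W.T + a)
        v = (rng.random(p_v.shape) < p_v).astype(float)
    return v


if __name__ == "__main__":
    # Demo with random (untrained) parameters, for shape checking only.
    rng = np.random.default_rng(0)
    n_vis, n_hid, n_labels = 8, 5, 3
    W = rng.normal(0.0, 0.1, size=(n_vis, n_hid))
    D = rng.normal(0.0, 0.1, size=(n_labels, n_hid))
    a = np.zeros(n_vis)
    b = np.zeros(n_hid)
    samples = gibbs_generate(W, D, a, b, label=2, k=10, n_chains=4, rng=rng)
    print(samples.shape)
```

With a model trained for equilibrium sampling, chains like these typically need many more steps to mix; the point made in the abstract is that training which exploits non-equilibrium effects makes a short, fixed-length run of this kind sufficient.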
