Data augmentation using diffusion models to enhance inverse Ising inference
Yechan Lim, Sangwon Lee, Junghyo Jo
TL;DR
This work addresses the problem of inferring Ising-model parameters from limited data by using diffusion models for data augmentation. It combines standard maximum likelihood estimation and the erasure-machine framework with a diffusion-model generator, which learns the score function and therefore produces high-quality synthetic samples without computing partition functions. The authors show that augmenting small datasets with diffusion-generated samples can substantially improve inverse Ising inference on synthetic Sherrington–Kirkpatrick data and on real neural activity data, with energy-variance criteria used to avoid overfitting. The results suggest that diffusion models are a versatile tool for physics-inspired data augmentation, with potential extensions to multi-state, temporal, and discrete data domains, including protein multiple sequence alignments (MSAs) and time-series analyses.
Abstract
Identifying model parameters from observed configurations is a fundamental challenge in data science, especially when data are limited. Recently, diffusion models have emerged as a novel paradigm in generative machine learning, capable of producing new samples that closely mimic observed data. These models learn the gradient of the log-probability (the score function), which bypasses the cumbersome calculation of partition functions over all possible configurations. We explore whether diffusion models can enhance parameter inference by augmenting small datasets. Our findings demonstrate this potential through a synthetic task involving inverse Ising inference and a real-world application of reconstructing missing values in neural activity data. This study serves as a proof of concept for using diffusion models for data augmentation in physics-related problems, thereby opening new avenues in data science.
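The partition-function point in the abstract can be made concrete with a short worked derivation. The notation here is an assumption chosen for illustration and may differ from the paper's: spins s_i = ±1, couplings J_ij, fields h_i, inverse temperature absorbed into the parameters, and a continuous variable x standing in for the noised data on which a diffusion model operates. The first line is the standard Ising distribution, the second is the usual maximum-likelihood gradient for a coupling, and the third is the score of a generic energy-based density.

```latex
\begin{align}
p(\mathbf{s}) &= \frac{e^{-E(\mathbf{s})}}{Z}, \qquad
E(\mathbf{s}) = -\sum_{i<j} J_{ij} s_i s_j - \sum_i h_i s_i, \qquad
Z = \sum_{\mathbf{s}'} e^{-E(\mathbf{s}')}, \\
\frac{\partial \langle \log p(\mathbf{s}) \rangle_{\text{data}}}{\partial J_{ij}}
  &= \langle s_i s_j \rangle_{\text{data}} - \langle s_i s_j \rangle_{\text{model}}, \\
\nabla_{\mathbf{x}} \log p(\mathbf{x})
  &= -\nabla_{\mathbf{x}} E(\mathbf{x}) - \nabla_{\mathbf{x}} \log Z
   = -\nabla_{\mathbf{x}} E(\mathbf{x}).
\end{align}
```

The model average in the second line runs over all 2^N spin configurations, which is what makes direct maximum likelihood costly for large systems. In the third line, Z does not depend on x, so its gradient vanishes and the score can be learned without ever evaluating Z.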
