Data augmentation using diffusion models to enhance inverse Ising inference

Yechan Lim, Sangwon Lee, Junghyo Jo

TL;DR

This work tackles the challenge of inferring Ising-model parameters from limited data by using data augmentation based on diffusion models. It combines standard maximum likelihood estimation with the erasure-machine framework and a diffusion-model generator that learns the score function to produce high-quality synthetic samples without computing partition functions. The authors demonstrate that augmenting small datasets with diffusion-generated samples can substantially improve inverse Ising inference on synthetic Sherrington-Kirkpatrick data and on real neural activity data, using energy-variance criteria to avoid overfitting. The results suggest that diffusion models are a versatile tool for physics-inspired data augmentation, with potential extensions to multi-state, temporal, and discrete data domains, including protein multiple sequence alignments (MSAs) and time-series analyses.

Abstract

Identifying model parameters from observed configurations poses a fundamental challenge in data science, especially with limited data. Recently, diffusion models have emerged as a novel paradigm in generative machine learning, capable of producing new samples that closely mimic observed data. These models learn the gradient of model probabilities, bypassing the need for cumbersome calculations of partition functions across all possible configurations. We explore whether diffusion models can enhance parameter inference by augmenting small datasets. Our findings demonstrate this potential through a synthetic task involving inverse Ising inference and a real-world application of reconstructing missing values in neural activity data. This study serves as a proof-of-concept for using diffusion models for data augmentation in physics-related problems, thereby opening new avenues in data science.
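The remark that learning the gradient of model probabilities avoids the partition function can be made concrete with a short derivation. This is a sketch for a generic Boltzmann-form density over continuous variables $x$; the energy $E(x)$ and normalization $Z$ are notation assumed here for illustration, not necessarily the paper's own formulation.

```latex
% Assumed notation: E(x) is a generic energy, Z its normalization constant.
\begin{align}
p(x) &= \frac{e^{-E(x)}}{Z}, \qquad Z = \int e^{-E(x')}\,dx', \\
\nabla_x \log p(x) &= -\nabla_x E(x) - \nabla_x \log Z = -\nabla_x E(x).
\end{align}
```

Because $Z$ does not depend on $x$, its gradient vanishes, so the score $\nabla_x \log p(x)$ can be learned or evaluated without ever summing or integrating over all configurations.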

Paper Structure

This paper contains 10 sections, 18 equations, 6 figures.

Figures (6)

  • Figure 1: (Color Online) Schematic diagram of diffusion models. Diffusion models represent a scalar probability function (colored contours) in terms of vector flows (gray arrows). The direction of the vector flows is opposite to the noise vector $\epsilon_t$, which transports the previous $x_{t-1}$ to the present $x_t$. Noise schedule $\alpha_t$ controls the degree of diffusion.
  • Figure 2: (Color Online) Inverse Ising inference with augmented data. (a) Heatmap of interaction parameters, sampled from a normal distribution $\mathcal{N}(0, g^{2}/n)$ with $g=1$, for the Sherrington-Kirkpatrick model of dimension $n=40$. (b-d) Inferred versus true parameter values, comparing inference using only $M=4,000$ observed data (filled black circles) with inference using $M^+=100,000$ augmented data (orange crosses) generated by the diffusion model after (b) 1,000, (c) 32,000, and (d) 1,000,000 iterations. (e) The inference performance, measured by the mean square error (MSE) between inferred and true parameter values, varies depending on the degree of learning of the diffusion model. The MSE from the inference using only observed data is shown as a reference by the dashed line. (f) Variances of the energies of observed binary patterns in the training and test sets, as well as the energies of the augmented data, are measured and compared at different learning stages of the diffusion model, with all energies measured using the $\{b_i, w_{ij}\}$ values inferred solely from the training data. Dashed and dotted lines represent the reference energy variances for the observed data in the training and test sets, respectively.
  • Figure 3: Inference with different strengths of bias and interaction parameters. The mean square error (MSE) of the inferred parameter values was measured as a function of the strength $g$ of the bias and interaction parameters, which were drawn from a normal distribution $\mathcal{N}(0, g^{2}/n)$ with system dimension $n$. Inference was performed using only $M=4,000$ observed data (obs; empty circles) and using $M^+=100,000$ augmented data (aug; filled circles). The standard deviations of the MSE from 10 ensembles are too small to be visible at the scale of the symbols.
  • Figure 4: (Color Online) Inference performance versus the size of observed and augmented data. (a-c) Mean square error (MSE) of inferred parameter values as a function of the size $M$ of observed data. After training the diffusion model with observed data, the model augmented the data to $M^+=300,000$. The inference error is compared between using only the observed data (empty circles and dashed line) and using the augmented data (filled circles and solid line). All true parameters were drawn from a normal distribution $\mathcal{N}(0, g^{2}/n)$ with $g=1$. (d-f) Inference performance as a function of the size $M^+$ of augmented data. The diffusion model for augmenting data was trained with $M=4,000$ observed data (d, e) and $M=8,000$ observed data (f). The horizontal dashed line is the reference MSE of the inference using only the observed data (obs). Results are shown for inference using only the augmented data (aug; filled black circles) and using both observed and augmented data (aug+obs; empty orange circles). The dimensions for the Sherrington-Kirkpatrick model are $n=20$ (a, d), $n=40$ (b, e), and $n=60$ (c, f). The standard deviations of the MSE from 10 ensembles are too small to be visible at the scale of the symbols.
  • Figure 5: (Color Online) Inference performance of various generative models. (a) Inferred versus true parameter values, using augmented data from a variational autoencoder (VAE, blue triangles), a restricted Boltzmann machine (RBM, green squares), and a diffusion model (Diff, orange crosses). For comparison, inference results using only the observed data are shown as filled black circles. The generative models are trained with $M = 2,000$ observed data, and parameter inference is performed using $M^+=100,000$ augmented data. (b) The mean square error (MSE) of the inferred parameter values as a function of the size $M$ of observed data. The standard deviations of the MSE from 10 ensembles are too small to be visible at the scale of the symbols. The dimension for the Sherrington-Kirkpatrick model is $n=40$. All true parameters are drawn from a normal distribution $\mathcal{N}(0, g^{2}/n)$ with $g=1$.
  • ...and 1 more figure
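
The quantities that recur in the captions above (parameters drawn from $\mathcal{N}(0, g^{2}/n)$, the energy variance of a set of binary patterns, and the MSE between inferred and true parameters) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the energy convention $E(s) = -\sum_i b_i s_i - \sum_{i<j} w_{ij} s_i s_j$ with spins $s_i \in \{-1, +1\}$ is the standard Ising form assumed here rather than taken from the text.

```python
import numpy as np

def sample_sk_parameters(n, g=1.0, seed=0):
    """Draw biases b_i and symmetric couplings w_ij from N(0, g^2 / n)."""
    rng = np.random.default_rng(seed)
    scale = g / np.sqrt(n)  # standard deviation, so the variance is g^2 / n
    b = rng.normal(0.0, scale, size=n)
    upper = np.triu(rng.normal(0.0, scale, size=(n, n)), k=1)
    w = upper + upper.T  # symmetric matrix with zero diagonal
    return b, w

def energies(samples, b, w):
    """Energy of each row of `samples` (entries +/-1) under the assumed convention.

    The factor 0.5 undoes the double counting of each pair (i, j) in s^T w s.
    """
    pair_term = 0.5 * np.einsum("mi,ij,mj->m", samples, w, samples)
    return -(samples @ b) - pair_term

def parameter_mse(b_hat, w_hat, b, w):
    """Mean square error over biases and the upper-triangular couplings."""
    iu = np.triu_indices(len(b), k=1)
    estimate = np.concatenate([b_hat, w_hat[iu]])
    truth = np.concatenate([b, w[iu]])
    return float(np.mean((estimate - truth) ** 2))

# Example: energy variance of random (not Boltzmann-distributed) patterns.
b, w = sample_sk_parameters(n=40, g=1.0)
rng = np.random.default_rng(1)
patterns = rng.choice([-1, 1], size=(1000, 40))
energy_variance = energies(patterns, b, w).var()
```

In the setting described by Figure 2(f), `energies` would be evaluated with the parameters inferred from the training data, and the variance compared across the training, test, and augmented sets; how the authors turn that comparison into a stopping criterion is not specified in this excerpt.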