Combined Representation and Generation with Diffusive State Predictive Information Bottleneck

Richard John, Yunrui Qiu, Lukas Herron, Pratyush Tiwary

TL;DR

The paper addresses two linked problems: learning informative representations of high‑dimensional molecular distributions and generating samples from them, under limited data and varying thermodynamic conditions. It introduces Diffusive State Predictive Information Bottleneck (D‑SPIB), which combines a time‑lagged SPIB representation with a score‑based diffusion prior learned via a VP‑SDE and conditioned on temperature embeddings. The resulting objective, $\mathcal{L}_{\mathrm{D-SPIB}}$, combines a prediction term with a diffusion‑based regularizer, so that representation learning and generation are trained jointly, thermodynamic dependencies are captured, and the model can extrapolate to unseen temperatures. Empirically, D‑SPIB outperforms vanilla SPIB on a three‑hole analytical potential and accurately interpolates temperature‑dependent structure in a 2D LJ7 system, demonstrating data‑efficient thermodynamic modeling with generative capabilities.
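To make the VP‑SDE ingredient concrete, the sketch below noises latent samples with the closed‑form perturbation kernel of a variance‑preserving SDE with a linear noise schedule. It is a minimal illustration, not the paper's implementation: the function names, the schedule endpoints `beta_min=0.1` and `beta_max=20.0` (common defaults in the score‑based modeling literature), and the toy latent data are all assumptions, and the temperature conditioning is omitted.

```python
import numpy as np

def vp_sde_marginal(t, beta_min=0.1, beta_max=20.0):
    """Mean scale and std of z_t given z_0 for a linear-schedule VP-SDE.

    beta_min and beta_max are assumed values, not taken from the paper.
    """
    # Integral of beta(s) = beta_min + s * (beta_max - beta_min) from 0 to t.
    int_beta = beta_min * t + 0.5 * (beta_max - beta_min) * t**2
    mean_scale = np.exp(-0.5 * int_beta)
    std = np.sqrt(1.0 - np.exp(-int_beta))
    return mean_scale, std

def perturb(z0, t, rng):
    """Sample z_t ~ N(mean_scale * z0, std^2 I) and return it with the noise used."""
    mean_scale, std = vp_sde_marginal(t)
    eps = rng.standard_normal(z0.shape)
    return mean_scale * z0 + std * eps, eps

# Toy 2D latent samples (hypothetical stand-in for encoded data z_0).
rng = np.random.default_rng(0)
z0 = 0.3 * rng.standard_normal((1000, 2)) + 2.0
zt, eps = perturb(z0, 1.0, rng)
```

At `t = 0` the kernel returns the data unchanged, and as `t` approaches 1 the samples approach a standard normal, which is the easily sampled distribution a reverse‑time sampler would start from.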

Abstract

Generative modeling becomes increasingly data-intensive in high-dimensional spaces. In molecular science, where data collection is expensive and important events are rare, compression to lower-dimensional manifolds is especially important for various downstream tasks, including generation. We combine a time-lagged information bottleneck, designed to characterize important molecular representations, with a diffusion model in one joint training objective. The resulting protocol, which we term Diffusive State Predictive Information Bottleneck (D-SPIB), balances representation learning and generation aims in one flexible architecture. Additionally, the model can combine temperature information from different molecular simulation trajectories to learn a coherent and useful internal representation of thermodynamics. We benchmark D-SPIB on multiple molecular tasks and showcase its potential for exploring physical conditions outside the training set.

Paper Structure

This paper contains 11 sections, 10 equations, 2 figures, 2 tables, 1 algorithm.

Figures (2)

  • Figure 1: Diffusive SPIB architecture. Input data $\mathbf{X}$ is encoded to latent $\mathbf{z}_0$ and decoded to state label $\mathbf{y}$. The distribution of the encoded variable $\mathbf{z}_0$ is regularized by the IB prior distribution generated from $\mathbf{z}_1$ using a diffusion model with an easily sampled, pre-defined generative prior distribution. Blow-ups show the diffusion trajectories, including the reference forward trajectories used for training (white), the learned forward trajectories (black), and the backward trajectories employed for sampling (black).
  • Figure 2: A) Distributions of the encoded validation data and of the data generated by D-SPIB for the three-hole potential. B) Free energy profiles along a D-SPIB latent dimension for generated data (colored, solid) and molecular dynamics data (black, dashed) in the multi-temperature LJ7 experiment.
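Free energy profiles along a one-dimensional coordinate, such as those in Figure 2B, are commonly estimated from samples through $F(z) = -k_B T \ln p(z)$, with $p(z)$ approximated by a normalized histogram. The sketch below shows that standard estimator; it is not necessarily how the authors produced the figure, and the function name, bin count, reduced units (`kT=1.0`), and Gaussian toy samples are assumptions.

```python
import numpy as np

def free_energy_profile(samples, bins=50, kT=1.0):
    """Estimate F(z) = -kT * ln p(z) from 1D samples with a histogram.

    Returns bin centers and F shifted so its minimum is zero.
    Empty bins yield F = inf.
    """
    hist, edges = np.histogram(samples, bins=bins, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    with np.errstate(divide="ignore"):
        F = -kT * np.log(hist)
    # Shift by the lowest finite value so the global minimum sits at zero.
    F -= F[np.isfinite(F)].min()
    return centers, F

# Toy samples along a latent coordinate (hypothetical stand-in for real data).
rng = np.random.default_rng(0)
z = rng.standard_normal(100_000)
centers, F = free_energy_profile(z)
```

Applying the same function to generated samples and to molecular dynamics samples at a given temperature yields two curves that can be overlaid for comparison.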