Energy Matching: Unifying Flow Matching and Energy-Based Models for Generative Modeling

Michal Balcerak, Tamaz Amiranashvili, Antonio Terpin, Suprosanna Shit, Lea Bogensperger, Sebastian Kaltenbach, Petros Koumoutsakos, Bjoern Menze

TL;DR

Energy Matching addresses the limitations of flow-based and diffusion models in integrating priors and partial observations by unifying transport dynamics with an energy-based likelihood via a time-independent scalar potential $V_\theta(x)$. Grounded in the Jordan–Kinderlehrer–Otto framework, it uses a two-regime training procedure that transports samples from noise to the data manifold with an OT-like flow, then concentrates probability mass around data via a Boltzmann equilibrium $\rho_{eq}(x) \propto \exp(-V_\theta(x)/\varepsilon_{\max})$. The approach yields a single scalar energy whose gradient drives efficient generation and serves as a flexible prior for inverse problems, with additional interaction energies enabling controlled diversity; it reports state-of-the-art fidelity on CIFAR-10 and ImageNet relative to prior EBMs while avoiding auxiliary generators. Moreover, Energy Matching provides direct access to the data likelihood structure and enables LID estimation through the Hessian of $V_\theta$, offering insights with fewer approximations than diffusion methods. Overall, the framework broadens the practicality and adoption of EBMs by delivering simulation-free transport, explicit likelihood modeling, and versatile priors for generative modeling across diverse domains.

Abstract

Current state-of-the-art generative models map noise to data distributions by matching flows or scores. A key limitation of these models is their inability to readily integrate available partial observations and additional priors. In contrast, energy-based models (EBMs) address this by incorporating corresponding scalar energy terms. Here, we propose Energy Matching, a framework that endows flow-based approaches with the flexibility of EBMs. Far from the data manifold, samples move from noise to data along irrotational, optimal transport paths. As they approach the data manifold, an entropic energy term guides the system into a Boltzmann equilibrium distribution, explicitly capturing the underlying likelihood structure of the data. We parameterize these dynamics with a single time-independent scalar field, which serves as both a powerful generator and a flexible prior for effective regularization of inverse problems. The present method substantially outperforms existing EBMs on CIFAR-10 and ImageNet generation in terms of fidelity, while retaining simulation-free training of transport-based approaches away from the data manifold. Furthermore, we leverage the flexibility of the method to introduce an interaction energy that supports the exploration of diverse modes, which we demonstrate in a controlled protein generation setting. This approach learns a scalar potential energy, without time conditioning, auxiliary generators, or additional networks, marking a significant departure from recent EBM methods. We believe this simplified yet rigorous formulation significantly advances EBM capabilities and paves the way for their wider adoption in generative modeling in diverse domains.
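To make the role of a time-independent scalar field concrete, the sketch below draws samples from a Boltzmann density proportional to $\exp(-V(x)/\varepsilon)$ using unadjusted Langevin steps. It is a minimal illustration and not the paper's training or sampling procedure: the quadratic `potential`, its analytic gradient, the constant temperature `eps`, the step size, and the number of steps are all assumptions, chosen so that the equilibrium is a known Gaussian.

```python
import numpy as np


def potential(x):
    """Toy stand-in for a learned V_theta(x): a quadratic bowl (assumption)."""
    return 0.5 * np.sum(x**2, axis=-1)


def grad_potential(x):
    """Analytic gradient of the toy potential above."""
    return x


def langevin_sample(x0, eps, step=0.01, n_steps=1000, seed=0):
    """Unadjusted Langevin dynamics with a fixed temperature eps.

    Update: x <- x - step * grad V(x) + sqrt(2 * eps * step) * noise.
    For small steps the samples approach the Boltzmann density
    proportional to exp(-V(x) / eps).
    """
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x - step * grad_potential(x) + np.sqrt(2.0 * eps * step) * noise
    return x


# Start far from equilibrium (broad Gaussian noise) and relax toward exp(-V / eps).
init = np.random.default_rng(1).normal(scale=3.0, size=(5000, 2))
samples = langevin_sample(init, eps=0.5)
```

For this toy potential the equilibrium is an isotropic Gaussian with variance `eps` per coordinate, so `samples.var()` should land close to 0.5, up to discretization and sampling error. Because the field does not depend on time, the same update can be iterated for as long as desired.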

Paper Structure

This paper contains 43 sections, 18 equations, 8 figures, 5 tables, 3 algorithms.

Figures (8)

  • Figure 1: Trajectories (green lines) of samples traveling from a noise distribution (black dots; here, a Gaussian mixture model) to a data distribution (blue dots; here, two moons as in Tong et al.) under four different methods: Action Matching (Neklyudov et al., 2023), Flow Matching (OT-CFM; Tong et al.), an EBM trained via contrastive divergence (Hinton, 2002), and our proposed Energy Matching. We highlight several individual trajectories in red to illustrate their distinct behaviors. Both Action Matching and Flow Matching learn time-dependent transports and are not trained for traversing the data manifold. Conversely, the EBM and Energy Matching are driven by time-independent fields that can be iterated indefinitely, allowing trajectories to navigate across modes. While samples from the EBM often require additional steps to equilibrate (see, e.g., the visible mode collapses that slow down sampling from the data manifold), Energy Matching directs samples toward the data distribution in "straight" paths, without hindering the exploration of the data manifold.
  • Figure 2: Controlled inpainting for diverse reconstructions. On the left is the masked face. On the right are two pairs of reconstructions: the top pair without the interaction term and the bottom pair with it. The interaction term applies in the solid red square (where $B$ has ones), and the measurement matrix $A$ is the dotted blue square (zeros inside, ones outside). By encouraging $x_1$ and $x_2$ to differ in the target region, the interaction yields a wider range of completions while preserving fidelity.
  • Figure 3: Fitness–diversity trade-off for protein inverse design on the AAV Medium (left) and Hard (right) benchmarks. We compare our Energy Matching method (blue), with diversity explicitly controlled by a repulsion strength parameter ($\propto\frac{1}{\sigma^2}$), against leading flow-based (purple), score-based (orange), and other non-likelihood methods (black). Fitness measures how well generated sequences satisfy the target property (predicted viral packaging efficiency), while diversity quantifies the average Levenshtein distance between sequences in each generated batch.
  • Figure 4: Qualitative results for LID estimation using the Hessian spectrum of $V_\theta(x)$. Left: Spectrum for a low-LID image. Right: Spectrum for a high-LID image. The eigenvalues quantify curvature along principal directions (eigenvectors). A degenerate spectrum (many near-zero eigenvalues, marked in red) indicates locally "flat" regions, revealing the LID. Intuitively, higher image complexity often corresponds to a higher LID.
  • Figure 5: Visualization of the energy $V_\theta(x)$ landscapes driving the samples from eight Gaussians to two moons. See Figure 1 for the 2D perspective. (a) The OT flow loss enforces zero curvature in $V_{\theta}(x)$ along the trajectories to the target. (b) Around the two moons, the curvature of $V_{\theta}(x)$ is adjusted to approximate $\log p_{\text{moons}}(x) \propto V_{\theta}(x)$ while remaining close to the pretrained landscape elsewhere. Combining these objectives yields a potential energy landscape that is both efficient for sampling and representative of the underlying target data distribution. (c) An EBM trained with a contrastive divergence loss is shown for comparison. It exhibits visible mode collapse, which slows equilibration, and a less regular landscape away from the data, since exploring that region requires many simulations.
  • ...and 3 more figures
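
The Figure 4 caption reads the LID off the Hessian spectrum of $V_\theta(x)$ by counting near-zero eigenvalues. The sketch below illustrates that counting step under stated assumptions: `toy_potential` is a hypothetical 4D potential that is flat along two directions, the Hessian is approximated with central finite differences rather than automatic differentiation, and the relative threshold `rel_tol` is an arbitrary choice, not a value from the paper.

```python
import numpy as np


def toy_potential(x):
    """Hypothetical potential in 4D, flat along the first two coordinates.

    Its minimum set is the 2D plane x[2] = x[3] = 0, standing in for a data manifold.
    """
    return 0.5 * (x[2] ** 2 + x[3] ** 2)


def numerical_hessian(f, x, h=1e-3):
    """Central finite-difference Hessian of a scalar function f at point x."""
    d = x.size
    hess = np.zeros((d, d))
    eye = np.eye(d)
    for i in range(d):
        for j in range(d):
            ei, ej = h * eye[i], h * eye[j]
            hess[i, j] = (
                f(x + ei + ej) - f(x + ei - ej) - f(x - ei + ej) + f(x - ei - ej)
            ) / (4.0 * h * h)
    return 0.5 * (hess + hess.T)  # symmetrize against rounding error


def estimate_lid(f, x, rel_tol=1e-3):
    """Count near-zero Hessian eigenvalues (flat directions) as the LID estimate."""
    eigvals = np.linalg.eigvalsh(numerical_hessian(f, x))
    scale = max(np.abs(eigvals).max(), 1e-12)
    return int(np.sum(np.abs(eigvals) < rel_tol * scale)), eigvals


point_on_manifold = np.array([0.3, -1.2, 0.0, 0.0])
lid, spectrum = estimate_lid(toy_potential, point_on_manifold)
```

For this toy potential the spectrum is approximately (0, 0, 1, 1), so the estimate is 2, matching the dimension of the flat plane. With a learned network in place of `toy_potential`, the Hessian would more plausibly come from automatic differentiation, and the threshold would need tuning for the spectrum at hand.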

Theorems & Definitions (1)

  • Remark 2.1: OT solver