Graph Flow Matching: Enhancing Image Generation with Neighbor-Aware Flow Fields

Md Shahriar Rahim Siddiqui, Moshe Eliasof, Eldad Haber

TL;DR

Problem: flow matching networks for image generation predict each sample's velocity independently and ignore local neighborhood structure. Approach: Graph Flow Matching (GFM) decomposes the velocity field as $\mathbf{v}(\mathbf{x}, t) = \mathbf{v}_{\text{react}}(\mathbf{x}, t) + \mathbf{v}_{\text{diff}}(\mathbf{x}, t; \mathcal{N}(\mathbf{x}, t))$ and implements the diffusion term with a graph neural module operating on latent codes. Contributions: a modular, backbone-agnostic diffusion term instantiated via MPNN or GPS, validated in the latent space of a pretrained VAE on five datasets, with consistent FID and recall gains at modest parameter overhead and no changes to training losses or solvers. Findings: across LSUN Church, LSUN Bedroom, FFHQ, AFHQ-Cat, and CelebA-HQ, GFM improves sample quality while preserving sampling efficiency; ablations indicate that graph structure, not simply extra capacity, drives the gains. Significance: the results suggest that integrating local geometric priors into continuous-time generative modeling yields robust, scalable improvements in high-fidelity image synthesis.

Abstract

Flow matching casts sample generation as learning a continuous-time velocity field that transports noise to data. Existing flow matching networks typically predict each point's velocity independently, considering only its location and time along its flow trajectory, and ignoring neighboring points. However, this pointwise approach may overlook correlations between points along the generation trajectory that could enhance velocity predictions, thereby improving downstream generation quality. To address this, we propose Graph Flow Matching (GFM), a lightweight enhancement that decomposes the learned velocity into a reaction term -- any standard flow matching network -- and a diffusion term that aggregates neighbor information via a graph neural module. This reaction-diffusion formulation retains the scalability of deep flow models while enriching velocity predictions with local context, all at minimal additional computational cost. Operating in the latent space of a pretrained variational autoencoder, GFM consistently improves Fréchet Inception Distance (FID) and recall across five image generation benchmarks (LSUN Church, LSUN Bedroom, FFHQ, AFHQ-Cat, and CelebA-HQ at $256\times256$), demonstrating its effectiveness as a modular enhancement to existing flow matching architectures.
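The reaction-diffusion split described in the abstract can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration: `reaction_velocity` is a toy closed-form stand-in for a pointwise flow matching network, and `diffusion_velocity` replaces the paper's learned graph module (MPNN or GPS) with a fixed pull toward the mean of each latent's `k` nearest neighbors in the batch. The names `gfm_velocity`, `euler_sample`, `k`, and `strength` are hypothetical and not taken from the paper.

```python
import numpy as np

def reaction_velocity(x, t):
    """Toy stand-in for v_react(x, t): each row depends only on itself and t."""
    return (1.0 - t) * np.tanh(x) - t * x

def diffusion_velocity(x, t, k=3, strength=0.1):
    """Toy stand-in for v_diff(x, t; N(x, t)).

    Assumption: a fixed k-nearest-neighbor mean pull replaces the learned
    graph module; t is accepted for interface parity but unused here.
    """
    n = x.shape[0]
    k = min(k, n - 1)
    if k <= 0:
        return np.zeros_like(x)  # a lone sample has no neighbors
    # Pairwise squared distances between the latents in the batch.
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)  # exclude each point from its own neighborhood
    nbrs = np.argsort(d2, axis=1)[:, :k]  # indices of the k nearest neighbors, shape (n, k)
    nbr_mean = x[nbrs].mean(axis=1)       # neighborhood mean, shape (n, d)
    return strength * (nbr_mean - x)

def gfm_velocity(x, t):
    """v(x, t) = v_react(x, t) + v_diff(x, t; N(x, t))."""
    return reaction_velocity(x, t) + diffusion_velocity(x, t)

def euler_sample(x0, velocity, steps=20):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with forward Euler."""
    x = x0.copy()
    dt = 1.0 / steps
    for i in range(steps):
        x = x + dt * velocity(x, i * dt)
    return x

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 4))    # batch of 8 noise latents of dimension 4
x1 = euler_sample(x0, gfm_velocity)  # latents after transport to t=1
```

Because the diffusion term is simply added to the reaction term, the solver loop is the same one a pointwise model would use; only the velocity function changes, which is the sense in which the enhancement is modular.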

Paper Structure

This paper contains 31 sections, 10 equations, 8 figures, 15 tables.

Figures (8)

  • Figure 1: Graph flow matching enriches the flow trajectory from the initial distribution ($t=0$) to the target distribution ($t=1$). At each intermediate time $t$ (shown as slices), it connects nodes $\mathbf{x}$ (shown as dots), which are latent vectors (VAE codes) of distinct images, using attention-based similarity. The flow network output for each node $\mathbf{x}$ is its flow velocity $\mathbf{v}(\mathbf{x}, t)$. Each node at $t=0$ represents a Gaussian noise image, while each node at $t=1$ represents a generated image. Dashed edges indicate lower attention weights.
  • Figure 2: Neighbor-aware flow matching enhances image generation. FFHQ samples ($256\times256$) generated using the same random seed by: (top) baseline ADM U-Net [dao2023flow], (middle) ADM with MPNN-based correction, and (bottom) ADM with a GPS-based correction module [GPS_rampavsek2022recipe]. GFM variants produce more coherent facial features and sharper details compared to the baseline model.
  • Figure 3: LSUN Bedroom and LSUN Church samples ($256\times256$) generated using the same random seed by: (top) baseline DiT-L/2 [dao2023flow], (middle) DiT-L/2 with MPNN-based correction, and (bottom) DiT-L/2 with GPS-based graph correction [GPS_rampavsek2022recipe]. GFM variants generate more complete spatial structures and sharper boundaries compared to the baseline model.
  • Figure 4: Randomly generated samples of LSUN Church using DiT-L/2+MPNN.
  • Figure 5: Randomly generated samples of LSUN Bedroom using DiT-L/2+MPNN.
  • ...and 3 more figures
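
The graph in the Figure 1 caption, where latents at one time slice are linked by attention-based similarity and weaker links are drawn dashed, can be sketched roughly as follows. This is an illustrative assumption rather than the paper's construction: raw latents serve as both queries and keys (a learned module would use projections), and the names `attention_graph`, `split_edges`, `temperature`, and `threshold` are hypothetical.

```python
import numpy as np

def attention_graph(z, temperature=1.0):
    """Row-normalized attention weights between the latents of one time slice.

    Assumption: scaled dot-product scores on raw latents, with self-edges removed.
    """
    n, d = z.shape
    if n < 2:
        raise ValueError("need at least two latents to form edges")
    scores = (z @ z.T) / (temperature * np.sqrt(d))
    np.fill_diagonal(scores, -np.inf)                    # no self-edges
    scores = scores - scores.max(axis=1, keepdims=True)  # numerically stable softmax
    w = np.exp(scores)
    return w / w.sum(axis=1, keepdims=True)

def split_edges(weights, threshold):
    """Separate edges into strong (solid) and weak (dashed) by attention weight."""
    strong = weights >= threshold
    np.fill_diagonal(strong, False)
    weak = (~strong) & (weights > 0)
    return strong, weak

rng = np.random.default_rng(1)
z = rng.standard_normal((6, 4))            # six latents at one intermediate time t
w = attention_graph(z)
strong, weak = split_edges(w, 1.0 / (z.shape[0] - 1))  # threshold at the uniform weight
```

The threshold used here, the weight each neighbor would receive under uniform attention, is an arbitrary choice for the illustration; the figure only states that dashed edges carry lower attention weights.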