
Generative Modeling of Discrete Joint Distributions by E-Geodesic Flow Matching on Assignment Manifolds

Bastian Boll, Daniel Gonzalez-Alvarado, Christoph Schnörr

TL;DR

This work develops a geometry-aware generative framework for discrete distributions by operating continuous normalizing flows on the assignment manifold $\mathcal{W}$ and embedding it into the meta-simplex $\mathcal{S}_N$. Training relies on Riemannian flow matching of $e$-geodesics, yielding representations of general discrete joint distributions as convex mixtures of extremal factorizing distributions. A key innovation is the meta-simplex embedding via $T(W)_{\alpha} = \prod_{i} W_{i,\alpha_i}$, which connects the manifold of simple distributions to the full joint distribution space while preserving a maximum-entropy property. Empirical results on image segmentation and likelihood-based diagnostics demonstrate accurate sample generation, efficient training, and effective out-of-distribution detection, highlighting the method's potential as a scalable alternative for discrete data modeling with principled information-geometric grounding.
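The embedding $T(W)_{\alpha} = \prod_{i} W_{i,\alpha_i}$ can be made concrete with a small NumPy sketch. It assumes that $W$ is stored as a row-stochastic array of shape $(n, c)$ and that labelings $\alpha$ are enumerated in lexicographic order. The function name `embed` and this storage layout are illustrative choices, not taken from the paper.

```python
import itertools
import numpy as np

def embed(W):
    """Map a row-stochastic (n, c) matrix W to the joint vector T(W) of length c**n.

    Entry alpha of the result is prod_i W[i, alpha_i]. Labelings are
    enumerated in lexicographic order (an assumption of this sketch).
    """
    n, c = W.shape
    rows = np.arange(n)
    labelings = itertools.product(range(c), repeat=n)
    return np.array([np.prod(W[rows, list(alpha)]) for alpha in labelings])

# Two binary variables: a factorizing distribution in the interior ...
W = np.array([[0.9, 0.1],
              [0.2, 0.8]])
p = embed(W)          # entries ordered as (0,0), (0,1), (1,0), (1,1)

# ... and a hard assignment, which lands on a vertex of the meta-simplex.
W_hard = np.array([[1.0, 0.0],
                   [0.0, 1.0]])
v = embed(W_hard)     # one-hot at labeling (0, 1)
```

Enumerating all $c^{n}$ labelings is only feasible for tiny $n$. The sketch is meant to illustrate the definition, not to suggest that the full joint vector is ever materialized in practice.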

Abstract

This paper introduces a novel generative model for discrete distributions based on continuous normalizing flows on the submanifold of factorizing discrete measures. Integration of the flow gradually assigns categories and avoids issues that arise when discretizing a latent continuous model, such as rounding and sample truncation. General non-factorizing discrete distributions, which can represent complex statistical dependencies of structured discrete data, can be approximated by embedding the submanifold into the meta-simplex of all joint discrete distributions and by data-driven averaging. Efficient training of the generative model is demonstrated by matching the flow of geodesics of factorizing discrete distributions. Various experiments underline the approach's broad applicability.
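As a rough illustration of the geodesics mentioned in the abstract, the standard e-geodesic between two strictly positive categorical distributions interpolates linearly in log-coordinates and then renormalizes. The sketch below applies this row by row to a factorizing distribution. The helper `smoothed_one_hot` and the smoothing parameter `eps` are assumptions made here to keep the endpoint strictly positive; the excerpt does not specify how the paper treats endpoints on the boundary.

```python
import numpy as np

def e_geodesic(W0, W1, t):
    """Row-wise e-geodesic between strictly positive row-stochastic matrices.

    Interpolates linearly in log-coordinates and renormalizes each row.
    """
    log_w = (1.0 - t) * np.log(W0) + t * np.log(W1)
    log_w -= log_w.max(axis=1, keepdims=True)   # numerical stabilization
    W = np.exp(log_w)
    return W / W.sum(axis=1, keepdims=True)

def smoothed_one_hot(labels, c, eps=1e-3):
    """Hypothetical helper: one-hot rows pulled slightly into the interior."""
    labels = np.asarray(labels)
    W = np.full((len(labels), c), eps / (c - 1))
    W[np.arange(len(labels)), labels] = 1.0 - eps
    return W

n, c = 3, 4
labels = [2, 0, 3]                      # illustrative data sample
W0 = np.full((n, c), 1.0 / c)           # barycenter of the assignment manifold
W1 = smoothed_one_hot(labels, c)        # endpoint near a hard assignment
Wt = e_geodesic(W0, W1, 0.5)            # midpoint of the curve
```

At `t=0` the curve returns `W0`, at `t=1` it returns `W1`, and every intermediate point stays row-stochastic, so the curve never leaves the manifold of factorizing distributions.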

Paper Structure (18 sections, 32 equations, 9 figures)


Figures (9)

  • Figure 1: The tetrahedron represents, in local coordinates, all joint distributions $w\in\Delta^{4}$ (with $4=c^{n}$) of $n=2$ variables taking $c=2$ values. The embedded surface is the assignment manifold $\mathcal{W}$ of all factorizing distributions of 2 binary variables. The blue point represents a target joint distribution $p(y_{1},y_{2})=\frac{1}{100}(45,5,5,45)^{\top}$ with strong statistical dependency, i.e., it is not close to any factorizing distribution. This paper introduces a generative model for representing arbitrary discrete distributions as convex combinations of the hard category assignment distributions located at the extreme points. Figure 2 illustrates the representation of the target distribution (blue point).
  • Figure 2: Visualization of 1000 samples from the target distribution (blue point; cf. Figure 1). Each sample corresponds to an integral curve $T(W(t))$ \eqref{eq:def-T-embedding} of the assignment flow ODE \eqref{eq:def-AF} on the embedded submanifold of factorizing distributions $\mathcal{W}\subseteq\mathcal{S}_{4}$, which can be computed efficiently by geometric integration. The entire assignment flow pushes forward a standard Gaussian reference distribution on the tangent space at the barycenter (red point), which is lifted to the submanifold and transported to the extreme points. The resulting 'weights' represent the blue target distribution as a convex combination. The parametrized vector field of the generative model is trained in a stable and efficient way by matching e-geodesic curves on the assignment manifold, which represent the training data and can be computed in closed form.
  • Figure 3: Overview of the approach: The standard Gaussian reference measure $\mathcal{N}(0,I)$ is pushed forward by the exponential map $\exp_{W}$ from the flat tangent product space $\mathcal{T}_{0}$ to the assignment manifold $\mathcal{W}$, and further to the meta-simplex $\mathcal{S}_{N}$ \eqref{eq:def-SN} by geometrically integrating the assignment flow \eqref{eq:def-AF}. Since the assignment flow converges to the extreme points of $\overline{\mathcal{W}}$, which agree with the extreme points of $\mathcal{S}_{N}$, an approximation $\widetilde{p}(\alpha)$ of a general discrete target measure $p(\alpha)$ underlying given data can be obtained by matching the flow of e-geodesics (corresponding to data samples) and by forming convex combinations of embedded factorized distributions $T(W),\,W\in\mathcal{W}$ via empirical expectation.
  • Figure 4: Left: Random samples drawn from our model trained on discrete Cityscapes segmentation data ($c=8$ classes) at resolution $128\times 256$. Right with blue border: Randomly drawn training data.
  • Figure 5: Histogram of samples from our model fitting the joint distribution of $n=2$ discrete random variables. Left and middle: $c=91$ classes per variable. Right: $c=2$ classes per variable. All three plots show values of the joint distribution. The model is able to fit multi-modal joint distributions that do not factorize into independent marginals. The plot on the right is the joint distribution shown as the blue point in Figures 1 and 2.
  • ...and 4 more figures
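
The target distribution from the caption of Figure 1 makes a convenient worked example. The sketch below first checks that $p=\frac{1}{100}(45,5,5,45)^{\top}$ is far from the product of its marginals, and then recovers $p$ as an average over hard assignments, i.e. over vertices of the simplex. Here the vertices are drawn directly from $p$ with a fixed seed, as a stand-in for the endpoints that a trained flow would produce. The ordering of the four labelings as $(0,0),(0,1),(1,0),(1,1)$ is an assumption of this sketch.

```python
import numpy as np

# Target joint distribution of two binary variables (Figure 1 caption).
p = np.array([45, 5, 5, 45]) / 100.0
P = p.reshape(2, 2)                      # P[a1, a2] under the assumed ordering

# Product of marginals and the KL divergence from it (mutual information, nats).
m1, m2 = P.sum(axis=1), P.sum(axis=0)
indep = np.outer(m1, m2)
mi = float(np.sum(P * np.log(P / indep)))

# The four hard assignments embed to the vertices of the simplex.
vertices = np.eye(4)

# Exact convex combination: the weights are the target probabilities.
recombined = vertices.T @ p

# Empirical version: average 1000 sampled vertices (stand-in for flow endpoints).
rng = np.random.default_rng(0)
idx = rng.choice(4, size=1000, p=p)
estimate = vertices[idx].mean(axis=0)
```

Both marginals are uniform, so the product of marginals puts mass $0.25$ on every labeling, and the strictly positive mutual information `mi` quantifies how far the target is from factorizing. The empirical average `estimate` approaches `p` as the number of samples grows.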