Simplex-to-Euclidean Bijections for Categorical Flow Matching

Bernardo Williams, Victor M. Yeom-Song, Marcelo Hartmann, Arto Klami

TL;DR

This work tackles learning and sampling from distributions on the unit simplex $\mathring{\Delta}^D$ by introducing Simplex-to-Euclidean Flow Matching, which maps the simplex interior to $\mathbb{R}^D$ using bijections derived from Aitchison geometry. It presents two concrete bijections, the isometric ILR transform and the centered SB transform, so that generative models can be trained in Euclidean space. Discrete observations, which lie on the simplex boundary, are lifted into the interior through a Dirichlet interpolation that guarantees exact category recovery via $\arg\max$ after inversion and comes with a practical estimator for categorical probabilities. Empirically, FM-$\mathring{\Delta}$ achieves competitive or superior results on compositional data, binarized MNIST, DNA sequence generation, and Text8 while remaining scalable. The method thus bridges continuous Euclidean generation and discrete simplex data in a geometry-aware way. The approach is flexible and could be extended to other Euclidean generative models beyond Flow Matching, with exact discrete recovery preserved by the interpolation scheme.

Abstract

We propose a method for learning and sampling from probability distributions supported on the simplex. Our approach maps the open simplex to Euclidean space via smooth bijections, leveraging the Aitchison geometry to define the mappings, and supports modeling categorical data by a Dirichlet interpolation that dequantizes discrete observations into continuous ones. This enables density modeling in Euclidean space through the bijection while still allowing exact recovery of the original discrete distribution. Compared to previous methods that operate on the simplex using Riemannian geometry or custom noise processes, our approach works in Euclidean space while respecting the Aitchison geometry, and achieves competitive performance on both synthetic and real-world data sets.
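The dequantization step can be illustrated with a minimal sketch. The exact interpolation used in the paper is not given in this summary, so the form below is an assumption: a one-hot vertex $e_k$ is mixed with Dirichlet noise $u$ as $x = \lambda e_k + (1-\lambda)u$. The function names, the `alpha` concentration parameter, and the default $\lambda$ are hypothetical. Under this assumed form, any $\lambda \ge \tfrac{1}{2}$ (the threshold quoted in the Figure 2 caption) keeps the true category as the largest coordinate, so $\arg\max$ recovers it.

```python
import numpy as np

def dirichlet_interpolate(labels, num_classes, lam=0.6, alpha=1.0, rng=None):
    """Lift integer labels into the simplex interior (assumed form, not the paper's code).

    x = lam * one_hot(label) + (1 - lam) * u,  with u ~ Dirichlet(alpha, ..., alpha).
    """
    rng = np.random.default_rng() if rng is None else rng
    labels = np.asarray(labels)
    one_hot = np.eye(num_classes)[labels]                      # shape (n, K)
    noise = rng.dirichlet(alpha * np.ones(num_classes), size=labels.shape[0])
    return lam * one_hot + (1.0 - lam) * noise                 # rows sum to 1

def recover_labels(x):
    """Recover the category as the largest coordinate of each simplex point."""
    return np.argmax(x, axis=-1)

rng = np.random.default_rng(0)
labels = rng.integers(0, 4, size=1000)
x = dirichlet_interpolate(labels, num_classes=4, lam=0.6, rng=rng)
recovered = recover_labels(x)
```

With $\lambda \ge \tfrac{1}{2}$ the true coordinate is at least $\lambda$ while every other coordinate is at most $1-\lambda$, which is why `recovered` matches `labels` in this sketch.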

Paper Structure

This paper contains 44 sections, 10 theorems, 57 equations, 7 figures, 4 tables, 2 algorithms.

Key Result

Theorem 1

Let $\langle\cdot,\cdot\rangle_A$ denote the Aitchison inner product on $\mathring \Delta^{D}$ and $\langle\cdot,\cdot\rangle_2$ the standard inner product on $\mathbb{R}^{D}$. For a Helmert matrix $\boldsymbol H\in\mathbb{R}^{D\times K}$, the ILR map preserves inner products, carrying $\langle\cdot,\cdot\rangle_A$ to $\langle\cdot,\cdot\rangle_2$; in particular, the ILR map is an isometry between $(\mathring\Delta^{D},\langle\cdot,\cdot\rangle_A)$ and $(\mathbb{R}^{D},\langle\cdot,\cdot\rangle_2)$.
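The isometry in Theorem 1 can be checked numerically with a small sketch. It rests on assumptions not confirmed by this summary: the Aitchison inner product is computed through the centered log-ratio (clr) as $\langle x,y\rangle_A = \mathrm{clr}(x)\cdot\mathrm{clr}(y)$, the simplex points have $K = D+1$ components, and `helmert_basis` builds one common Helmert-type contrast matrix whose normalization may differ from the paper's. All function names are hypothetical.

```python
import numpy as np

def helmert_basis(K):
    """(K-1) x K matrix with orthonormal rows, each orthogonal to the all-ones vector."""
    H = np.zeros((K - 1, K))
    for i in range(1, K):
        H[i - 1, :i] = 1.0            # i leading ones
        H[i - 1, i] = -float(i)       # balancing entry so the row sums to zero
        H[i - 1] /= np.sqrt(i * (i + 1))
    return H

def clr(x):
    """Centered log-ratio: log x minus its mean over components."""
    logx = np.log(x)
    return logx - logx.mean(axis=-1, keepdims=True)

def ilr(x, H):
    """Isometric log-ratio: project clr coordinates onto the rows of H."""
    return clr(x) @ H.T

def ilr_inverse(z, H):
    """Map Euclidean coordinates back to the simplex via a softmax of z @ H."""
    y = z @ H
    y = y - y.max(axis=-1, keepdims=True)   # stabilize the exponentials
    e = np.exp(y)
    return e / e.sum(axis=-1, keepdims=True)

def aitchison_inner(x, y):
    """Aitchison inner product written in clr coordinates."""
    return np.sum(clr(x) * clr(y), axis=-1)

rng = np.random.default_rng(0)
K = 5
H = helmert_basis(K)
x = rng.dirichlet(np.ones(K))
y = rng.dirichlet(np.ones(K))

lhs = aitchison_inner(x, y)          # inner product on the simplex
rhs = ilr(x, H) @ ilr(y, H)          # Euclidean inner product after the map
roundtrip = ilr_inverse(ilr(x, H), H)
```

In this sketch `lhs` and `rhs` agree up to floating-point error, and `roundtrip` reproduces `x`, which is the behavior expected of an isometric bijection.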

Figures (7)

  • Figure 1: We stochastically interpolate categorical observations (color) to distributions on the interior of a simplex (left). The resulting Dirichlet mixture is transformed to Euclidean space (right) with a bijection ${\varphi}$, enabling use of standard continuous generative models, like conditional flow matching. Discrete samples are obtained by composition of the inverse transformation ${\varphi}^{-1}$ and $\arg\max$ operation.
  • Figure 2: Dirichlet interpolation. The $\lambda$ parameter controls the mixture distribution. For $\lambda \ge \tfrac{1}{2}$ the supports do not overlap and we can recover the categories. A large $\lambda$ unnecessarily concentrates the mass near the simplex boundary, which we want to avoid.
  • Figure 3: Samples from Checkerboard on the simplex. Red points indicate samples not aligned with the true density ($\%$ indicated in caption). The zoomed area shows the top region $x_1\geq \tfrac{4}{5}$, emphasizing the differences.
  • Figure 4: Divergence between the ground truth and estimated categorical probabilities, for problems of varying number of categories.
  • Figure 5: Samples from BMNIST from the different methods. LinearFM draws samples of visually lower quality than the rest of the methods.
  • ...and 2 more figures

Theorems & Definitions (15)

  • Theorem 1: Isometry, Egozcue2003
  • Proposition 1
  • Proposition 2
  • Proposition 3
  • Proposition 4
  • proof
  • Proposition 5
  • proof
  • Proposition 6
  • proof
  • ...and 5 more