Simplex-to-Euclidean Bijections for Categorical Flow Matching
Bernardo Williams, Victor M. Yeom-Song, Marcelo Hartmann, Arto Klami
TL;DR
This work addresses learning and sampling from distributions on the open simplex $\mathring{\Delta}^D$ by introducing Simplex-to-Euclidean Flow Matching (FM-$\mathring{\Delta}$), which maps the simplex interior to $\mathbb{R}^D$ through bijections derived from Aitchison geometry. It presents two concrete bijections, the isometric log-ratio (ILR) transform and a centered stick-breaking (SB) transform, so that generative models can be trained in Euclidean space. Discrete observations, which lie on the boundary of the simplex, are lifted into the interior by a Dirichlet interpolation; after inverting the bijection, the original category is recovered exactly via $\arg\max$, and a practical estimator gives the categorical probabilities. Empirically, FM-$\mathring{\Delta}$ achieves competitive or superior results on compositional data, binarized MNIST, DNA sequence generation, and Text8 while remaining scalable. The approach is not tied to Flow Matching: it could be extended to other Euclidean generative models, and the interpolation scheme would still guarantee exact discrete recovery.
Abstract
We propose a method for learning and sampling from probability distributions supported on the simplex. Our approach maps the open simplex to Euclidean space via smooth bijections, leveraging the Aitchison geometry to define the mappings, and supports modeling categorical data by a Dirichlet interpolation that dequantizes discrete observations into continuous ones. This enables density modeling in Euclidean space through the bijection while still allowing exact recovery of the original discrete distribution. Compared to previous methods that operate on the simplex using Riemannian geometry or custom noise processes, our approach works in Euclidean space while respecting the Aitchison geometry, and achieves competitive performance on both synthetic and real-world data sets.
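To make the simplex-to-Euclidean idea concrete, the following is a minimal sketch of the standard ILR transform and its inverse, using a Helmert-type orthonormal basis. The basis choice, the function names, and the `dequantize` helper are assumptions made for illustration. In particular, `dequantize` is a hypothetical stand-in for the Dirichlet interpolation, not the authors' scheme: it mixes a one-hot vertex with a Dirichlet draw using a weight `t < 0.5`, which is enough to keep the true category as the largest component.

```python
import math
import random


def ilr(x):
    """Map a point of the open simplex (D + 1 positive parts) to R^D.

    Uses a Helmert-type orthonormal basis (an illustrative choice).
    """
    logs = [math.log(v) for v in x]
    z = []
    running = 0.0  # running sum log x_1 + ... + log x_i
    for i in range(1, len(x)):
        running += logs[i - 1]
        z.append(math.sqrt(i / (i + 1)) * (running / i - logs[i]))
    return z


def ilr_inverse(z):
    """Map a point of R^D back to the open simplex with D + 1 parts."""
    n = len(z) + 1
    y = [0.0] * n  # centered log-ratio coordinates, y = V z
    for i, zi in enumerate(z, start=1):
        scale = math.sqrt(i / (i + 1))
        for j in range(i):
            y[j] += zi * scale / i
        y[i] -= zi * scale
    m = max(y)  # subtract the max for a numerically stable softmax
    e = [math.exp(v - m) for v in y]
    total = sum(e)
    return [v / total for v in e]


def dequantize(category, n, t=0.25, alpha=2.0, rng=random):
    """Hypothetical lift of a one-hot category into the simplex interior.

    Assumption: x = (1 - t) * one_hot + t * u with u ~ Dirichlet(alpha).
    For t < 0.5 the true category stays the argmax.
    """
    g = [rng.gammavariate(alpha, 1.0) for _ in range(n)]
    s = sum(g)
    return [
        (1 - t) * (1.0 if j == category else 0.0) + t * gj / s
        for j, gj in enumerate(g)
    ]


def argmax(x):
    return max(range(len(x)), key=lambda j: x[j])
```

Under these assumptions, a category is lifted with `dequantize`, sent to Euclidean space with `ilr`, and mapped back with `ilr_inverse`; taking `argmax` of the result returns the original category. Because the basis is orthonormal, the Euclidean norm of `ilr(x)` equals the Aitchison norm of `x`, which is the sense in which the ILR transform is isometric.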
