Minibatch Optimal Transport and Perplexity Bound Estimation in Discrete Flow Matching
Etrit Haxholli, Yeti Z. Gürbüz, Oğul Can, Eli Waxman
TL;DR
The paper addresses flow-based modeling of discrete, categorical data, where the stochasticity of discrete paths rules out both rectification and exact likelihood estimation. It introduces a dynamic-optimal-transport-like objective for discrete flows with convex interpolants and derives its Kantorovich formulation, yielding a categorical Benamou–Brenier-type theorem in which the transport cost is defined by inter-state similarity. Two upper bounds on perplexity are derived, one KL-based and one entropy-based, which generalize prior discrete-diffusion bounds and enable principled training, evaluation, and model comparison. The authors also present Multimask Flows and show that minibatch OT reduces the required inference steps by up to 8x on GPT-2–sized models while preserving diversity. Experiments on small proofs of concept and at OpenWebText scale show substantial reductions in the number of jumps and competitive perplexity across settings, supporting the framework and bounds as practical tools for discrete-flow modeling and for comparison with autoregressive and discrete-diffusion baselines.
Abstract
Discrete flow matching, a recent framework for modeling categorical data, has shown performance competitive with autoregressive models. However, unlike in continuous flow matching, the rectification strategy cannot be applied due to the stochasticity of discrete paths, necessitating alternative methods to minimize state transitions. We propose a dynamic-optimal-transport-like minimization objective and derive its Kantorovich formulation for discrete flows with convex interpolants, where the transport cost depends solely on inter-state similarity and can be optimized via minibatch strategies. In the case of bag-of-words (BoW) sourced flows, we show that such methods can reduce the number of transitions by up to 8 times (from 1024 to 128) to reach the same generative perplexity without compromising diversity. Additionally, path nondeterminism in discrete flows precludes an instantaneous change-of-variables analogue, preventing the precise probability estimation available to continuous flows. We therefore propose two upper bounds on perplexity, enabling principled training, evaluation, and model comparison. Finally, we introduce Multimask Flows, which outperform masked flows in generative perplexity, particularly when utilizing minibatch Optimal Transport, without sacrificing diversity.
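
To make the minibatch idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes the inter-state cost is the Hamming distance between token sequences (the abstract only says the cost depends on inter-state similarity) and solves the assignment within a batch by brute force, which is only feasible for tiny batches; a Hungarian-style solver would be the usual choice at realistic batch sizes. The names `hamming` and `minibatch_ot_pairing` and the toy sequences are hypothetical.

```python
from itertools import permutations


def hamming(x, y):
    """Number of positions at which two equal-length token sequences differ."""
    return sum(a != b for a, b in zip(x, y))


def minibatch_ot_pairing(sources, targets):
    """Pair each source with one target so that the total Hamming cost is minimal.

    Assumption: Hamming distance stands in for the inter-state similarity cost.
    Brute force over all permutations, so only suitable for very small batches.
    Returns (perm, cost), where sources[i] is paired with targets[perm[i]].
    """
    n = len(sources)
    cost = [[hamming(s, t) for t in targets] for s in sources]
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_cost:
            best_perm, best_cost = perm, total
    return list(best_perm), best_cost


# Toy batch: three source sequences and three target sequences of length 4.
sources = [(1, 2, 3, 4), (5, 5, 7, 7), (0, 0, 0, 1)]
targets = [(5, 5, 7, 8), (0, 0, 0, 0), (1, 2, 3, 9)]

pairing, total_cost = minibatch_ot_pairing(sources, targets)
# Cost of the default pairing, i.e. sources[i] with targets[i].
identity_cost = sum(hamming(s, t) for s, t in zip(sources, targets))
```

In this toy batch the re-paired coupling needs far fewer token changes than the default pairing, which is the intuition behind using minibatch OT to reduce the number of state transitions along a discrete path.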
