KL-geodesics flow matching with a novel sampling scheme

Egor Sevriugov, Ivan Oseledets

TL;DR

This work introduces KL-geodesic flow matching (KL-flow) for discrete sequence modeling in non-autoregressive text generation. It provides a theoretical result linking the exact conditional likelihood maximizer $P_\theta(x_1|x_t,t)$ to the flow-matching velocity under logit-space interpolation, and proposes an empirical sampling scheme plus a hybrid inference method to boost performance. Across unconditional generation, conditional generation, and code infilling tasks, KL-flow variants consistently outperform prior discrete flow matching and autoregressive baselines, achieving state-of-the-art results on several benchmarks. The approach offers a scalable, geometry-aware alternative for discrete sequence modeling with broad practical impact for NLP and code tasks.

Abstract

Non-autoregressive language models generate all tokens simultaneously, offering potential speed advantages over traditional autoregressive models, but they face challenges in modeling the complex dependencies inherent in text data. In this work, we investigate a conditional flow matching approach for text generation. We represent tokens as one-hot vectors in a \(V\)-dimensional simplex and utilize geodesics under the Kullback-Leibler (KL) divergence, which correspond to linear interpolation in logit space. We provide a theoretical justification that maximizing the conditional likelihood \(P_\theta(x_1 \mid x_t, t)\) yields the exact flow matching velocity under logit interpolation. To address the suboptimal performance of basic inference, we propose a novel empirical sampling scheme that iteratively samples from the conditional distribution and introduces additional noise, significantly improving results despite lacking full theoretical underpinnings. Furthermore, we propose a hybrid inference method that combines the basic approach with the sampling scheme. This method demonstrates superior performance on both conditional and unconditional text generation experiments compared to the previous SOTA method for discrete flow matching.
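As a rough illustration of what "linear interpolation in logit space" means, the following minimal sketch interpolates the logarithms of two points on the simplex and renormalizes with a softmax. The function names, the Dirichlet source distribution, and the `eps` smoothing of the one-hot targets (needed so their logarithms stay finite) are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def smooth_one_hot(token_ids, vocab_size, eps=1e-4):
    """One-hot rows with eps mass spread over the other entries.

    The smoothing is an assumption of this sketch: it keeps log(x) finite.
    """
    x = np.full((len(token_ids), vocab_size), eps / (vocab_size - 1))
    x[np.arange(len(token_ids)), token_ids] = 1.0 - eps
    return x

def kl_geodesic(x0, x1, t):
    """Point at time t on the path that is linear in log-space between x0 and x1."""
    logits = (1.0 - t) * np.log(x0) + t * np.log(x1)
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=-1, keepdims=True)  # project back onto the simplex

# Toy usage: 3 token positions, vocabulary of 5 symbols.
rng = np.random.default_rng(0)
x0 = rng.dirichlet(np.ones(5), size=3)                  # assumed source p_0
x1 = smooth_one_hot(np.array([0, 2, 4]), vocab_size=5)  # smoothed targets
x_half = kl_geodesic(x0, x1, 0.5)                       # midpoint of the path
```

The endpoints are recovered at `t = 0` and `t = 1`, and every intermediate point remains a valid probability vector.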

Paper Structure

This paper contains 23 sections, 3 theorems, 27 equations, 7 figures, 4 tables, 2 algorithms.

Key Result

Proposition 3.2

The exact minimizer of the loss functional eq:logit_cfm is given by:

Figures (7)

  • Figure 1: Overview of the Proposed Approach (illustrated using a two-dimensional simplex for simplicity). Training Phase: Initial points $x_0 \sim p_0$ and target points $x_1 \sim p_1$ are sampled. An intermediate point $x_t$ is obtained by interpolating between $x_0$ and $x_1$. The point $x_t$ is passed through a Denoiser network to compute the conditional log-probability $\log p(x_1 | x_t)$. The network is trained by maximizing this log-probability. Basic Inference: Standard inference involves solving an ordinary differential equation (ODE) defined by the vector field $v(x_t, t)$. In the denoising context, this vector field is equal to the expectation of the conditional vector field (indicated by dotted arrows) over $x_1 \sim p(x_1 | x_t)$: $\mathbb{E}_{x_1} [v(x_t, t\,|\,x_1)]$. The ODE is solved numerically using the Euler method with a step size $h = 1/N$ over $N$ iterations. Sampling Inference: Alternatively, instead of performing an Euler step, we interpolate between a newly sampled point $x_0 \sim p_0$ and a target point $x_1 \sim p(x_1 | x_t)$ at the next time step $t + h$.
  • Figure 2: Comparison of the performance of DFM and LFM models on the code infilling task under general configurations, with evaluations conducted on the HumanEval dataset.
  • Figure 3: Quantitative assessment of how the choice of the $k$ parameter for top-k sampling from the conditional probability $p(x_1 | x_t)$ affects the quality of the generated texts, measured by perplexity, and their variability, measured by entropy. For reference, the graphs include numerical estimates of the quality and variability of real texts, represented by a horizontal "data" line.
  • Figure 4: Comparison of the impact of learning rate values on training a GPT-like model for the Flow Matching problem. The base implementation utilizes the Muon optimizer for certain model parameters, while the tag "no Muon optimizer" indicates that the Muon optimizer has been replaced with the Adam optimizer.
  • Figure 5: Comparison of various strategies for time insertion within model architecture.
  • ...and 2 more figures
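
The two inference modes described in the Figure 1 caption can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: `toy_denoiser` is a hypothetical stand-in for the trained network, $p_0$ is assumed to be a flat Dirichlet, one-hot targets are smoothed by `EPS` so that their logarithms are finite, and the Euler step uses the velocity of a path that is linear in logits, $(\ell_1 - \ell_t)/(1 - t)$, averaged over $p(x_1 | x_t)$.

```python
import numpy as np

EPS = 1e-4  # assumed smoothing so that log(one-hot) stays finite

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def one_hot_logits(vocab_size):
    """Row k holds the log of a smoothed one-hot vector for token k."""
    x = np.full((vocab_size, vocab_size), EPS / (vocab_size - 1))
    np.fill_diagonal(x, 1.0 - EPS)
    return np.log(x)

def toy_denoiser(x_t, t):
    """Hypothetical stand-in for p(x_1 | x_t): sharpens x_t as t grows."""
    return softmax((1.0 + 4.0 * t) * np.log(np.clip(x_t, 1e-12, None)))

def basic_inference(x0, denoiser, n_steps):
    """Euler integration in logit space with step size h = 1/N."""
    h = 1.0 / n_steps
    targets = one_hot_logits(x0.shape[-1])
    logits = np.log(np.clip(x0, 1e-12, None))
    for n in range(n_steps):
        t = n * h
        probs = denoiser(softmax(logits), t)       # p(x_1 | x_t)
        expected_target = probs @ targets          # E over x_1 of its logits
        velocity = (expected_target - logits) / (1.0 - t)
        logits = logits + h * velocity             # Euler step
    return softmax(logits)

def sampling_inference(x0, denoiser, n_steps, rng):
    """Resample x_1 and a fresh x_0, then interpolate at time t + h."""
    h = 1.0 / n_steps
    seq_len, vocab_size = x0.shape
    targets = one_hot_logits(vocab_size)
    x_t = x0
    for n in range(n_steps):
        t = n * h
        probs = denoiser(x_t, t)
        tokens = np.array([rng.choice(vocab_size, p=p) for p in probs])
        fresh_x0 = rng.dirichlet(np.ones(vocab_size), size=seq_len)  # assumed p_0
        log_x0 = np.log(np.clip(fresh_x0, 1e-12, None))
        s = t + h                                  # next time step
        x_t = softmax((1.0 - s) * log_x0 + s * targets[tokens])
    return x_t

# Toy usage: 4 token positions, vocabulary of 6 symbols, N = 10 steps.
rng = np.random.default_rng(0)
x0 = rng.dirichlet(np.ones(6), size=4)
euler_out = basic_inference(x0, toy_denoiser, n_steps=10)
sampled_out = sampling_inference(x0, toy_denoiser, n_steps=10, rng=rng)
```

In this sketch the sampling variant injects fresh noise at every step through the newly drawn $x_0$, whereas the Euler variant follows a deterministic trajectory from the initial point.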

Theorems & Definitions (6)

  • Definition 3.1
  • Proposition 3.2
  • proof
  • Corollary 3.3
  • Proposition 3.4
  • proof