Edit Flows: Flow Matching with Edit Operations

Marton Havasi, Brian Karrer, Itai Gat, Ricky T. Q. Chen

TL;DR

This work tackles non-autoregressive sequence generation, where variable-length outputs and flexible alignment are hard to support. It introduces Edit Flows, a framework based on continuous-time Markov chains (CTMCs) that generates sequences through discrete edit operations (insertions, deletions, and substitutions) applied at relative positions to sequences of varying length. Training uses an auxiliary alignment process and a Flow Matching objective based on a Bregman divergence to learn the edit-rate model. Empirically, Edit Flows outperform both autoregressive and mask models on image captioning and significantly outperform the mask construction on text and code generation.

Abstract

Autoregressive generative models naturally generate variable-length sequences, while non-autoregressive models struggle, often imposing rigid, token-wise structures. We propose Edit Flows, a non-autoregressive model that overcomes these limitations by defining a discrete flow over sequences through edit operations: insertions, deletions, and substitutions. By modeling these operations within a Continuous-time Markov Chain over the sequence space, Edit Flows enable flexible, position-relative generation that aligns more closely with the structure of sequence data. Our training method leverages an expanded state space with auxiliary variables, making the learning process efficient and tractable. Empirical results show that Edit Flows outperform both autoregressive and mask models on image captioning and significantly outperform the mask construction in text and code generation.

Paper Structure

This paper contains 34 sections, 4 theorems, 49 equations, 14 figures, 6 tables.

Key Result

Theorem 3.1

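Let $u_t(x, z | x_t, z_t)$ be a rate over the augmented space $\mathcal{X} \times \mathcal{Z}$ that generates $p_t(x, z)$. Then the rate over $\mathcal{X}$ obtained by summing over $z$ and averaging over $z_t$ given $x_t$ generates the marginal $p_t(x)$. Furthermore, for any Bregman divergence $D_\phi(a, b) = \phi(a) - \phi(b) - \langle a - b, \tfrac{\mathrm{d}}{\mathrm{d} b} \phi (b) \rangle$ defined by a convex function $\phi$, regressing the model rate onto the augmented-space rate gives the same gradient with respect to the model parameters as regressing it onto the marginal rate.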
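
As a worked instance of the Bregman divergence in Theorem 3.1, take the scalar convex function $\phi(a) = a \log a$ with $a, b > 0$. This choice is an illustrative assumption; the theorem is stated for any convex $\phi$. Since $\tfrac{\mathrm{d}}{\mathrm{d} b} \phi(b) = \log b + 1$, substituting into the definition gives

$$
\begin{aligned}
D_\phi(a, b) &= a \log a - b \log b - (a - b)(\log b + 1) \\
&= a \log \frac{a}{b} - a + b ,
\end{aligned}
$$

which is nonnegative and equals zero exactly when $a = b$. Viewed as a function of $b$ alone, it is $b - a \log b$ plus a constant, that is, a term linear in the predicted rate minus a weighted logarithm of that rate.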

Figures (14)

  • Figure 1: Edit Flow sampling process. Starting with $x_0$ containing random tokens or an empty sequence, the model applies edits to $x_t$ and reaches a cohesive sentence at time $t=1$.
  • Figure 2: Edit Flow model inputs and outputs. Given $x_t$, the model predicts the rate of each possible edit.
  • Figure 3: Computing the loss starts with the two aligned sequences $z_0$ and $z_1$. Locations where $z_0^i={\color{gray}\varepsilon}$ require an insertion operation, locations where $z_1^i={\color{gray}\varepsilon}$ require a deletion and locations where $z_0^i\neq z_1^i$ require a substitution. $z_t$ is sampled by applying a subset of the operations to $z_0$ depending on the scheduler. Then, $x_t$ is obtained by removing all ${\color{gray}\varepsilon}$ tokens from $z_t$. The Monte-Carlo estimate of the loss contains the model output $u_t^\theta(x | x_t)$ in two terms: the negated sum of all the edit rates and the logarithms of the remaining edits between $z_t$ and $z_1$.
  • Figure 4: Edit Flow generation examples with $X_0=\emptyset$ (i.e. insert-only model). The tokens are color coded to denote the timestep that they were generated in. Left: Coding model conditioned on the function signature. Right: Image captioning model conditioned on the image.
  • Figure 5: Example input images and the stochastic sequential generation of captions from an Edit Flows model.
  • ...and 9 more figures
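
The construction described in the Figure 3 caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `EPS`, `classify_edits`, `sample_zt`, and `strip_blanks` are hypothetical, the blank token $\varepsilon$ is represented by `None`, and the linear scheduler `kappa(t) = t` is an assumption chosen only for simplicity.

```python
import random

EPS = None  # hypothetical sentinel standing in for the blank token ε


def classify_edits(z0, z1):
    """Label each aligned position with the edit that turns z0 into z1."""
    assert len(z0) == len(z1), "aligned sequences must have equal length"
    ops = []
    for a, b in zip(z0, z1):
        if a == b:
            ops.append("keep")
        elif a is EPS:
            ops.append("insert")      # blank in z0, token in z1
        elif b is EPS:
            ops.append("delete")      # token in z0, blank in z1
        else:
            ops.append("substitute")  # two different tokens
    return ops


def sample_zt(z0, z1, t, rng, kappa=lambda t: t):
    """Apply each pending edit independently with probability kappa(t).

    kappa is an assumed scheduler with kappa(0) = 0 and kappa(1) = 1.
    """
    k = kappa(t)
    return [b if rng.random() < k else a for a, b in zip(z0, z1)]


def strip_blanks(z):
    """Recover x_t from z_t by removing every blank token."""
    return [tok for tok in z if tok is not EPS]


if __name__ == "__main__":
    z0 = ["a", EPS, "c", "d"]
    z1 = ["a", "b", EPS, "e"]
    rng = random.Random(0)
    zt = sample_zt(z0, z1, 0.5, rng)
    # Positions where z_t still differs from z_1 are the remaining edits.
    remaining = [i for i, (a, b) in enumerate(zip(zt, z1)) if a != b]
    print(classify_edits(z0, z1), strip_blanks(zt), remaining)
```

At $t=0$ the sketch returns $z_0$ unchanged and at $t=1$ it returns $z_1$, so `strip_blanks` yields the source and target sequences at the two endpoints. The positions collected in `remaining` correspond to the edits between $z_t$ and $z_1$ whose log-rates enter the loss estimate described in the caption.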

Theorems & Definitions (7)

  • Theorem 3.1: Flow Matching with Auxiliary Processes
  • Theorem B.1: Flow Matching with Auxiliary Processes
  • proof
  • Lemma B.0: Rates that generate $p_t(x, z) = p(x | z)p_t(z)$
  • proof
  • Lemma B.0: Rates that generate $p_t(x, z) = \delta_{f(z)}(x)p_t(z)$
  • proof