OneFlow: Concurrent Mixed-Modal and Interleaved Generation with Edit Flows

John Nguyen, Marton Havasi, Tariq Berrada, Luke Zettlemoyer, Ricky T. Q. Chen

TL;DR

OneFlow tackles the rigid, autoregressive bottleneck in multimodal generation by unifying text and image synthesis in a non-autoregressive framework that supports variable-length, interleaved outputs. It combines insertion-based Edit Flows for discrete text with Flow Matching for continuous image latents, enabling concurrent generation and per-modality time schedules. Across 1B–8B parameter scales, OneFlow outperforms autoregressive and diffusion baselines on generation and understanding tasks while reducing training FLOPs by up to 50%, and it demonstrates emergent reasoning-like capabilities through hierarchical generation. The work introduces mixed-modal pretraining, interleaved generation, and classifier-free guidance as new capabilities, with practical impact for scalable, unified vision-language systems.

Abstract

We present OneFlow, the first non-autoregressive multimodal model that enables variable-length and concurrent mixed-modal generation. Unlike autoregressive models that enforce rigid causal ordering between text and image generation, OneFlow combines an insertion-based Edit Flow for discrete text tokens with Flow Matching for image latents. OneFlow enables concurrent text-image synthesis with hierarchical sampling that prioritizes content over grammar. Through controlled experiments across model sizes from 1B to 8B, we demonstrate that OneFlow outperforms autoregressive baselines on both generation and understanding tasks while using up to 50% fewer training FLOPs. OneFlow surpasses both autoregressive and diffusion-based approaches while unlocking new capabilities for concurrent generation, iterative refinement, and natural reasoning-like generation.
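The two ingredients named in the abstract can be illustrated with a toy sketch. It is not OneFlow's implementation: the straight-line probability path, the Euler sampler, and the helper names (`fm_interpolate`, `fm_target_velocity`, `euler_sample`, `apply_insertions`) are assumptions chosen for illustration, and the "model" is replaced by an oracle velocity and hand-written insertions.

```python
def fm_interpolate(x0, x1, t):
    """Point at time t in [0, 1] on the straight line from noise x0 to data x1.
    (Assumed path; the source does not state which path OneFlow uses.)"""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def fm_target_velocity(x0, x1):
    """Velocity regression target for the straight-line path: d/dt x_t = x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def euler_sample(x0, velocity_fn, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def apply_insertions(seq, insertions):
    """Apply one step of concurrent insertions to a token sequence.

    insertions is a list of (gap, token) pairs, where gap g in 0..len(seq)
    means "insert before seq[g]" in the ORIGINAL sequence. This sketch
    assumes at most one token per gap per step.
    """
    out = list(seq)
    # Go right to left so earlier gap indices are not shifted by later inserts.
    for gap, tok in sorted(insertions, key=lambda p: p[0], reverse=True):
        out.insert(gap, tok)
    return out


# Continuous side: an oracle velocity transports the "latent" x0 to x1.
x0, x1 = [0.0, 0.0], [1.0, -2.0]
latent = euler_sample(x0, lambda x, t: fm_target_velocity(x0, x1))

# Discrete side: the sequence grows by insertions only, content words first.
step1 = apply_insertions([], [(0, "cat")])
step2 = apply_insertions(step1, [(0, "The"), (1, "sleeps")])
```

In this toy run the text grows from an empty sequence to `["The", "cat", "sleeps"]` over two insertion steps while the latent is moved along the path in parallel, which is the sense of "concurrent" generation the abstract describes. A trained model would predict both the velocity and the insertions instead of the hard-coded values used here.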

Paper Structure

This paper contains 51 sections, 26 equations, 25 figures, 5 tables, 3 algorithms.

Figures (25)

  • Figure 1: OneFlow is a variable-length non-autoregressive model that can concurrently generate interleaved text and a variable number of images, using insertions as a primitive operation.
  • Figure 2: Text-to-image generation. Generated images at 512$\times$512 resolution from OneFlow. Prompts are in Figure \ref{fig:generated_images_with_prompts}.
  • Figure 3: Visual question answering. OneFlow generates by simply inserting tokens based on confidence, resulting in a natural hierarchical sampling and implicit reasoning where the most difficult answer tokens are generated later.
  • Figure 4: Concurrent interleaved text & image generation. OneFlow can insert a variable number of images into the generated sequence, which are denoised concurrently with the text. This allows the text and images to depend on each other during generation.
  • Figure 5: Performance of OneFlow vs. AR+FM baseline models at different scales of model size, data, and compute. For text-to-image generation, we report DPG-Bench and FID. For image-to-text caption quality, we report CIDEr and ROUGE. On every benchmark, OneFlow consistently exhibits better scaling behavior than AR+FM. Model sizes include 1B, 3B, and 8B.
  • ...and 20 more figures