OneFlow: Concurrent Mixed-Modal and Interleaved Generation with Edit Flows
John Nguyen, Marton Havasi, Tariq Berrada, Luke Zettlemoyer, Ricky T. Q. Chen
TL;DR
OneFlow addresses the rigid causal ordering that autoregressive models impose on multimodal generation by unifying text and image synthesis in a single non-autoregressive framework that supports variable-length, interleaved outputs. It combines insertion-based Edit Flows for discrete text with Flow Matching for continuous image latents, which enables concurrent generation with a separate time schedule for each modality. Across model scales from 1B to 8B parameters, OneFlow outperforms autoregressive and diffusion baselines on generation and understanding tasks while using up to 50% fewer training FLOPs, and its hierarchical generation exhibits emergent reasoning-like behavior. The work introduces mixed-modal pretraining, interleaved generation, and classifier-free guidance as new capabilities, with practical relevance for scalable, unified vision-language systems.
Abstract
We present OneFlow, the first non-autoregressive multimodal model that enables variable-length and concurrent mixed-modal generation. Unlike autoregressive models that enforce rigid causal ordering between text and image generation, OneFlow combines an insertion-based Edit Flow for discrete text tokens with Flow Matching for image latents. OneFlow enables concurrent text-image synthesis with hierarchical sampling that prioritizes content over grammar. Through controlled experiments across model sizes from 1B to 8B, we demonstrate that OneFlow outperforms autoregressive baselines on both generation and understanding tasks while using up to 50% fewer training FLOPs. OneFlow surpasses both autoregressive and diffusion-based approaches while unlocking new capabilities for concurrent generation, iterative refinement, and natural reasoning-like generation.
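The two ingredients named in the abstract can be illustrated with a toy sketch. This is not the OneFlow implementation: it assumes the standard straight-line Flow Matching path between noise and data, uses an oracle velocity in place of a trained network, and models text edits as simple insert-only operations applied to several gaps at once. All function names here (interpolate, target_velocity, euler_sample, insert_tokens) are hypothetical and chosen only for illustration.

```python
import numpy as np

def interpolate(x0, x1, t):
    """Point on the straight-line path from noise x0 (t=0) to data x1 (t=1)."""
    return (1.0 - t) * x0 + t * x1

def target_velocity(x0, x1):
    """Regression target for the velocity field along the straight-line path."""
    return x1 - x0

def euler_sample(x0, velocity_fn, num_steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x

def insert_tokens(tokens, insertions):
    """Apply insert-only edits in parallel.

    `insertions` maps a gap index to a new token. Gap g sits just before
    tokens[g]; gap len(tokens) is the end of the sequence.
    """
    out = []
    for g in range(len(tokens) + 1):
        if g in insertions:
            out.append(insertions[g])
        if g < len(tokens):
            out.append(tokens[g])
    return out

# Continuous side: move a toy "image latent" from noise to data.
rng = np.random.default_rng(0)
noise = rng.standard_normal(4)
latent = rng.standard_normal(4)

# Assumption: an oracle velocity stands in for a trained network.
oracle = lambda x, t: target_velocity(noise, latent)
sample = euler_sample(noise, oracle)

# Discrete side: the sequence grows by insertions at several gaps in one step,
# so its final length is not fixed in advance.
text = insert_tokens(["a", "dog"], {1: "brown", 2: "runs"})
```

With the oracle velocity, the Euler integration lands on the data latent, and the text sequence grows from two tokens to four in a single parallel edit. In a real model both updates would be predicted by a learned network and advanced on their own time schedules.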
