OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows

Shufan Li; Konstantinos Kallidromitis; Akash Gokul; Zichun Liao; Yusuke Kato; Kazuki Kozuka; Aditya Grover

TL;DR

OmniFlow introduces a modular, multi-modal rectified-flow framework that enables any-to-any generation across text, image, and audio by coupling modality-specific streams through joint attention. A novel multi-modal rectified flow with a flexible guidance mechanism allows controllable cross-modal alignment, while a modular architecture enables pretraining of individual modalities and later merging for fine-tuning. Empirical results show OmniFlow achieves strong cross-modal performance, outperforming prior generalist baselines on text-to-image and text-to-audio tasks with reduced data and compute, and offering meaningful insights into design choices for multi-modal diffusion models. The work advances practical multi-modal diffusion by enabling flexible conditioning, efficient training, and unified modeling across modalities, with code release to support reproducibility.

Abstract

We introduce OmniFlow, a novel generative model designed for any-to-any generation tasks such as text-to-image, text-to-audio, and audio-to-image synthesis. OmniFlow advances the rectified flow (RF) framework used in text-to-image models to handle the joint distribution of multiple modalities. It outperforms previous any-to-any models on a wide range of tasks, such as text-to-image and text-to-audio synthesis. Our work offers three key contributions: First, we extend RF to a multi-modal setting and introduce a novel guidance mechanism, enabling users to flexibly control the alignment between different modalities in the generated outputs. Second, we propose a novel architecture that extends the text-to-image MMDiT architecture of Stable Diffusion 3 and enables audio and text generation. The extended modules can be efficiently pretrained individually and merged with the vanilla text-to-image MMDiT for fine-tuning. Third, we conduct a comprehensive study on the design choices of rectified flow transformers for large-scale audio and text generation, providing valuable insights into optimizing performance across diverse modalities. The code will be available at https://github.com/jacklishufan/OmniFlows.
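
As a rough illustration of what extending rectified flow to several modalities can look like, the sketch below builds the standard straight-line RF path x_t = (1 - t) x_0 + t ε and its velocity target ε - x_0 for each modality separately. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the toy latent vectors, and the choice to draw an independent timestep per modality are all hypothetical and do not come from the abstract.

```python
import random


def rf_interpolate(x0, noise, t):
    """Point on the straight line from data x0 (t=0) to noise (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, noise)]


def rf_velocity_target(x0, noise):
    """Velocity of the straight path, d x_t / dt = noise - x0 (constant in t)."""
    return [b - a for a, b in zip(x0, noise)]


def sample_multimodal_pair(latents, rng):
    """Build one training pair per modality.

    latents: dict mapping a modality name to a list of floats (toy latents).
    Assumption of this sketch: every modality draws its own timestep, so one
    modality can be nearly clean while another is nearly pure noise.
    """
    noisy, targets, times = {}, {}, {}
    for name, x0 in latents.items():
        t = rng.random()  # hypothetical uniform timestep in [0, 1)
        noise = [rng.gauss(0.0, 1.0) for _ in x0]
        noisy[name] = rf_interpolate(x0, noise, t)
        targets[name] = rf_velocity_target(x0, noise)
        times[name] = t
    return noisy, targets, times


rng = random.Random(0)
latents = {"image": [0.5, -1.0], "text": [0.1, 0.2, 0.3], "audio": [1.5]}
noisy, targets, times = sample_multimodal_pair(latents, rng)
```

In this toy setup, a model would be trained to predict `targets` from `noisy` and `times`. A modality held near t = 0 stays close to its data and could play the role of a conditioning input, while a modality near t = 1 is almost pure noise and would be the one being generated; whether OmniFlow schedules timesteps this way is not stated in the abstract.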
