OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows

Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Zichun Liao, Yusuke Kato, Kazuki Kozuka, Aditya Grover

TL;DR

OmniFlow introduces a modular, multi-modal rectified-flow framework that enables any-to-any generation across text, image, and audio by coupling modality-specific streams through joint attention. A novel multi-modal rectified flow with a flexible guidance mechanism allows controllable cross-modal alignment, while a modular architecture enables pretraining of individual modalities and later merging for fine-tuning. Empirical results show OmniFlow achieves strong cross-modal performance, outperforming prior generalist baselines on text-to-image and text-to-audio tasks with reduced data and compute, and offering meaningful insights into design choices for multi-modal diffusion models. The work advances practical multi-modal diffusion by enabling flexible conditioning, efficient training, and unified modeling across modalities, with code release to support reproducibility.

Abstract

We introduce OmniFlow, a novel generative model designed for any-to-any generation tasks such as text-to-image, text-to-audio, and audio-to-image synthesis. OmniFlow advances the rectified flow (RF) framework used in text-to-image models to handle the joint distribution of multiple modalities. It outperforms previous any-to-any models on a wide range of tasks, such as text-to-image and text-to-audio synthesis. Our work offers three key contributions. First, we extend RF to a multi-modal setting and introduce a novel guidance mechanism, enabling users to flexibly control the alignment between different modalities in the generated outputs. Second, we propose a novel architecture that extends the text-to-image MMDiT architecture of Stable Diffusion 3 and enables audio and text generation. The extended modules can be efficiently pretrained individually and merged with the vanilla text-to-image MMDiT for fine-tuning. Lastly, we conduct a comprehensive study on the design choices of rectified flow transformers for large-scale audio and text generation, providing valuable insights into optimizing performance across diverse modalities. The code will be available at https://github.com/jacklishufan/OmniFlows.
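
As a rough illustration of the rectified-flow idea the abstract builds on, the sketch below forms the straight-line path between a data sample and Gaussian noise, together with the constant velocity a model would be trained to regress along that path. The time convention (t = 0 is data, t = 1 is noise, as commonly used for Stable Diffusion 3) and the function names `rf_interpolate` and `rf_velocity_target` are assumptions made for illustration; they are not taken from the OmniFlow code.

```python
import random


def rf_interpolate(x0, noise, t):
    """Point on the straight path from data x0 (t=0) to noise (t=1)."""
    return [(1.0 - t) * a + t * n for a, n in zip(x0, noise)]


def rf_velocity_target(x0, noise):
    """Constant velocity along the path: d x_t / d t = noise - x0."""
    return [n - a for a, n in zip(x0, noise)]


# Toy example with a 3-dimensional "latent" (values are made up).
x0 = [0.5, -1.0, 2.0]
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in x0]

x_t = rf_interpolate(x0, noise, t=0.25)  # partially noised sample
v = rf_velocity_target(x0, noise)        # regression target for the model
```

Because the path is a straight line, moving from `x_t` along `v` for the remaining time `1 - t` lands on the noise sample, and moving backwards for time `t` recovers the data. In a multi-modal setting one would keep a separate path of this kind per modality; how OmniFlow couples them is described in the paper itself.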

Paper Structure

This paper contains 44 sections, 10 equations, 14 figures, 9 tables, 1 algorithm.

Figures (14)

  • Figure 1: OmniFlow is capable of a diverse range of any-to-any generation tasks. OmniFlow supports generating any output modality from any input modality, such as text-to-image, text-to-audio, and audio-to-image generation. It also supports tasks with multiple input modalities, such as text+audio-to-image.
  • Figure 2: Pipeline of OmniFlow. Previous any-to-any models such as CoDi [codi] (Top) concatenate multiple modality-specific encoders and decoders, and naively average the embeddings of multiple modalities to achieve joint conditioning. By contrast, OmniFlow (Bottom) is a unified, modular multi-modal model, where features from different modalities directly interact with each other through joint attention layers. OmniFlow is inspired by the modular design of Stable Diffusion 3 [esser2024scaling] (Middle), a text-to-image model.
  • Figure 3: Architecture of OmniFlow. Left: We highlight the architecture of OmniFlow. Right: We show the design of an individual Omni-Transformer Block.
  • Figure 4: Effect of CFG and Shift for audio and text generation. We evaluate the impact of guidance and timestep shift on text-to-audio and audio-to-text tasks.
  • Figure 5: Effect of Multi-Modal Guidance. In this example, the user can independently control how closely the output text aligns with the input image and with the input audio by varying $\alpha_{\text{im}}$ and $\alpha_{\text{au}}$. A higher $\alpha_{\text{im}}$ makes the output text resemble an image caption, with visual descriptions such as "lined up" and "driving down". A higher $\alpha_{\text{au}}$ makes the output text resemble an audio caption, with descriptions such as "accelerating" and "revving".
  • ...and 9 more figures
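
The two knobs mentioned in the captions for Figures 4 and 5 can be sketched in a few lines. The code below is a minimal illustration under stated assumptions, not the authors' implementation: `shift_timestep` assumes that "shift" refers to the timestep-shift formula commonly used with Stable Diffusion 3, and `multimodal_guidance` assumes a classifier-free-guidance-style combination in which each condition gets its own weight. The paper's exact guidance formula may differ, and all function and argument names here are hypothetical.

```python
def shift_timestep(t, shift):
    """Assumed SD3-style timestep shift; shift=1 leaves t unchanged."""
    return shift * t / (1.0 + (shift - 1.0) * t)


def multimodal_guidance(v_uncond, v_image, v_audio, alpha_im, alpha_au):
    """Assumed CFG-style mix: push the unconditional velocity towards
    the image-conditioned and audio-conditioned velocities, each with
    its own guidance weight."""
    return [
        u + alpha_im * (vi - u) + alpha_au * (va - u)
        for u, vi, va in zip(v_uncond, v_image, v_audio)
    ]


# Toy velocity predictions for the text stream (values are made up).
v_uncond = [0.0, 0.0]
v_image = [1.0, 2.0]   # prediction when conditioning on the image only
v_audio = [3.0, 4.0]   # prediction when conditioning on the audio only

# Lean towards the image condition, only weakly towards the audio.
v_guided = multimodal_guidance(v_uncond, v_image, v_audio,
                               alpha_im=1.5, alpha_au=0.5)
```

Under this assumed form, setting both weights to zero returns the unconditional prediction, and raising one weight moves the output towards the corresponding condition, which matches the qualitative behaviour the Figure 5 caption describes.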