Flowing from Words to Pixels: A Noise-Free Framework for Cross-Modality Evolution

Qihao Liu, Xi Yin, Alan Yuille, Andrew Brown, Mannat Singh

TL;DR

CrossFlow presents a direct, noise-free cross-modal flow matching framework that evolves one modality into another without conditioning. By incorporating a Variational Encoder to regularize the source latent and a binary CFG indicator, it achieves competitive text-to-image results with a vanilla transformer and demonstrates latent arithmetic and bi-directional mappings. The approach extends effectively to image captioning, depth estimation, and super-resolution, showing strong scalability and task-agnostic performance. Overall, CrossFlow offers a simpler, efficient pathway for cross-modal generation with broad practical impact in multi-modal AI systems.

Abstract

Diffusion models, and their generalization, flow matching, have had a remarkable impact on the field of media generation. Here, the conventional approach is to learn the complex mapping from a simple source distribution of Gaussian noise to the target media distribution. For cross-modal tasks such as text-to-image generation, this same mapping from noise to image is learnt whilst including a conditioning mechanism in the model. One key and thus far relatively unexplored feature of flow matching is that, unlike diffusion models, it does not require the source distribution to be noise. Hence, in this paper, we propose a paradigm shift, and ask whether we can instead train flow matching models to learn a direct mapping from the distribution of one modality to the distribution of another, thus obviating the need for both the noise distribution and the conditioning mechanism. We present a general and simple framework, CrossFlow, for cross-modal flow matching. We show the importance of applying Variational Encoders to the input data, and introduce a method to enable classifier-free guidance. Surprisingly, for text-to-image, CrossFlow with a vanilla transformer without cross-attention slightly outperforms standard flow matching, and we show that it scales better with training steps and model size, while also allowing for interesting latent arithmetic which results in semantically meaningful edits in the output space. To demonstrate the generalizability of our approach, we also show that CrossFlow is on par with or outperforms the state-of-the-art for various cross-modal / intra-modal mapping tasks, viz. image captioning, depth estimation, and image super-resolution. We hope this paper contributes to accelerating progress in cross-modal media generation.
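The idea of flowing directly from a source latent to a target latent, rather than from noise, can be illustrated with a minimal sketch. It assumes the common straight-line (linear interpolation) path between a source latent `z0` and a target latent `z1`, plain Python lists in place of tensors, and a fixed-step Euler sampler; none of these choices is confirmed by the text above. The function names, and the weighted blend used for guidance with a binary indicator, are likewise hypothetical and are not taken from the paper.

```python
def interpolate(z0, z1, t):
    """Point at time t on the straight path from source latent z0 to target z1."""
    return [(1.0 - t) * a + t * b for a, b in zip(z0, z1)]


def target_velocity(z0, z1):
    """Velocity of the straight path; it is constant in t (assumed path)."""
    return [b - a for a, b in zip(z0, z1)]


def flow_matching_loss(velocity_fn, z0, z1, t):
    """Mean squared error between predicted and target velocity at time t."""
    zt = interpolate(z0, z1, t)
    pred = velocity_fn(zt, t)
    target = target_velocity(z0, z1)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(target)


def euler_sample(velocity_fn, z0, num_steps=10):
    """Integrate dz/dt = v(z, t) from t = 0 to t = 1 with fixed-step Euler."""
    z = list(z0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        v = velocity_fn(z, i * dt)
        z = [zi + dt * vi for zi, vi in zip(z, v)]
    return z


def guided_velocity(model, z, t, scale):
    """Guidance-style blend of two predictions from one model.

    Assumption: model(z, t, indicator) takes a binary indicator, where 1
    requests the paired target and 0 requests an unconditional target.
    """
    v_cond = model(z, t, 1)
    v_uncond = model(z, t, 0)
    return [u + scale * (c - u) for c, u in zip(v_cond, v_uncond)]
```

With a trained model, a guided sampler could be assembled as `euler_sample(lambda z, t: guided_velocity(model, z, t, scale), z0)`, where `z0` is the encoded source modality rather than Gaussian noise.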

Paper Structure

This paper contains 27 sections, 7 equations, 13 figures, 10 tables.

Figures (13)

  • Figure 1: We propose CrossFlow, a general and simple framework that directly evolves one modality to another using flow matching with no additional conditioning. This is enabled using a vanilla transformer without cross-attention, achieving comparable performance with state-of-the-art models on (a) text-to-image generation, and (b) various other tasks, without requiring task-specific architectures.
  • Figure 2: CrossFlow Architecture. CrossFlow enables direct evolution between two different modalities. Taking text-to-image generation as an example, our T2I model comprises two main components: a Text Variational Encoder and a standard flow matching model. At inference time, we utilize the Text Variational Encoder to extract the text latent $z_0\in\mathbb{R}^{h \times w \times c}$ from text embedding $x\in\mathbb{R}^{n\times d}$ produced by any language model. Then we directly evolve this text latent into the image space to generate image latent $z_1\in\mathbb{R}^{h \times w \times c}$.
  • Figure 3: Performance vs. Model Parameters and Iterations. We compare CrossFlow against a baseline that starts from noise and uses text cross-attention, while controlling for data, model size and training steps. Left: Larger models are able to exploit the cross-modality connection better. Right: CrossFlow needs more steps to converge, but converges to better final performance. Overall, CrossFlow scales better than the baseline and can serve as the framework for future media generation models.
  • Figure 4: CrossFlow provides visually smooth interpolations in the latent space. We show images generated by linear interpolation between the first (left) and second (right) text latents. CrossFlow enables visually smooth transformations of object direction, composite colors, shapes, background scenes, and even object categories. Please zoom in for better visualization. For brevity, we display only 7 interpolating images here; additional interpolating images can be found in the supplementary material.
  • Figure 5: CrossFlow allows arithmetic in text latent space. Using the Text Variational Encoder (VE), we first map the input text into the latent space $z_0$. Arithmetic operations are then performed in this latent space, and the resulting latent representation is used to generate the corresponding image. The latent code $z_0$ used to generate each image is provided at the bottom.
  • ...and 8 more figures
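
The latent operations described in the captions for Figures 2, 4, and 5 can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes the Text Variational Encoder predicts a mean and a log-variance per latent element (as in a standard VAE-style encoder), it flattens the $h \times w \times c$ latent into a plain Python list, and all function names are hypothetical.

```python
import math
import random


def sample_latent(mu, log_var, rng):
    """Reparameterized sample z0 = mu + sigma * eps, with eps ~ N(0, 1).

    Assumption: the encoder outputs a mean and log-variance per element.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def lerp(z_a, z_b, alpha):
    """Linear interpolation between two text latents (alpha in [0, 1])."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(z_a, z_b)]


def interpolation_path(z_a, z_b, num_points):
    """Evenly spaced latents from z_a to z_b, endpoints included."""
    return [lerp(z_a, z_b, i / (num_points - 1)) for i in range(num_points)]


def latent_arithmetic(z_base, z_add, z_subtract):
    """Element-wise z_base + z_add - z_subtract in the text latent space."""
    return [b + a - s for b, a, s in zip(z_base, z_add, z_subtract)]


# Toy usage with 2-element latents standing in for real encoder outputs.
rng = random.Random(0)
z_first = sample_latent([0.0, 0.0], [-2.0, -2.0], rng)
z_second = sample_latent([6.0, 12.0], [-2.0, -2.0], rng)
path = interpolation_path(z_first, z_second, 7)  # 7 latents, as in Figure 4
```

Each latent in `path`, or the result of `latent_arithmetic`, would then be passed to the flow matching model as the starting point $z_0$ and evolved into an image latent $z_1$.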