SOWing Information: Cultivating Contextual Coherence with MLLMs in Image Generation

Yuhan Pei, Ruoyu Wang, Yongqi Yang, Ye Zhu, Olga Russakovsky, Yu Wu

TL;DR

The paper studies diffusion-based image generation and identifies interference between image regions during diffusion as a key challenge. It introduces two training-free methods, Cyclic One-Way Diffusion (COW) and Selective One-Way Diffusion (SOW), that aim for pixel-level fidelity and semantic coherence in text-vision-to-image (TV2I) generation. SOW uses Multimodal Large Language Models (MLLMs) to reason about spatial relations and semantic content, and combines this with dynamic attention modulation to steer information diffusion according to context. In experiments on CelebA-TV2I, SOW is reported to achieve higher condition fidelity and faster generation than the compared methods, with ablations supporting its design, which points to a scalable, training-free approach to controllable diffusion-based generation for customizable image synthesis.

Abstract

Originating from the diffusion phenomenon in physics, which describes the random movement and collisions of particles, diffusion generative models simulate a random walk in the data space along the denoising trajectory. This allows information to diffuse across regions, yielding harmonious outcomes. However, the chaotic and disordered nature of information diffusion in diffusion models often results in undesired interference between image regions, causing degraded detail preservation and contextual inconsistency. In this work, we address these challenges by reframing disordered diffusion as a powerful tool for text-vision-to-image generation (TV2I) tasks, achieving pixel-level condition fidelity while maintaining visual and semantic coherence throughout the image. We first introduce Cyclic One-Way Diffusion (COW), which provides an efficient unidirectional diffusion framework for precise information transfer while minimizing disruptive interference. Building on COW, we further propose Selective One-Way Diffusion (SOW), which utilizes Multimodal Large Language Models (MLLMs) to clarify the semantic and spatial relationships within the image. Based on these insights, SOW combines attention mechanisms to dynamically regulate the direction and intensity of diffusion according to contextual relationships. Extensive experiments demonstrate the untapped potential of controlled information diffusion, offering a path to more adaptive and versatile generative models in a learning-free manner.

Paper Structure

This paper contains 21 sections, 8 equations, 14 figures, 10 tables, 1 algorithm.

Figures (14)

  • Figure 1: Comparisons with existing methods (rombach2022ldm, controlnet, gal2022ti, ruiz2022dreambooth) for maintaining the fidelity of text and visual conditions in different application scenarios. We consistently achieve superior fidelity to both text and visual conditions in all three settings. In contrast, other learning-based approaches struggle to attain the same level of performance in diverse scenarios.
  • Figure 1: Text sensitivity during the denoising process. Each line shows the CLIP cosine similarity between the generated image and the text when the text condition is injected at a different step. Each image is generated with a 1000-step denoising process that is unconditional except for 100 steps of text-conditional guidance starting from step $t$. Generally, the denoising process is more responsive to the text condition at the beginning and almost stops reacting to it once the high-level semantics are settled.
  • Figure 2: A cognitive-inspired approach for image generation. Starting with a partial visual input (left), we leverage a multimodal large language model to identify key properties (recognition) and infer the missing components (reasoning), guiding the generative model to complete the image (generation). Our selective diffusion mechanism further refines the process by directing information flow to the appropriate regions, ensuring the output (right) is contextually accurate and visually coherent.
  • Figure 2: Different sizes come with different semantic formation processes. Each red curve represents a face disturb-and-reconstruct process, where (a) the face size is $256 \times 256$ and (b) the face size is $128 \times 128$. We disturb the reconstruction process by pasting the original image onto a random-noise background ($512 \times 512$) at different steps. The overall semantics settle earlier when the size is larger.
  • Figure 3: The pipeline of our proposed SOW method. Given the visual condition and the text condition, we first employ an MLLM, Gemini (team2023gemini), to infer the textual description, the adaptive location box of the visual conditional region, and the box of the condition-related region through a three-stage reasoning process. The input visual condition is then affixed to a predefined background, which serves as the seed initialization for the cycle. During the Cyclic One-Way Diffusion process, we "disturb" and "reconstruct" the image in a cyclic way and ensure a continuous one-way diffusion by consistently replacing the image with the corresponding $\mathbf{x_t}$. Using this prior information, we then execute cyclic diffusion with dynamic attention modulation, enhancing the coherence and accuracy of the generated outputs.
  • ...and 9 more figures
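
The "disturb" and "reconstruct" cycle described in the Figure 3 caption can be illustrated with a small NumPy sketch. This is only a toy under stated assumptions, not the authors' implementation: the linear beta schedule, the zero background, the function names, and especially `toy_denoise_step` (a neighbor-averaging stand-in for a pretrained denoiser) are all hypothetical. What it tries to show is the one-way replacement: after every denoising step, the condition region is overwritten with the visual condition noised to the matching level, so information can flow out of that region but not back into it.

```python
import numpy as np

def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an assumed linear schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(x0, t, alpha_bars, rng):
    """Forward-diffuse a clean image to level t:
    x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

def disturb(x, t_from, t_to, alpha_bars, rng):
    """Re-noise an image already at level t_from up to level t_to > t_from."""
    ratio = alpha_bars[t_to] / alpha_bars[t_from]
    eps = rng.standard_normal(x.shape)
    return np.sqrt(ratio) * x + np.sqrt(1.0 - ratio) * eps

def paste(canvas, patch, top, left):
    """Return a copy of canvas with patch written at (top, left)."""
    out = canvas.copy()
    h, w = patch.shape[:2]
    out[top:top + h, left:left + w] = patch
    return out

def toy_denoise_step(x, t):
    """Hypothetical stand-in for a pretrained denoiser: averages each pixel
    with its four neighbors so that regions visibly exchange information."""
    return 0.2 * (x + np.roll(x, 1, 0) + np.roll(x, -1, 0)
                  + np.roll(x, 1, 1) + np.roll(x, -1, 1))

def cyclic_one_way(condition, canvas_shape, box, t_high, t_low,
                   num_cycles, alpha_bars, seed=0):
    """Toy cyclic one-way diffusion on a single-channel image."""
    rng = np.random.default_rng(seed)
    top, left = box
    # Seed initialization: visual condition affixed to a plain background.
    seed_image = paste(np.zeros(canvas_shape), condition, top, left)
    x = add_noise(seed_image, t_high, alpha_bars, rng)
    for cycle in range(num_cycles):
        # Reconstruct: denoise from t_high down to t_low.
        for t in range(t_high, t_low, -1):
            x = toy_denoise_step(x, t)
            # One-way step: overwrite the condition region with the
            # condition noised to the level the image has just reached.
            x = paste(x, add_noise(condition, t - 1, alpha_bars, rng), top, left)
        # Disturb: push the image back up to t_high, except after the last cycle.
        if cycle < num_cycles - 1:
            x = disturb(x, t_low, t_high, alpha_bars, rng)
    return x

alpha_bars = make_alpha_bars()
condition = np.ones((8, 8))
result = cyclic_one_way(condition, (32, 32), (12, 12), t_high=50, t_low=0,
                        num_cycles=3, alpha_bars=alpha_bars)
```

In this sketch the condition region of `result` stays close to the input patch, while pixels just outside the box pick up its values through the averaging step. The dynamic attention modulation and the MLLM-inferred boxes that SOW adds on top are not modeled here.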