Text-to-Image Alignment in Denoising-Based Models through Step Selection

Paul Grimal, Hervé Le Borgne, Olivier Ferret

TL;DR

This work addresses text-image misalignment in denoising-based generative models by identifying and exploiting an optimal denoising step to refine semantic content. It extends Generative Semantic Nursing (GSN) to Flow Matching and demonstrates that a single, carefully chosen IterRef step can substantially improve semantic fidelity, outperforming several inference-time baselines. The approach uses attention-based signal enhancement with CN and IoU losses, leveraging CLIP/T5 signals to guide latent refinements, and validates results with TIAM, CLIP-based metrics, and human studies on SD 1.4 and SD3. The findings show that late-stage refinements offer stronger signals for aligning text prompts with generated images, while reducing the hyperparameter burden and enabling efficient, scalable improvements in text-to-image alignment.

Abstract

Visual generative AI models often encounter challenges related to text-image alignment and reasoning limitations. This paper presents a novel method for selectively enhancing the signal at critical denoising steps, optimizing image generation based on input semantics. Our approach addresses the shortcomings of early-stage signal modifications, demonstrating that adjustments made at later stages yield superior results. We conduct extensive experiments to validate the effectiveness of our method in producing semantically aligned images on both Diffusion and Flow Matching models, achieving state-of-the-art performance. Our results highlight the importance of a judicious choice of sampling stage to improve performance and overall image alignment.

Paper Structure

This paper contains 47 sections, 6 equations, 35 figures, 14 tables.

Figures (35)

  • Figure 1: Samples generated by Stable Diffusion vs. Ours.
  • Figure 2: The diffusion process is paused at a key step (determined on a validation subset) to enhance the signal in the latent image. By amplifying the signal at this critical point, we ensure that the model can correctly construct the main components of the image, leading to a more accurate final result.
  • Figure 3: Value of $a_t$ as a function of the timestep $t$ ($t=0$ for the target distribution and $t=1000$ for the Gaussian). The estimated $\hat{x}_0$ during the generation of "a photo of a tiger on a boat arriving in new york" at various steps is displayed. A coarse-to-fine generation is observed; as the denoising process progresses, the scene becomes increasingly distinguishable. Generated with Stable Diffusion 1.4.
  • Figure 4: Accumulated TIAM scores without (left) and with (right) GSNg. The dataset with three colored entities is excluded on the left due to its low scores. Steps 821 and 941 are identified as optimal.
  • Figure 5: Images generated without GSNg (SD 1.4; same seeds)
  • ...and 30 more figures
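
The procedure described in the Figure 2 caption (pause sampling at one selected step, refine the latent, then resume) can be illustrated with a minimal toy sketch. This is not the authors' implementation: the function names, the scalar-list latent, the halving "denoiser", the quadratic stand-in loss, the toy timesteps, and the validation scores are all hypothetical placeholders chosen only to make the control flow runnable. In the paper the refinement signal comes from attention-based losses, which are not reproduced here.

```python
from typing import Callable, Dict, List, Optional, Sequence

Latent = List[float]


def refine_latent(latent: Latent,
                  loss_grad: Callable[[Latent], Latent],
                  step_size: float = 0.5,
                  n_iters: int = 1) -> Latent:
    """Run a few gradient-descent updates on the latent (stand-in for signal enhancement)."""
    z = list(latent)
    for _ in range(n_iters):
        g = loss_grad(z)
        z = [zi - step_size * gi for zi, gi in zip(z, g)]
    return z


def sample_with_step_selection(latent: Latent,
                               timesteps: Sequence[int],
                               denoise_step: Callable[[Latent, int], Latent],
                               loss_grad: Callable[[Latent], Latent],
                               refine_at: Optional[int]) -> Latent:
    """Denoise over `timesteps`, refining the latent only at the step `refine_at`."""
    z = list(latent)
    for t in timesteps:
        if t == refine_at:
            # Pause sampling at the selected step and refine before continuing.
            z = refine_latent(z, loss_grad)
        z = denoise_step(z, t)
    return z


def select_step(candidates: Sequence[int],
                validation_score: Callable[[int], float]) -> int:
    """Pick the candidate step with the highest score on a validation subset."""
    return max(candidates, key=validation_score)


# --- Toy setup (all values hypothetical) ---
TARGET: Latent = [2.0, 0.0]


def toy_loss_grad(z: Latent) -> Latent:
    # Gradient of 0.5 * ||z - TARGET||^2, a placeholder for an alignment loss.
    return [zi - ti for zi, ti in zip(z, TARGET)]


def toy_denoise_step(z: Latent, t: int) -> Latent:
    # Placeholder denoiser: halves the latent regardless of t.
    return [0.5 * zi for zi in z]


toy_timesteps = [900, 600, 300, 0]
toy_scores: Dict[int, float] = {900: 0.2, 600: 0.7, 300: 0.4}

best_step = select_step(list(toy_scores), lambda t: toy_scores[t])
start: Latent = [1.0, -1.0]
baseline = sample_with_step_selection(start, toy_timesteps, toy_denoise_step,
                                      toy_loss_grad, refine_at=None)
refined = sample_with_step_selection(start, toy_timesteps, toy_denoise_step,
                                     toy_loss_grad, refine_at=best_step)
```

The point of the sketch is only the structure: the step is chosen once, offline, from validation scores, and at inference the refinement is applied at that single step rather than throughout sampling.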