Text-to-Image Alignment in Denoising-Based Models through Step Selection
Paul Grimal, Hervé Le Borgne, Olivier Ferret
TL;DR
This work addresses text-image misalignment in denoising-based generative models by identifying and exploiting an optimal denoising step to refine semantic content. It extends Generative Semantic Nursing (GSN) to Flow Matching and demonstrates that a single, carefully chosen IterRef step can substantially improve semantic fidelity, outperforming several inference-time baselines. The approach uses attention-based signal enhancement with CN and IoU losses, leveraging CLIP/T5 signals to guide latent refinements, and validates results with TIAM, CLIP-based metrics, and human studies on SD 1.4 and SD3. The findings show that late-stage refinements offer stronger signals for aligning text prompts with generated images, while reducing the hyperparameter burden and enabling efficient, scalable improvements in text-to-image alignment.
Abstract
Visual generative AI models often encounter challenges related to text-image alignment and reasoning limitations. This paper presents a novel method for selectively enhancing the signal at critical denoising steps, optimizing image generation based on input semantics. Our approach addresses the shortcomings of early-stage signal modifications, demonstrating that adjustments made at later stages yield superior results. We conduct extensive experiments to validate the effectiveness of our method in producing semantically aligned images on Diffusion and Flow Matching models, achieving state-of-the-art performance. Our results highlight the importance of a judicious choice of sampling stage to improve performance and overall image alignment.
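
As a rough illustration of the idea of refining the latent at one selected sampling step, the toy sketch below runs a sampling loop and applies iterative gradient-based refinement at a single step only. It is not the paper's implementation: `toy_loss_and_grad`, `denoise_step`, and every parameter value are hypothetical stand-ins. In particular, the quadratic loss replaces the attention-based losses and the `0.9 * z` update replaces a real denoiser.

```python
import numpy as np


def toy_loss_and_grad(z, target):
    """Hypothetical stand-in for an attention-based alignment loss.

    Returns a quadratic loss and its gradient with respect to z.
    """
    diff = z - target
    return 0.5 * float(np.sum(diff ** 2)), diff


def denoise_step(z, t):
    """Hypothetical stand-in for one denoising (or flow) update."""
    return 0.9 * z


def sample_with_single_refinement(z, num_steps, refine_step, target,
                                  lr=0.5, max_iters=20, tol=1e-3):
    """Run the sampling loop, refining the latent only at `refine_step`."""
    refine_iters = 0
    for t in range(num_steps):
        if t == refine_step:
            # Iterative refinement: gradient steps on the latent until the
            # loss falls below `tol` or the iteration budget is spent.
            for _ in range(max_iters):
                loss, grad = toy_loss_and_grad(z, target)
                if loss < tol:
                    break
                z = z - lr * grad
                refine_iters += 1
        z = denoise_step(z, t)
    return z, refine_iters


z0 = np.ones((4, 4))
target = np.zeros((4, 4))

# Refinement applied at a late step (index 3 of 5).
z_refined, n_refined = sample_with_single_refinement(
    z0, num_steps=5, refine_step=3, target=target)

# Baseline without refinement: step index -1 is never reached.
z_plain, n_plain = sample_with_single_refinement(
    z0, num_steps=5, refine_step=-1, target=target)
```

In this sketch the only choice to tune is `refine_step`, which mirrors the claim that a single well-chosen step reduces the hyperparameter burden; the toy says nothing about which step is best in a real model.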
