SAGA: Learning Signal-Aligned Distributions for Improved Text-to-Image Generation

Paul Grimal, Michaël Soumm, Hervé Le Borgne, Olivier Ferret, Akihiro Sugimoto

TL;DR

SAGA reframes alignment in text-to-image generation as learning a signal-aligned Gaussian over intermediate latents conditioned on the prompt. By approximating $p(\mathbf{z}_t|\mathbf{y})$ with a Gaussian centered at $a_t\tilde{\boldsymbol{\mu}}_{\mathbf{y}}$ and stabilizing the signal through rescaling, it enables training-free sampling of multiple high-fidelity, prompt-consistent images, even with bounding-box conditioning. A cross-modal attention criterion and an inference-time optimization strategy selectively adjust the latent distribution to reflect the target semantics while preserving in-distribution behavior. Empirically, SAGA and its variants outperform strong training-free baselines on SD 1.4 and SD 3 across text and layout-conditioned tasks, with user studies corroborating improved semantic alignment and image quality. The approach offers practical benefits for real-world deployment by enabling efficient, diverse, high-quality generations without retraining the underlying diffusion/flow models.
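The sampling idea in the summary above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `mu` is a hypothetical stand-in for $\tilde{\boldsymbol{\mu}}_{\mathbf{y}}$ (which the method obtains by inference-time optimization), the isotropic standard deviation `b_t` is an assumption borrowed from the forward process $\mathbf{z}_t = a_t\mathbf{z}_0 + b_t\boldsymbol{\epsilon}$, and the function name, shapes, and values are made up for the example.

```python
import numpy as np


def sample_signal_aligned_latents(mu, a_t, b_t, num_samples, seed=0):
    """Draw latents z_t = a_t * mu + b_t * eps with eps ~ N(0, I).

    mu  : array standing in for the learned signal mean (hypothetical).
    a_t : signal coefficient at timestep t.
    b_t : noise coefficient at timestep t (assumed isotropic std).
    """
    rng = np.random.default_rng(seed)
    # One independent noise draw per requested sample.
    eps = rng.standard_normal((num_samples, *mu.shape))
    # Broadcasting adds the shared signal term to every noise draw.
    return a_t * mu + b_t * eps


# Toy values for illustration only.
mu = np.ones((4, 8, 8))
a_t, b_t = 0.6, 0.8
latents = sample_signal_aligned_latents(mu, a_t, b_t, num_samples=2000)
print(latents.shape)  # (2000, 4, 8, 8)
print(float(latents.mean()), float(latents.std()))
```

Because every sample shares the same signal term `a_t * mu` and differs only in its noise draw, a single optimized mean can yield many distinct latents, each of which would then be denoised by the unchanged model from timestep $t$ onward.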

Abstract

State-of-the-art text-to-image models produce visually impressive results but often struggle with precise alignment to text prompts, leading to missing critical elements or unintended blending of distinct concepts. We propose a novel approach that learns a high-success-rate distribution conditioned on a target prompt, ensuring that generated images faithfully reflect the corresponding prompts. Our method explicitly models the signal component during the denoising process, offering fine-grained control that mitigates over-optimization and out-of-distribution artifacts. Moreover, our framework is training-free and seamlessly integrates with both existing diffusion and flow matching architectures. It also supports additional conditioning modalities -- such as bounding boxes -- for enhanced spatial alignment. Extensive experiments demonstrate that our approach outperforms current state-of-the-art methods. The code is available at https://github.com/grimalPaul/gsn-factory.

Paper Structure

This paper contains 67 sections, 3 theorems, 18 equations, 30 figures, 15 tables, 3 algorithms.

Key Result

Proposition 1

Consider a generative model that produces latent representations $\mathbf{z}_0$ conditioned on a prompt $\mathbf{y}$. In the forward process, the latents are noised according to $\mathbf{z}_t = a_t\mathbf{z}_0 + b_t \boldsymbol{\epsilon}, \ \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0},\mathbf{I})$ ...
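
The statement above is cut off in this extract, so the following is not a reconstruction of the proposition. It is only a short worked consequence of the forward process it names, assuming $\boldsymbol{\epsilon}$ is drawn independently of $\mathbf{z}_0$ and $\mathbf{y}$ (the usual convention, not restated here):

$$
\mathbf{z}_t \mid \mathbf{z}_0 \sim \mathcal{N}\!\left(a_t\mathbf{z}_0,\ b_t^2\mathbf{I}\right),
\qquad
\mathbb{E}[\mathbf{z}_t \mid \mathbf{y}]
= a_t\,\mathbb{E}[\mathbf{z}_0 \mid \mathbf{y}] + b_t\,\mathbb{E}[\boldsymbol{\epsilon} \mid \mathbf{y}]
= a_t\,\mathbb{E}[\mathbf{z}_0 \mid \mathbf{y}].
$$

The noise term therefore leaves the conditional mean unchanged up to the factor $a_t$, which is consistent with approximating $p(\mathbf{z}_t|\mathbf{y})$ by a Gaussian centered at a scaled signal term of the form $a_t\tilde{\boldsymbol{\mu}}_{\mathbf{y}}$.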

Figures (30)

  • Figure 1: (Left) Text-image alignment issues on SD 3 vs. Ours. Subject mixing: the giraffe's features are incorrectly applied to the dog for SD 3. Catastrophic neglect: the zebra is missing for SD 3. (Right) GenEval performance of SAGA against other models.
  • Figure 2: Comparison of different alignment approaches, aiming to generate samples from a distribution $p(\mathbf{z}_0|\mathbf{y})$ for a given prompt $\mathbf{y}$. While GSN corrects the latents at specific timesteps and InitNO optimizes the initialization in the prior Gaussian distribution at $t=T$, our SAGA approach directly samples from a conditional Gaussian prior that approximates $p(\mathbf{z}_t|\mathbf{y})$ for a timestep $t<T$.
  • Figure 3: Cross-attention maps for the entities bear and bird (bottom-left), computed with the considered diffused latent $\mathbf{z}_t$ (illustrated at top-left by the corresponding final image estimation $\hat{\mathbf{z}}_0(\mathbf{z}_t, \mathbf{c}, t)$). Before optimization (left, with the standard SD 1.4), only one of the two maps is active; after our optimization procedure (right, with SAGA), both maps are active.
  • Figure 4: Effect of our rescaling mechanism on images generated with SAGA on SD 3; hyperparameters are identical.
  • Figure 5: Generated images across different methods using SD 1.4. Images in the same column are generated with the same seed.
  • ...and 25 more figures

Theorems & Definitions (5)

  • Proposition 1
  • Lemma 1
  • proof
  • Proposition 1
  • proof