SAGA: Learning Signal-Aligned Distributions for Improved Text-to-Image Generation

Paul Grimal; Michaël Soumm; Hervé Le Borgne; Olivier Ferret; Akihiro Sugimoto

TL;DR

SAGA reframes prompt alignment in text-to-image generation as learning a signal-aligned Gaussian distribution over intermediate latents, conditioned on the prompt. It approximates $p(\mathbf{z}_t \mid \mathbf{y})$ with a Gaussian centered at $a_t\tilde{\boldsymbol{\mu}}_{\mathbf{y}}$ and stabilizes the signal through rescaling, which enables training-free sampling of multiple high-fidelity, prompt-consistent images, including under bounding-box conditioning. A cross-modal attention criterion and an inference-time optimization strategy selectively adjust the latent distribution to reflect the target semantics while keeping samples in-distribution. Empirically, SAGA and its variants outperform strong training-free baselines on SD 1.4 and SD 3 across text-conditioned and layout-conditioned tasks, and user studies corroborate the improved semantic alignment and image quality. Because a single learned distribution yields many diverse, high-quality generations without retraining the underlying diffusion or flow-matching model, the approach is practical for real-world deployment.
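To illustrate why a learned Gaussian allows many images to be drawn for one prompt, here is a minimal sketch of sampling from a distribution centered at $a_t\tilde{\boldsymbol{\mu}}_{\mathbf{y}}$. It is not the authors' implementation: the function name `sample_signal_aligned_latents`, the isotropic noise scale `b_t`, and the toy three-dimensional `mu` are assumptions made for illustration only, and the step that actually learns the mean is omitted.

```python
import random


def sample_signal_aligned_latents(mu, a_t, b_t, n_samples, seed=0):
    """Draw latents z_t = a_t * mu + b_t * eps with eps ~ N(0, I).

    mu  : learned signal mean for the prompt (flat list of floats)
    a_t : signal scale at timestep t
    b_t : noise scale at timestep t (assumed isotropic in this sketch)
    """
    rng = random.Random(seed)
    return [
        [a_t * m + b_t * rng.gauss(0.0, 1.0) for m in mu]
        for _ in range(n_samples)
    ]


# Toy example: one learned mean, several distinct latents for the same prompt.
mu = [0.5, -1.0, 2.0]
latents = sample_signal_aligned_latents(mu, a_t=0.8, b_t=0.6, n_samples=4)
```

Under these assumptions, each returned latent would be passed to the unchanged denoiser, so the learned mean is reused across all samples rather than re-optimized per image.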

Abstract

State-of-the-art text-to-image models produce visually impressive results but often struggle with precise alignment to text prompts, leading to missing critical elements or unintended blending of distinct concepts. We propose a novel approach that learns a high-success-rate distribution conditioned on a target prompt, ensuring that generated images faithfully reflect the corresponding prompts. Our method explicitly models the signal component during the denoising process, offering fine-grained control that mitigates over-optimization and out-of-distribution artifacts. Moreover, our framework is training-free and seamlessly integrates with both existing diffusion and flow matching architectures. It also supports additional conditioning modalities -- such as bounding boxes -- for enhanced spatial alignment. Extensive experiments demonstrate that our approach outperforms current state-of-the-art methods. The code is available at https://github.com/grimalPaul/gsn-factory.
