Towards Transformer-Based Aligned Generation with Self-Coherence Guidance

Shulei Wang, Wang Lin, Hai Huang, Hanting Wang, Sihang Cai, WenKang Han, Tao Jin, Jingyuan Chen, Jiacheng Sun, Jieming Zhu, Zhou Zhao

TL;DR

The paper tackles semantic misalignment in Transformer-based Text-Guided Diffusion Models (TGDMs) by introducing Self-Coherence Guidance (SCG), a training-free method that directly refines cross-attention maps during generation using masks derived from prior denoising steps. Rather than transferring U-Net alignment methods, which primarily optimize the latent space and have shown limited effectiveness on Transformer architectures, SCG leverages the model's own attention structure to improve coarse-grained attribute binding, fine-grained attribute binding, and style binding. The paper also constructs more challenging benchmarks for these three tasks and reports state-of-the-art performance in qualitative, quantitative, and user studies, including generalization to other backbones such as Flux. The result is a training-free path to better-aligned TGDM outputs, with practical implications for controllable image synthesis and multi-concept prompts.

Abstract

We introduce a novel, training-free approach for enhancing alignment in Transformer-based Text-Guided Diffusion Models (TGDMs). Existing TGDMs often struggle to generate semantically aligned images, particularly when dealing with complex text prompts or multi-concept attribute binding challenges. Previous U-Net-based methods primarily optimized the latent space, but their direct application to Transformer-based architectures has shown limited effectiveness. Our method addresses these challenges by directly optimizing cross-attention maps during the generation process. Specifically, we introduce Self-Coherence Guidance, a method that dynamically refines attention maps using masks derived from previous denoising steps, ensuring precise alignment without additional training. To validate our approach, we constructed more challenging benchmarks for evaluating coarse-grained attribute binding, fine-grained attribute binding, and style binding. Experimental results demonstrate the superior performance of our method, significantly surpassing other state-of-the-art methods across all evaluated tasks. Our code is available at https://scg-diffusion.github.io/scg-diffusion.
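
The core idea in the abstract, reusing a mask from a previous denoising step to refine the current cross-attention map, can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the quantile thresholding rule, the function names `mask_from_previous_step` and `guide_attention`, the `strength` parameter, and the array shapes are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mask_from_previous_step(prev_attn, token_idx, quantile=0.5):
    """Binary mask over image tokens where a concept token attended most
    in the previous step (hypothetical thresholding rule)."""
    col = prev_attn[:, token_idx]
    return (col >= np.quantile(col, quantile)).astype(prev_attn.dtype)

def guide_attention(attn, attr_idx, mask, strength=0.9):
    """Damp the attribute token's attention outside the mask,
    then renormalize each image token's row to sum to 1."""
    guided = attn.copy()
    guided[:, attr_idx] *= mask + (1.0 - strength) * (1.0 - mask)
    return guided / guided.sum(axis=1, keepdims=True)

# Toy maps: 16 image tokens attending over 4 text tokens.
rng = np.random.default_rng(0)
prev_attn = softmax(rng.normal(size=(16, 4)))  # previous denoising step
attn = softmax(rng.normal(size=(16, 4)))       # current denoising step

# Assume text token 1 is the object and text token 2 is its attribute.
mask = mask_from_previous_step(prev_attn, token_idx=1)
guided = guide_attention(attn, attr_idx=2, mask=mask)
```

In this sketch the attribute token keeps its attention inside the object's region and loses most of it elsewhere; how the actual method builds and applies its masks is described in the paper itself.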

Paper Structure

This paper contains 26 sections, 2 equations, 14 figures, 7 tables, 1 algorithm.

Figures (14)

  • Figure 1: Our method directly optimizes the cross-attention maps in Transformer-based diffusion models, significantly enhancing the model's performance in coarse-grained attribute binding and further improving fine-grained attribute and style binding. For instance, our approach enables precise control over the color of an apple’s flesh and stem as well as the style of two distinct concepts.
  • Figure 2: (a) Overview of our method. Given a prompt, we extract the corresponding concept masks and use these masks to directly guide the attribute or style maps. (b) For fine-grained attribute binding, we extract masks by planning the proportions using LLMs. (c) For coarse-grained attribute binding and style binding, we directly apply clustering methods to extract the corresponding masks.
  • Figure 3: Qualitative analysis of our method compared to other SOTA methods. Our approach consistently generates high-quality images with superior alignment across coarse-grained attribute binding, fine-grained attribute binding, and style binding tasks.
  • Figure 4: Qualitative results of directly transferring D&B and CONFORM methods to Transformer-based architectures.
  • Figure 5: Generation results of U-Net-based CONFORM using cross-attention maps at different resolutions, where the results at corresponding positions are generated using the same random seed. The text prompt is "a purple dog and a green bench".
  • ...and 9 more figures