Enhancing MMDiT-Based Text-to-Image Models for Similar Subject Generation

Tianyi Wei, Dongdong Chen, Yifan Zhou, Xingang Pan

TL;DR

This work designs three loss functions (Block Alignment Loss, Text Encoder Alignment Loss, and Overlap Loss), each tailored to one of the ambiguities within the MMDiT architecture that cause subject neglect or mixing when generating multiple similar subjects.

Abstract

The Multimodal Diffusion Transformer (MMDiT), the current state of the art among text-to-image models, mitigates many generation issues found in earlier models. However, we find that it still suffers from subject neglect or mixing when the input text prompt contains multiple subjects of similar semantics or appearance. We identify three possible ambiguities within the MMDiT architecture that cause this problem: Inter-block Ambiguity, Text Encoder Ambiguity, and Semantic Ambiguity. To address them, we propose to repair the ambiguous latent on the fly through test-time optimization at early denoising steps. Specifically, we design three loss functions, Block Alignment Loss, Text Encoder Alignment Loss, and Overlap Loss, each tailored to mitigate one of these ambiguities. Despite significant improvements, semantic ambiguity persists when generating multiple similar subjects, because the guidance provided by the Overlap Loss is not explicit enough. We therefore further propose Overlap Online Detection and a Back-to-Start Sampling Strategy to alleviate the problem. Experimental results on a newly constructed, challenging dataset of similar subjects validate the effectiveness of our approach, showing superior generation quality and much higher success rates than existing methods. Our code will be available at https://github.com/wtybest/EnMMDiT.
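
As a rough illustration of what an overlap penalty between subjects could look like, the sketch below scores how much the attention maps of different subject tokens cover the same image positions. It is not the paper's definition of Overlap Loss: the function names `normalize` and `overlap_loss`, the flattened-list representation of an attention map, and the choice of summed element-wise minima as the overlap measure are all assumptions made for illustration.

```python
def normalize(attn):
    """Scale a flattened, non-negative attention map so its entries sum to 1."""
    total = sum(attn)
    if total <= 0:
        raise ValueError("attention map must have positive total mass")
    return [a / total for a in attn]


def overlap_loss(attn_maps):
    """Hypothetical overlap penalty over per-subject attention maps.

    Each map is a flat list of non-negative weights over image positions.
    The overlap of two normalized maps is taken as the sum of their
    element-wise minima: 0 when the maps are disjoint, 1 when identical.
    The penalty is the sum of this overlap over all subject pairs.
    """
    maps = [normalize(m) for m in attn_maps]
    loss = 0.0
    for i in range(len(maps)):
        for j in range(i + 1, len(maps)):
            loss += sum(min(a, b) for a, b in zip(maps[i], maps[j]))
    return loss


# Toy maps over four image positions for two subjects.
disjoint = overlap_loss([[1, 0, 0, 0], [0, 0, 1, 0]])
partial = overlap_loss([[1, 1, 0, 0], [0, 1, 1, 0]])
identical = overlap_loss([[1, 1, 0, 0], [1, 1, 0, 0]])
```

In a test-time optimization setting, a score of this kind would presumably be computed with an automatic-differentiation framework so that its gradient with respect to the latent can be used to update the latent at early denoising steps; that step is omitted here.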

Paper Structure

This paper contains 21 sections, 6 equations, 14 figures, 4 tables, 1 algorithm.

Figures (14)

  • Figure 1: Our approach can effectively mitigate the subject neglect or mixing issues suffered by SD3 for similar subject generation.
  • Figure 2: Illustration for joint self-attention of Stable Diffusion 3.
  • Figure 3: Three types of ambiguities present in the SD3 generation process: inter-block ambiguity, text encoder ambiguity, and semantic ambiguity. All cross-attention maps are from the 5th denoising step (of 28 steps in total).
  • Figure 4: Illustration of three losses for mitigating ambiguities. Left: cross-attention maps from step 5 of Stable Diffusion 3; Right: cross-attention maps from step 5 after imposing the losses. With the losses applied, the cross-attention maps are consistent between blocks and between text encoders, with no overlap across subjects.
  • Figure 5: Overview of overlap online detection and back-to-start sampling strategy. Prompt: "a dog and a cat and a rabbit".
  • ...and 9 more figures