DCText: Scheduled Attention Masking for Visual Text Generation via Divide-and-Conquer Strategy

Jaewoo Song, Jooyoung Choi, Kanghyun Baek, Sangyub Lee, Daemin Park, Sungroh Yoon

TL;DR

This work tackles the difficulty of rendering long or multiple text strings in visual generation, a difficulty that arises because global attention is diluted across the whole image. It introduces DCText, a training-free divide-and-conquer framework that decomposes a global prompt into region-specific prompts and steers diffusion denoising with two scheduled attention masks (Text-Focus and Context-Expansion) plus Localized Noise Initialization. The approach achieves higher text accuracy while preserving image quality and keeping latency low on single- and multi-sentence benchmarks, demonstrating efficient, region-aware control over text layout. The method generalizes across MM-DiT backbones, and ablations support both the mask design and the initialization strategy.

Abstract

Although recent text-to-image models achieve high-fidelity text rendering, they still struggle with long or multiple texts due to diluted global attention. We propose DCText, a training-free visual text generation method that adopts a divide-and-conquer strategy, leveraging the reliable short-text generation of Multi-Modal Diffusion Transformers. Our method first decomposes a prompt by extracting and dividing the target text, then assigns each segment to a designated region. To render each segment accurately within its region while preserving overall image coherence, we introduce two attention masks, Text-Focus and Context-Expansion, applied sequentially during denoising. Additionally, Localized Noise Initialization further improves text accuracy and region alignment without increasing computational cost. Extensive experiments on single- and multi-sentence benchmarks show that DCText achieves the best text accuracy without compromising image quality while also delivering the lowest generation latency.
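
The idea of switching between two region-aware attention masks during denoising can be illustrated with a small NumPy sketch. The abstract does not specify how the masks are constructed, so everything below is an assumption made for illustration: the grouping rule (each region's prompt tokens and image tokens form one attention group, with group 0 for the global prompt and background), the rule that the second mask additionally opens all image-to-image attention, the switch step, and the hypothetical helper names `image_groups`, `build_mask`, and `mask_for_step`. Localized Noise Initialization is not covered.

```python
import numpy as np

def image_groups(boxes, grid_h, grid_w):
    """Assign each image token a group id: 0 = background, i = i-th region.

    boxes holds (top, left, bottom, right) in grid cells, end-exclusive.
    """
    groups = np.zeros((grid_h, grid_w), dtype=int)
    for i, (top, left, bottom, right) in enumerate(boxes, start=1):
        groups[top:bottom, left:right] = i
    return groups.ravel()

def build_mask(group_of_text, group_of_image, expand_context):
    """Boolean mask over the joint sequence [text tokens | image tokens].

    True means "may attend". Assumed rule: tokens attend only within their
    own group; with expand_context, every image token may also attend to
    every other image token.
    """
    groups = np.concatenate([group_of_text, group_of_image])
    mask = groups[:, None] == groups[None, :]
    if expand_context:
        n_text = len(group_of_text)
        mask[n_text:, n_text:] = True
    return mask

def mask_for_step(step, switch_step, group_of_text, group_of_image):
    """Assumed schedule: focused mask before switch_step, expanded mask after."""
    return build_mask(group_of_text, group_of_image,
                      expand_context=step >= switch_step)

# Toy setup: 2 global-prompt tokens, 2 tokens for region 1, 1 token for region 2.
group_of_text = np.array([0, 0, 1, 1, 2])
group_of_image = image_groups([(0, 0, 1, 2), (1, 1, 2, 3)], grid_h=2, grid_w=3)
early = mask_for_step(0, 3, group_of_text, group_of_image)
late = mask_for_step(3, 3, group_of_text, group_of_image)
```

In this toy setup the early mask keeps each region's image tokens tied to their own prompt segment, while the late mask lets image tokens from different regions see each other without opening any additional text-to-image paths.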

Paper Structure

This paper contains 41 sections, 11 equations, 18 figures, 16 tables, 1 algorithm.

Figures (18)

  • Figure 1: Given a global prompt and target regions (red boxes), DCText decomposes the target text (highlighted in red) and assigns it to regions, enabling accurate and coherent visual text generation, which the base Flux model (flux1-dev) struggles to handle reliably. The prompts below each image are abbreviated from the original global prompts (full prompts are in the appendix section "Abbreviated Prompts").
  • Figure 2:
  • Figure 3: Qualitative comparison on single sentence. Prompts, including the sentence to be rendered (highlighted in red), are shown below each column. All comparisons are generated from the same initial noise.
  • Figure 4: Qualitative comparison on multiple sentences. Comparison of generation results with varying numbers of sentences (2–5) in a single prompt. Sentences and corresponding regions are highlighted in red (only target texts are shown for the two prompts below; full prompts are in the appendix section "Abbreviated Prompts"). Our method consistently renders accurate text in the correct regions.
  • Figure 5: Human Evaluation. User preference on text accuracy, prompt–image alignment, and overall image quality. Green bars indicate cases where our method is preferred.
  • ...and 13 more figures