
TextCenGen: Attention-Guided Text-Centric Background Adaptation for Text-to-Image Generation

Tianyi Liang, Jiangqi Liu, Yifei Huang, Shiqi Jiang, Jianshen Shi, Changbo Wang, Chenhui Li

TL;DR

TextCenGen tackles the challenge of generating backgrounds that gracefully accommodate text in text-to-image generation. It introduces a training-free pipeline that uses cross-attention maps to relocate conflicting objects away from a planned blank region, combined with force-directed guidance and a spatial-excluding constraint to maintain background smoothness and semantic fidelity. The approach balances text-area readability against prompt alignment. It is validated on a 27,000-image benchmark using saliency overlap, CLIP score, and a novel VTCM metric, and it also receives favorable human and MLLM judgments. The result is a practical, plug-and-play method for producing text-friendly images without additional training, with potential applications in graphic design and UI imagery generation.

Abstract

Text-to-image (T2I) generation has made remarkable progress in producing high-quality images, but a fundamental challenge remains: creating backgrounds that naturally accommodate text placement without compromising image quality. This capability is non-trivial for real-world applications like graphic design, where a clear visual hierarchy between content and text is essential. Prior work has primarily focused on arranging layouts within existing static images, leaving the potential of T2I models for generating text-friendly backgrounds unexplored. We present TextCenGen, a training-free method that dynamically adapts the background in a planned blank region for text-friendly image generation. Instead of directly reducing attention in text areas, which degrades image quality, we relocate conflicting objects before background optimization. Our method analyzes cross-attention maps to identify conflicting objects that overlap with text regions and uses a force-directed graph approach to guide their relocation, followed by attention-excluding constraints to ensure smooth backgrounds. Our method is plug-and-play: it requires no additional training while balancing semantic fidelity and visual quality. Evaluated on our proposed text-friendly T2I benchmark of 27,000 images across four seed datasets, TextCenGen outperforms existing methods, achieving 23% lower saliency overlap in text regions while maintaining 98% of the semantic fidelity measured by CLIP score and our proposed Visual-Textual Concordance Metric (VTCM).
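
The conflict-identification step described in the abstract can be illustrated with a minimal sketch. It is not the authors' detector: it assumes each prompt token already has a 2D cross-attention map (a nested list of non-negative floats), that the blank region is an axis-aligned rectangle, and that a token "conflicts" when the share of its attention inside the region exceeds a threshold. The function names, the rectangle convention, and the `threshold` value of 0.3 are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical convention: (top, left, bottom, right), bottom/right exclusive.
Region = Tuple[int, int, int, int]
AttentionMap = List[List[float]]


def attention_mass_in_region(attn: AttentionMap, region: Region) -> float:
    """Return the fraction of a token's total attention that lies inside region."""
    top, left, bottom, right = region
    total = sum(sum(row) for row in attn)
    if total == 0:
        return 0.0  # a token with no attention cannot conflict
    inside = sum(sum(row[left:right]) for row in attn[top:bottom])
    return inside / total


def find_conflicting_tokens(
    attn_maps: Dict[str, AttentionMap],
    region: Region,
    threshold: float = 0.3,  # assumed cutoff, not a value from the paper
) -> List[str]:
    """List tokens whose attention share inside region exceeds threshold."""
    return [
        token
        for token, attn in attn_maps.items()
        if attention_mass_in_region(attn, region) > threshold
    ]


if __name__ == "__main__":
    maps = {
        "mushroom": [[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]],
        "forest": [[1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 1]],
    }
    # The planned blank region covers the central 2x2 cells.
    print(find_conflicting_tokens(maps, (1, 1, 3, 3)))
```

In this toy input, all of the "mushroom" attention sits inside the central region while none of the "forest" attention does, so only "mushroom" would be a candidate for relocation.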

Paper Structure

This paper contains 36 sections, 7 equations, 13 figures, and 10 tables.

Figures (13)

  • Figure 1: TextCenGen is a training-free method designed to generate text-friendly images. By using a simple text prompt and a planned blank region as inputs, TextCenGen creates images that satisfy the prompt and provide sufficient blank space in the target region. For example, the text-friendly T2I approach helps users customize their favored text-friendly wallpapers for mobile devices with any T2I model, avoiding visual confusion caused by the main objects overlapping with UI components.
  • Figure 2: In our approach, the model receives two inputs: a blank region ($R$), shown as the red-dotted area, and a text prompt. The prompt is then used concurrently in a Text-to-Image (T2I) model to generate both an original image and a result image. During each step of the diffusion model's denoising process, the cross-attention map from the U-Net associated with the original image is used to direct the denoising of the result image in the form of a loss function. Throughout this procedure, a conflict detector identifies objects that could potentially conflict with $R$. To mitigate such conflicts, a force-directed graph method is applied to spatially repel these objects, ensuring that the area reserved for the text prompt remains unoccupied. To further enhance the smoothness of the attention mechanism, a spatial excluding cross-attention constraint is integrated into the cross-attention map.
  • Figure 3: Illustration of four set relationships and their associated forces. The Repulsive Force separates the object and text centroids when the two regions intersect (a1) and when the object lies inside the text region (a2). The Margin Force (b) and Warping Force (c) prevent objects from overstepping the boundary. Text within an object region (a4) requires cooperation between the forces and the attention constraint. Separation (a3) requires no processing.
  • Figure 4: Comparison results. Each column showcases six prompts across three datasets, with the final column depicting the saliency map of the result image generated from the mushroom prompt. The red-dotted area denotes the planned blank region. Note that some methods fail to follow the orange-highlighted words in the prompt, leading to semantic loss.
  • Figure 5: Performance trade-offs between different metrics. The dashed lines represent iso-utility curves, where points on the same curve indicate equivalent trade-off levels. Our method achieves a better balance between background smoothness and semantic fidelity. Higher utility curves (green) represent better overall performance.
  • ...and 8 more figures
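
The repulsion idea in the captions for Figures 2 and 3 can be sketched in a few lines. This is an illustrative stand-in rather than the paper's force formulation: it assumes an object's position is the attention-weighted centroid of its cross-attention map, and that the push away from the blank region's center has a magnitude of `strength / distance`. The inverse-distance law, the `strength` parameter, and both function names are assumptions; the Margin Force and Warping Force are not modeled here.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) in attention-map cell coordinates


def attention_centroid(attn: List[List[float]]) -> Point:
    """Return the attention-weighted centroid (x, y) of a 2D attention map."""
    total = sum(sum(row) for row in attn)
    if total == 0:
        raise ValueError("attention map has no mass, so it has no centroid")
    cx = sum(x * v for row in attn for x, v in enumerate(row)) / total
    cy = sum(y * sum(row) for y, row in enumerate(attn)) / total
    return (cx, cy)


def repulsive_displacement(
    obj_centroid: Point,
    region_center: Point,
    strength: float = 1.0,  # assumed scale factor, not from the paper
    eps: float = 1e-6,
) -> Point:
    """Return a displacement pushing the object away from the region center.

    The magnitude is strength / distance (an assumed force law), and the
    direction points from the region center toward the object centroid.
    """
    dx = obj_centroid[0] - region_center[0]
    dy = obj_centroid[1] - region_center[1]
    dist = math.hypot(dx, dy)
    if dist < eps:
        # Coincident centroids give no direction; push along +x arbitrarily.
        return (strength, 0.0)
    magnitude = strength / dist
    return (magnitude * dx / dist, magnitude * dy / dist)


if __name__ == "__main__":
    attn = [[0.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
    centroid = attention_centroid(attn)
    # Blank region centered at the top-left cell of the map.
    print(centroid, repulsive_displacement(centroid, (0.0, 0.0)))
```

In a guidance loop of the kind the Figure 2 caption describes, a displacement like this would define a target location for the object's attention, with a loss penalizing the distance between the current and target centroids at each denoising step.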