TextCenGen: Attention-Guided Text-Centric Background Adaptation for Text-to-Image Generation
Tianyi Liang, Jiangqi Liu, Yifei Huang, Shiqi Jiang, Jianshen Shi, Changbo Wang, Chenhui Li
TL;DR
TextCenGen tackles the challenge of generating backgrounds that leave room for text in text-to-image generation. It introduces a training-free pipeline that uses cross-attention maps to relocate conflicting objects away from a planned blank region, combining force-directed guidance with a spatial excluding constraint to keep the background smooth and semantically faithful. The approach yields a strong balance between text-area readability and prompt alignment. It is validated on a large, diverse benchmark using saliency, CLIP fidelity, and a novel Visual-Textual Concordance Metric (VTCM), and it also receives favorable human and MLLM judgments. This work offers a practical, plug-and-play method for producing text-friendly images without additional training, with potential application in graphic design and UI imagery generation.
Abstract
Text-to-image (T2I) generation has made remarkable progress in producing high-quality images, but a fundamental challenge remains: creating backgrounds that naturally accommodate text placement without compromising image quality. This capability is non-trivial for real-world applications like graphic design, where a clear visual hierarchy between content and text is essential. Prior work has primarily focused on arranging layouts within existing static images, leaving unexplored the potential of T2I models for generating text-friendly backgrounds. We present TextCenGen, a training-free method for text-friendly image generation that dynamically adapts the background in the blank region reserved for text. Instead of directly reducing attention in text areas, which degrades image quality, we relocate conflicting objects before background optimization. Our method analyzes cross-attention maps to identify conflicting objects that overlap with text regions and uses a force-directed graph approach to guide their relocation, followed by attention-excluding constraints to ensure smooth backgrounds. Our method is plug-and-play, requiring no additional training while balancing semantic fidelity and visual quality. Evaluated on our proposed text-friendly T2I benchmark of 27,000 images across four seed datasets, TextCenGen outperforms existing methods by achieving 23% lower saliency overlap in text regions while maintaining 98% of the semantic fidelity measured by CLIP score and our proposed Visual-Textual Concordance Metric (VTCM).
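
To make the conflict-detection idea concrete, the sketch below shows one way to flag tokens whose cross-attention overlaps a reserved text region and to derive a direction that pushes them away from it. It is an illustration under stated assumptions, not the authors' implementation: the function names, the box format (top, left, bottom, right, end-exclusive), the 0.3 threshold, and the use of an attention-weighted centroid with a unit repulsion vector are all hypothetical choices, and the abstract does not specify the actual force formulation.

```python
from math import hypot

def region_mass_fraction(attn, box):
    """Fraction of a token's attention mass that falls inside the box.

    attn: 2D list of non-negative floats (one cross-attention map).
    box:  (top, left, bottom, right), end-exclusive (assumed format).
    """
    top, left, bottom, right = box
    total = sum(sum(row) for row in attn)
    if total == 0:
        return 0.0
    inside = sum(sum(row[left:right]) for row in attn[top:bottom])
    return inside / total

def centroid(attn):
    """Attention-weighted centroid as (row, col)."""
    total = sum(sum(row) for row in attn)
    if total == 0:
        raise ValueError("attention map has no mass")
    r = sum(i * v for i, row in enumerate(attn) for v in row) / total
    c = sum(j * v for row in attn for j, v in enumerate(row)) / total
    return r, c

def repulsive_direction(attn, box):
    """Unit vector from the box center toward the token's centroid."""
    top, left, bottom, right = box
    center_r = (top + bottom - 1) / 2
    center_c = (left + right - 1) / 2
    r, c = centroid(attn)
    dr, dc = r - center_r, c - center_c
    norm = hypot(dr, dc)
    if norm == 0:
        return (0.0, 0.0)  # centroid sits on the box center: no preferred direction
    return (dr / norm, dc / norm)

def find_conflicts(attn_maps, box, threshold=0.3):
    """Map each conflicting token to a direction that moves it off the box.

    A token counts as conflicting when more than `threshold` of its
    attention mass lies inside the box (hypothetical criterion).
    """
    return {
        token: repulsive_direction(attn, box)
        for token, attn in attn_maps.items()
        if region_mass_fraction(attn, box) > threshold
    }

# Toy 4x4 attention maps; the text region is the top-left 2x2 block.
toy_maps = {
    "cat": [[0, 0, 0, 0], [0, 0.6, 0.4, 0], [0, 0, 0, 0], [0, 0, 0, 0]],
    "sky": [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 1.0]],
}
conflicts = find_conflicts(toy_maps, box=(0, 0, 2, 2))
```

In this toy input only "cat" is flagged, since most of its attention lies inside the region, and its direction points down and to the right, away from the reserved block. In a diffusion pipeline such a direction would be one input to the guidance applied during denoising; how that guidance is applied is outside what this sketch covers.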
