ARTIST: Improving the Generation of Text-rich Images with Disentangled Diffusion Models and Large Language Models
Jianyi Zhang, Yufan Zhou, Jiuxiang Gu, Curtis Wigington, Tong Yu, Yiran Chen, Tong Sun, Ruiyi Zhang
TL;DR
ARTIST addresses the challenge of rendering legible, well-integrated text in diffusion-generated images by introducing a disentangled, two-stage framework: a textual diffusion module that learns text structure and a visual diffusion module that learns appearance, guided by an LLM-based prompt understanding component. The method trains the text module on large synthetic word- and sentence-level datasets and then trains the visual module with injected features from the text module, using diffusion losses. Empirical results on MARIO-Eval and ARTIST-Eval demonstrate substantial gains in OCR accuracy, CLIP alignment, and layout fidelity, with additional improvements when leveraging LLM prompting. This approach automates keyword extraction and layout planning, enabling more accurate text rendering in posters, book covers, and other text-rich imagery, and offers a scalable path for future improvements in text-aware image synthesis.
Abstract
Diffusion models have demonstrated exceptional capabilities in generating a broad spectrum of visual content, yet their proficiency in rendering text is still limited: they often generate inaccurate characters or words that fail to blend well with the underlying image. To address these shortcomings, we introduce a novel framework, ARTIST, which incorporates a dedicated textual diffusion model that focuses specifically on learning text structures. First, we pretrain this textual model to capture the intricacies of text representation. We then finetune a visual diffusion model, enabling it to assimilate textual structure information from the pretrained textual model. This disentangled architecture design and training strategy significantly enhances the text rendering ability of diffusion models for text-rich image generation. Additionally, we leverage the capabilities of pretrained large language models to better interpret user intentions, contributing to improved generation quality. Empirical results on the MARIO-Eval benchmark underscore the effectiveness of the proposed method, showing an improvement of up to 15% in various metrics.
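The two-stage strategy described above (pretrain a textual diffusion model, then finetune a visual diffusion model that receives information from it) can be illustrated with a minimal sketch. It assumes the standard noise-prediction (DDPM-style) objective with a linear beta schedule; the exact losses, schedules, and feature-injection mechanism used by ARTIST are not specified in this section. All names here (`alpha_bar`, `q_sample`, `diffusion_loss`, and the `toy_*` functions) are hypothetical stand-ins, not the authors' implementation.

```python
import math
import random

NUM_STEPS = 1000  # assumed number of diffusion timesteps


def alpha_bar(t, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_s) for s = 0..t (assumed linear schedule)."""
    prod = 1.0
    for s in range(t + 1):
        beta = beta_start + (beta_end - beta_start) * s / (NUM_STEPS - 1)
        prod *= 1.0 - beta
    return prod


def q_sample(x0, t, noise):
    """Forward process: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * noise."""
    ab = alpha_bar(t)
    return [math.sqrt(ab) * x + math.sqrt(1.0 - ab) * n for x, n in zip(x0, noise)]


def diffusion_loss(denoiser, x0, t, rng, cond=None):
    """Mean squared error between the sampled noise and the denoiser's prediction."""
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = q_sample(x0, t, noise)
    pred = denoiser(xt, t, cond)
    return sum((p - n) ** 2 for p, n in zip(pred, noise)) / len(x0)


# Hypothetical stand-ins for the two networks; real models would be neural nets.
def toy_text_denoiser(xt, t, cond=None):
    return [0.0] * len(xt)  # untrained placeholder: always predicts zero noise


def toy_text_features(x_text):
    # Placeholder for features taken from the pretrained text module.
    return [2.0 * v - 1.0 for v in x_text]


def toy_visual_denoiser(xt, t, cond):
    # Placeholder: the visual module receives the text features as conditioning.
    return [0.1 * c for c in cond]


rng = random.Random(0)
text_mask = [0.0, 1.0, 1.0, 0.0]  # toy flattened text-structure sample
image = [0.2, -0.4, 0.9, 0.1]     # toy flattened image sample

# Stage 1: the text module is trained alone on text structure.
stage1_loss = diffusion_loss(toy_text_denoiser, text_mask, t=500, rng=rng)

# Stage 2: the visual module is trained with features from the text module.
features = toy_text_features(text_mask)
stage2_loss = diffusion_loss(toy_visual_denoiser, image, t=500, rng=rng, cond=features)
```

In a real training loop each loss would be averaged over batches and random timesteps and minimized by gradient descent; the sketch only shows which inputs each stage sees and how the same objective is reused across both stages.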
