Seek for Incantations: Towards Accurate Text-to-Image Diffusion Synthesis through Prompt Engineering
Chang Yu, Junran Peng, Xiangyu Zhu, Zhaoxiang Zhang, Qi Tian, Zhen Lei
TL;DR
This paper addresses the challenge of generating accurate images from complex textual descriptions using diffusion models. It introduces a two-stage prompt-learning framework that learns input-specific prompts by leveraging quality and semantic guidance derived from pre-trained diffusion models, without updating the diffusion networks themselves. The method uses a coarse-to-fine denoising setup to steer prompt optimization, and employs loss terms that align text and image semantics while encouraging prompt sparsity. Results show improved text–image alignment for both composable and relational texts, with interpretable cross-attention evidence supporting the effectiveness of the learned prompts. The work demonstrates the potential of prompting-based adaptation of large diffusion models to handle complex linguistic inputs with reduced manual intervention.
Abstract
Text-to-image synthesis with diffusion models has recently shown remarkable performance in generating high-quality images. Although these models perform well on simple texts, they may get confused when faced with complex texts that contain multiple objects or spatial relationships. One feasible way to obtain the desired images is to manually adjust the textual descriptions, e.g., by rephrasing the text or adding words, which is labor-intensive. In this paper, we propose a framework that learns proper textual descriptions for diffusion models through prompt learning. Using quality guidance and semantic guidance derived from the pre-trained diffusion model, our method effectively learns prompts that improve the match between the input text and the generated images. Extensive experiments and analyses validate the effectiveness of the proposed method.
