Seek for Incantations: Towards Accurate Text-to-Image Diffusion Synthesis through Prompt Engineering

Chang Yu, Junran Peng, Xiangyu Zhu, Zhaoxiang Zhang, Qi Tian, Zhen Lei

TL;DR

This paper addresses the challenge of generating accurate images from complex textual descriptions using diffusion models. It introduces a two-stage prompt-learning framework that learns input-specific prompts by leveraging quality and semantic guidance derived from pre-trained diffusion models, without updating the diffusion networks themselves. The method uses a coarse-to-fine denoising setup to steer prompt optimization, and employs loss terms that align text and image semantics while encouraging prompt sparsity. Results show improved text–image alignment for both composable and relational texts, with interpretable cross-attention evidence supporting the effectiveness of the learned prompts. The work demonstrates the potential of prompting-based adaptation of large diffusion models to handle complex linguistic inputs with reduced manual intervention.

Abstract

Text-to-image synthesis with diffusion models has recently shown remarkable performance in generating high-quality images. Although these models perform well for simple texts, they may get confused when faced with complex texts that contain multiple objects or spatial relationships. To obtain the desired images, a feasible approach is to manually adjust the textual descriptions, i.e., narrating the texts or adding some words, which is labor-intensive. In this paper, we propose a framework to learn the proper textual descriptions for diffusion models through prompt learning. By utilizing the quality guidance and the semantic guidance derived from the pre-trained diffusion model, our method can effectively learn prompts that improve the match between the input text and the generated images. Extensive experiments and analyses validate the effectiveness of the proposed method.

Paper Structure (17 sections, 10 equations, 11 figures)

Figures (11)

  • Figure 1: The text-to-image generation results of LDMs (Rombach et al., 2022) with the same noise and the same random seed. It can be seen that the model performs well for short textual descriptions but degrades when the text becomes complex.
  • Figure 2: The generation results of LDMs (Rombach et al., 2022) with different textual descriptions (same initial noise and same random seed). After carefully narrating the text or incorporating additional prompts, the model successfully synthesizes the images containing "an elephant and a bag".
  • Figure 3: The overall framework of our method. The core of the method is to learn proper input-specific prompts for each textual input so that the generated images match the given texts well. First, the randomly sampled noise $x_0$ and the original text are sent to the diffusion model twice, once with $T_{coarse}$ and once with $T_{fine}$ denoising steps. Afterward, the text concatenated with the prompts is re-sent to the diffusion model with $x_0$ to generate the final outputs. During training, the difference between the coarsely-sampled images and the finely-sampled images is used as quality guidance to constrain the learning of prompts. In addition, the words that have lower similarity with the generated images are masked as semantic guidance to further enhance prompt learning. After training with consistency and sparsity constraints for a few iterations, the proposed method can effectively seek out the prompts that improve the text-to-image synthesis.
  • Figure 4: The influence of the sampling steps on the results of diffusion models. It can be seen that the results under more sampling steps ($T_{fine}$) are of better 'quality' than the ones under fewer sampling steps ($T_{coarse}$). The 'quality' includes how the text matches the image and whether the image contains distortion or artifacts.
  • Figure 5: Illustration of the Quality Guidance. It incorporates the direction from the coarsely-sampled image to the finely-sampled image to guide the learning of the prompts.
  • ...and 6 more figures
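
The two guidance signals described in the captions for Figures 3 and 5 can be illustrated with a small toy sketch. It is not the authors' implementation: the function names, the cosine-based alignment loss, and the "mask the k lowest-similarity words" rule are all assumptions chosen for illustration, and images are represented as flat lists of numbers rather than real diffusion outputs.

```python
import math

def quality_direction(x_coarse, x_fine):
    """Direction from the coarsely-sampled image to the finely-sampled image."""
    return [f - c for c, f in zip(x_coarse, x_fine)]

def direction_alignment_loss(x_prompted, x_coarse, x_fine, eps=1e-8):
    """Hypothetical quality-guidance loss: 1 - cosine similarity between the
    coarse-to-fine direction and the coarse-to-prompted direction.
    The cosine form is an assumption, not taken from the paper."""
    d_ref = quality_direction(x_coarse, x_fine)
    d_out = quality_direction(x_coarse, x_prompted)
    dot = sum(a * b for a, b in zip(d_ref, d_out))
    norm = math.sqrt(sum(a * a for a in d_ref)) * math.sqrt(sum(b * b for b in d_out))
    return 1.0 - dot / (norm + eps)

def low_similarity_mask(word_image_sim, k):
    """Hypothetical semantic-guidance mask: True for the k words whose
    similarity to the generated image is lowest."""
    order = sorted(range(len(word_image_sim)), key=lambda i: word_image_sim[i])
    lowest = set(order[:k])
    return [i in lowest for i in range(len(word_image_sim))]

# Toy "images" (flattened); the values are made up for illustration only.
x_coarse = [0.0, 0.0, 0.0, 0.0]
x_fine = [1.0, 1.0, 1.0, 1.0]

# A prompted output that moves in the coarse-to-fine direction gives a loss near 0;
# one that moves in the opposite direction gives a loss near 2.
loss_aligned = direction_alignment_loss([2.0, 2.0, 2.0, 2.0], x_coarse, x_fine)
loss_opposed = direction_alignment_loss([-1.0, -1.0, -1.0, -1.0], x_coarse, x_fine)

# Made-up per-word similarity scores; the two lowest-scoring words are masked.
mask = low_similarity_mask([0.9, 0.1, 0.5, 0.3], k=2)
```

In an actual setup, the images would come from the frozen diffusion model run with $T_{coarse}$ and $T_{fine}$ steps, and the similarity scores from a text-image matching model; only the prompt embeddings would receive gradients.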