Reverse Stable Diffusion: What prompt was used to generate this image?

Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, Mubarak Shah

TL;DR

An interesting discovery: training a diffusion model on the prompt generation task can make the model generate images that are much better aligned with the input prompts, when the model is directly reused for text-to-image generation.

Abstract

Text-to-image diffusion models have recently attracted the interest of many researchers, and inverting the diffusion process can play an important role in better understanding the generative process and how to engineer prompts in order to obtain the desired images. To this end, we study the task of predicting the prompt embedding given an image generated by a generative diffusion model. We consider a series of white-box and black-box models (with and without access to the weights of the diffusion network) to deal with the proposed task. We propose a novel learning framework comprising a joint prompt regression and multi-label vocabulary classification objective that generates improved prompts. To further improve our method, we employ a curriculum learning procedure that promotes the learning of image-prompt pairs with lower labeling noise (i.e. that are better aligned). We conduct experiments on the DiffusionDB data set, predicting text prompts from images generated by Stable Diffusion. In addition, we make an interesting discovery: training a diffusion model on the prompt generation task can make the model generate images that are much better aligned with the input prompts, when the model is directly reused for text-to-image generation. Our code is publicly available for download at https://github.com/CroitoruAlin/Reverse-Stable-Diffusion.
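
The joint objective and the curriculum idea described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the abstract: the use of a cosine distance for the regression term, binary cross-entropy for the multi-label term, the weighting factor `lam`, and the function names `cosine_loss`, `bce_loss`, `joint_loss` and `curriculum_subset`. The authors' actual implementation is in their repository.

```python
import math

def cosine_loss(pred, target):
    """Regression term (assumed): 1 - cosine similarity of two nonzero vectors."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(t * t for t in target))
    return 1.0 - dot / norm

def bce_loss(probs, labels, eps=1e-7):
    """Multi-label term (assumed): mean binary cross-entropy over vocabulary words."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)

def joint_loss(pred_emb, target_emb, word_probs, word_labels, lam=1.0):
    """Joint objective: embedding regression plus weighted vocabulary classification.

    `lam` is a hypothetical weighting hyperparameter.
    """
    return cosine_loss(pred_emb, target_emb) + lam * bce_loss(word_probs, word_labels)

def curriculum_subset(samples, alignment_scores, fraction):
    """Keep the best-aligned fraction of samples (higher score = less labeling noise).

    Growing `fraction` over training stages gives a simple easy-to-hard schedule;
    how alignment is scored is left open here.
    """
    ranked = sorted(zip(alignment_scores, range(len(samples))), reverse=True)
    keep = max(1, int(round(fraction * len(samples))))
    return [samples[i] for _, i in ranked[:keep]]

# Usage: a perfectly predicted embedding with uninformative word probabilities.
loss = joint_loss([1.0, 0.0], [1.0, 0.0], [0.5, 0.5], [1, 0], lam=1.0)
easy_half = curriculum_subset(["a", "b", "c", "d"], [0.9, 0.1, 0.5, 0.7], 0.5)
```

In this sketch the classification term acts as an auxiliary signal: the regression term alone would still define the task, and `lam` controls how strongly the vocabulary labels shape the shared features.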

Paper Structure

This paper contains 17 sections, 7 equations, 11 figures, and 11 tables.

Figures (11)

  • Figure 1: Our learning framework for prompt embedding estimation, along with its vocabulary classification task. We transform the input prompts via a sentence transformer for the embedding estimation task and we use a vocabulary of the most common words to create the target vectors for the classification task. Best viewed in color.
  • Figure 2: Comparison between the image captioning and prompt prediction tasks. The samples from the top row are taken from MS COCO, while the samples from the bottom row are taken from DiffusionDB. In prompt prediction, the model must generate a single, very detailed text prompt. In contrast, image captioning benchmarks typically have several alternative ground-truth captions for each image (Chen et al., 2015), and models are evaluated against the best matching ground-truth caption. Moreover, image captions are generally shorter, referring only to the foreground objects and their interactions. Best viewed in color.
  • Figure 3: Configurations for the classification and embedding prediction heads. In the first configuration, the heads are separate and are fed the same features. In the second configuration, the output of the classification head is concatenated with the image encoding to form the input of the embedding prediction head. In the third configuration, the classification is carried out using the predicted embedding as input.
  • Figure 4: Examples of captions for generated images. We compare the prompts returned by a fine-tuned vanilla BLIP with those of an enhanced version of BLIP based on multi-label classification (MLC) and curriculum learning (CL). Best viewed in color.
  • Figure 5: Samples generated by original and modified Stable Diffusion models. The images in the middle row are synthesized by the original U-Net. The images in the bottom row are generated by replacing (from the second diffusion step onward) the original U-Net encoder with our U-Net encoder employed for prompt embedding prediction. Notable differences include the presence of a horse and rabbits in the bottom images, while they are absent from the images of the original model (first and second columns). Our model also corrects errors such as the orientation of a person (last column) and the height of characters like Bilbo, recognizing him as a hobbit and adjusting his height accordingly. Best viewed in color.
  • ...and 6 more figures
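
The target construction in the Figure 1 caption and the three head configurations in the Figure 3 caption can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: whitespace tokenization, lowercasing, and the names `build_vocabulary`, `multi_hot`, `heads_separate`, `heads_class_then_embed` and `heads_embed_then_class` are all hypothetical, and the heads are passed in as arbitrary callables operating on plain Python lists. Whether the second configuration concatenates raw logits or probabilities is not stated in the caption, so the sketch simply concatenates whatever the classification head returns.

```python
from collections import Counter

def build_vocabulary(prompts, size):
    """Return the `size` most common words across all prompts (assumed tokenization)."""
    counts = Counter(word for p in prompts for word in p.lower().split())
    return [word for word, _ in counts.most_common(size)]

def multi_hot(prompt, vocabulary):
    """Classification target: 1 for each vocabulary word present in the prompt."""
    words = set(prompt.lower().split())
    return [1 if word in words else 0 for word in vocabulary]

def heads_separate(features, embed_head, class_head):
    """Configuration 1: both heads receive the same image features."""
    return embed_head(features), class_head(features)

def heads_class_then_embed(features, embed_head, class_head):
    """Configuration 2: classifier output is concatenated with the image encoding."""
    class_out = class_head(features)
    return embed_head(features + class_out), class_out  # list concatenation

def heads_embed_then_class(features, embed_head, class_head):
    """Configuration 3: classification uses the predicted embedding as input."""
    embedding = embed_head(features)
    return embedding, class_head(embedding)

# Usage with toy prompts and stand-in heads (placeholders, not real networks).
vocab = build_vocabulary(["red cat", "red dog", "red cat"], 2)
target = multi_hot("a red dog", vocab)
toy_embed = lambda x: [sum(x)]      # stands in for the embedding prediction head
toy_class = lambda x: [len(x)]      # stands in for the classification head
emb, cls = heads_class_then_embed([1.0, 2.0], toy_embed, toy_class)
```

The three wiring functions differ only in which tensor each head sees, which is the distinction the Figure 3 caption draws.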