Prompt Recovery for Image Generation Models: A Comparative Study of Discrete Optimizers

Joshua Nathaniel Williams, Avi Schwarzschild, Yutong He, J. Zico Kolter

TL;DR

This work benchmarks discrete optimization methods for prompt inversion in image-generation models, comparing PEZ, Greedy Coordinate Gradients, AutoDAN, Random Search, PRISM, and BLIP-2 captioning. It reveals that CLIP-based objectives can be a poor proxy for final image fidelity, while a captioning-based inversion often yields more faithful images and human-friendly prompts. PRISM, which optimizes a distribution of prompts via in-context learning, and BLIP-2 captioning emerge as particularly effective, though the results are sensitive to model choices and evaluation metrics. The study provides a structured, multi-metric benchmark and highlights practical implications for prompt recovery and understanding the prompt-image mapping in diffusion-based generation systems.

Abstract

Recovering natural language prompts for image generation models solely from the generated images is a difficult discrete optimization problem. In this work, we present the first head-to-head comparison of recent discrete optimization techniques for the problem of prompt inversion. We evaluate Greedy Coordinate Gradients (GCG), PEZ, Random Search, AutoDAN, and BLIP2's image captioner across various evaluation metrics related to the quality of inverted prompts and the quality of the images generated by the inverted prompts. We find that the CLIP similarity between the inverted prompts and the ground truth image is a poor proxy for the similarity between the ground truth image and the image generated by the inverted prompts. While the discrete optimizers effectively optimize their objectives, simply using responses from a well-trained captioner often leads to generated images that more closely resemble those produced by the original prompts.
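
To make the "discrete optimization" framing concrete, the following is a minimal sketch of a generic random-search loop over token sequences. It is not the paper's implementation: the function name `random_search`, the toy vocabulary, and the `toy_score` objective (which counts matches against a hidden target and stands in for a CLIP similarity score) are all assumptions made for illustration.

```python
import random

def random_search(score, vocab, length, steps, seed=0):
    """Hill-climbing random search: mutate one token at a time and
    keep the candidate only if it strictly improves the score."""
    rng = random.Random(seed)
    best = [rng.choice(vocab) for _ in range(length)]
    best_score = score(best)
    for _ in range(steps):
        candidate = list(best)
        # Replace a single randomly chosen position with a random token.
        candidate[rng.randrange(length)] = rng.choice(vocab)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score

# Hypothetical stand-in objective: number of positions that match a
# hidden target prompt. A real setup would score the candidate prompt
# against the target image with an image-text similarity model.
target = ["a", "red", "car"]
vocab = ["a", "the", "red", "blue", "car", "dog"]

def toy_score(tokens):
    return sum(t == g for t, g in zip(tokens, target))

best, best_score = random_search(toy_score, vocab, length=3, steps=2000)
```

The toy objective is separable across positions, so this search recovers the target easily. The paper's finding is about the harder case, where a high objective value does not guarantee that the regenerated image resembles the original.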

Paper Structure (18 sections, 3 equations, 3 figures, 1 table)

Figures (3)

  • Figure 1: Comparison between images generated by inverted prompts and images generated by the original prompts.
  • Figure 2: CLIP Similarity between the inverted prompt and images generated by the original prompt. This CLIP Similarity is the objective that each optimizer is maximizing.
  • Figure 3: Cosine Similarity between text embeddings for the original and inverted prompts. Based on the metric used by kaggle-image-to-prompts.
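
The similarity scores named in the captions above compare embedding vectors. As a minimal sketch, assuming the usual definition of cosine similarity (dot product divided by the product of the vector norms), the computation looks like this. The three-dimensional vectors are hypothetical placeholders; real embeddings would come from a text or image encoder.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Hypothetical embeddings standing in for real encoder outputs.
original_emb = [1.0, 0.0, 0.0]
inverted_emb = [1.0, 1.0, 0.0]
score = cosine_similarity(original_emb, inverted_emb)
```

The score depends only on the direction of the vectors, not their length, so scaling either embedding leaves it unchanged.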