Automated Black-box Prompt Engineering for Personalized Text-to-Image Generation

Yutong He, Alexander Robey, Naoki Murata, Yiding Jiang, Joshua Nathaniel Williams, George J. Pappas, Hamed Hassani, Yuki Mitsufuji, Ruslan Salakhutdinov, J. Zico Kolter

TL;DR

PRISM tackles the labor-intensive, model-specific nature of prompt engineering in text-to-image generation by learning a transferable, human-interpretable prompt distribution using Vision-Language Models as both prompt engineers and judges, with iterative, in-context refinement guided by a black-box T2I generator. The approach yields prompts that consistently transfer across open and closed T2I models and improves interpretability and safety compared with baselines like Textual Inversion, BLIP-2, CLIP-Interrogator, and PEZ. Through extensive experiments on personalized T2I tasks (DreamBooth) and art styles (WikiArt), plus direct image inversion, PRISM demonstrates robust performance gains, meaningful ablations, and practical applications such as prompt editing, multi-concept composition, and prompt distillation. The work highlights the value of integrating LLM-driven reasoning into image generation pipelines, offering cost-flexible, scalable, and safer automated prompt design for diverse T2I platforms.

Abstract

Prompt engineering is an effective but labor-intensive way to control text-to-image (T2I) generative models. Its time-intensive nature and complexity have spurred the development of algorithms for automated prompt generation. However, these methods often struggle with transferability across T2I models, require white-box access to the underlying model, or produce non-intuitive prompts. In this work, we introduce PRISM, an algorithm that automatically produces human-interpretable and transferable prompts that can effectively generate desired concepts given only black-box access to T2I models. Inspired by large language model (LLM) jailbreaking, PRISM leverages the in-context learning ability of LLMs to iteratively refine the candidate prompt distribution built upon the reference images. Our experiments demonstrate the versatility and effectiveness of PRISM in generating accurate prompts for objects, styles, and images across multiple T2I models, including Stable Diffusion, DALL-E, and Midjourney.
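The iterative, black-box refinement described above can be illustrated with a minimal sketch. It assumes a single refinement stream and uses toy stand-ins for the three components: a prompt proposer, a T2I generator, and a judge. All names here (`refine_prompt`, `toy_propose`, `toy_generate`, `toy_judge`, `VOCAB`) are hypothetical and are not taken from the paper; in particular, the toy "images" are plain strings and the toy judge is a word-overlap score, whereas PRISM uses VLMs for proposing and judging.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, float]]  # (prompt, score) pairs fed back to the proposer


def refine_prompt(
    reference: str,
    propose: Callable[[str, History], str],
    generate: Callable[[str], str],
    judge: Callable[[str, str], float],
    num_iters: int = 5,
) -> Tuple[str, float]:
    """Propose a prompt, generate from it, score the result, and repeat."""
    history: History = []  # in-context feedback accumulated across iterations
    best_prompt, best_score = "", float("-inf")
    for _ in range(num_iters):
        prompt = propose(reference, history)
        image = generate(prompt)  # black-box call: only the output is observed
        score = judge(reference, image)
        history.append((prompt, score))
        if score > best_score:
            best_prompt, best_score = prompt, score
    return best_prompt, best_score


# --- Toy stand-ins (assumptions for illustration only) ---
VOCAB = ["teapot", "red", "ceramic", "dog", "blue"]


def toy_propose(reference: str, history: History) -> str:
    """Extend the best-scoring prompt so far with one untried vocabulary word."""
    if not history:
        return VOCAB[0]
    best = max(history, key=lambda item: item[1])[0]
    tried = {prompt for prompt, _ in history}
    for word in VOCAB:
        candidate = f"{best} {word}"
        if word not in best.split() and candidate not in tried:
            return candidate
    return best


def toy_generate(prompt: str) -> str:
    """Toy 'image': just the prompt text itself."""
    return prompt


def toy_judge(reference: str, image: str) -> float:
    """Jaccard overlap between the word sets of the reference and the 'image'."""
    ref, img = set(reference.split()), set(image.split())
    return len(ref & img) / len(ref | img)


best_prompt, best_score = refine_prompt(
    "red ceramic teapot", toy_propose, toy_generate, toy_judge
)
print(best_prompt, best_score)
```

The loop never inspects the generator's internals, which is what makes the setup black-box: swapping `toy_generate` for a call to any T2I service would leave `refine_prompt` unchanged.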

Paper Structure

This paper contains 61 sections, 1 equation, 27 figures, 10 tables, 1 algorithm.

Figures (27)

  • Figure 1: Given a set of reference images, our method, PRISM, creates human-interpretable and accurate prompts for the desired concept that also transfer to both open-source and closed-source text-to-image models. $\bigoplus$ denotes prompt concatenation.
  • Figure 2: An illustration of PRISM. "System" indicates the system prompt setups for the VLMs.
  • Figure 3: Qualitative results for personalized T2I generation on the DreamBooth dataset.
  • Figure 3: Comparison with GPT-4V in both personalized T2I generation and direct image inversion experiments.
  • Figure 4: Qualitative results for personalized style T2I generation on the WikiArt dataset.
  • ...and 22 more figures