PEO: Training-Free Aesthetic Quality Enhancement in Pre-Trained Text-to-Image Diffusion Models with Prompt Embedding Optimization

Hovhannes Margaryan, Bo Wan, Tinne Tuytelaars

TL;DR

The paper addresses the challenge of producing highly aesthetic images from simple prompts in pre-trained text-to-image diffusion models. It introduces Prompt Embedding Optimization (PEO), a training-free, backbone-agnostic method that optimizes the prompt embedding $\theta$ by maximizing a tripartite objective $\mathcal{L}_{PEO} = \boldsymbol{\rho}_1 \mathcal{L}_1 + \boldsymbol{\rho}_2 \mathcal{L}_2 + \boldsymbol{\rho}_3 \mathcal{L}_{PPT}$, where $\mathcal{L}_1$ uses LAION-AesPredv2 to score visual quality, $\mathcal{L}_2$ enforces text-image alignment via CLIP features, and $\mathcal{L}_{PPT}$ is a Prompt Preservation Term that keeps $\theta$ close to the initial embedding $\theta_{init} = \text{E}_T(P)$, the text encoder's embedding of the prompt $P$. The method updates the embedding with gradient-based optimization (e.g., Adam) while keeping the diffusion backbone fixed, and it integrates with classifier-free guidance. Experiments on SD-v1-5 and SDXL Turbo show that PEO yields aesthetic quality superior or comparable to state-of-the-art baselines, and that human annotators prefer PEO, while alignment with the original prompt is preserved. These results suggest a practical, training-free route to better image quality from simple prompts without costly fine-tuning or complex prompt engineering.
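
Written as a display, the optimization summarized above can be sketched as follows. This is a simplified reading, not the paper's exact formulation: it assumes plain gradient ascent in place of Adam, and the step size $\eta$ and iteration index $k$ are notation introduced here for illustration.

```latex
\begin{aligned}
\theta^{(0)} &= \theta_{init} = \text{E}_T(P), \\
\mathcal{L}_{PEO}(\theta) &= \boldsymbol{\rho}_1 \mathcal{L}_1(\theta) + \boldsymbol{\rho}_2 \mathcal{L}_2(\theta) + \boldsymbol{\rho}_3 \mathcal{L}_{PPT}(\theta, \theta_{init}), \\
\theta^{(k+1)} &= \theta^{(k)} + \eta \, \nabla_{\theta} \mathcal{L}_{PEO}\big(\theta^{(k)}\big).
\end{aligned}
```

Only $\theta$ changes between iterations; the backbone, the aesthetic predictor, and the CLIP encoders stay frozen, which is what makes the method training-free.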

Abstract

This paper introduces a novel approach to aesthetic quality improvement in pre-trained text-to-image diffusion models when given a simple prompt. Our method, dubbed Prompt Embedding Optimization (PEO), leverages a pre-trained text-to-image diffusion model as a backbone and optimizes the text embedding of a given simple and uncurated prompt to enhance the visual quality of the generated image. We achieve this with a tripartite objective function that improves the aesthetic fidelity of the generated image, ensures adherence to the optimized text embedding, and keeps divergence from the initial prompt minimal. The last of these is accomplished through a prompt preservation term. Additionally, PEO is training-free and backbone-independent. Quantitative and qualitative evaluations confirm the effectiveness of the proposed method, which matches or exceeds the performance of state-of-the-art text-to-image and prompt adaptation methods.
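
To make the shape of the optimization concrete, here is a minimal, dependency-free Python sketch of maximizing a three-term objective over an embedding with Adam. Everything in it is a toy assumption rather than the authors' implementation: `toy_generate` stands in for the frozen diffusion backbone, `toy_aesthetic` for LAION-AesPredv2, cosine similarity for the CLIP alignment and prompt preservation terms, central finite differences for automatic differentiation, and the weights `rho` and all hyperparameters are arbitrary.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)

def toy_generate(theta):
    """Hypothetical stand-in for the frozen backbone: embedding -> image features."""
    return [math.tanh(x) for x in theta]

def toy_aesthetic(image):
    """Hypothetical stand-in for an aesthetic predictor (higher is better)."""
    pleasing = [0.9, -0.2, 0.5, 0.1]  # arbitrary "pleasing" feature vector
    return -sum((a - b) ** 2 for a, b in zip(image, pleasing))

def peo_objective(theta, theta_init, rho=(1.0, 0.5, 0.5)):
    """Weighted sum of aesthetic, alignment, and prompt preservation terms."""
    image = toy_generate(theta)
    l1 = toy_aesthetic(image)             # aesthetic quality of the image
    l2 = cosine(image, theta)             # image vs. optimized embedding
    l_ppt = cosine(theta, theta_init)     # optimized vs. initial embedding
    return rho[0] * l1 + rho[1] * l2 + rho[2] * l_ppt

def numerical_grad(f, theta, eps=1e-5):
    """Central finite differences; a real setup would use autograd instead."""
    grad = []
    for i in range(len(theta)):
        up, down = list(theta), list(theta)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad

def optimize_embedding(theta_init, steps=60, lr=0.05,
                       beta1=0.9, beta2=0.999, adam_eps=1e-8):
    """Adam-style gradient ascent on the embedding; nothing else is updated."""
    theta = list(theta_init)              # copy so the initial embedding is kept
    m = [0.0] * len(theta)
    v = [0.0] * len(theta)

    def objective(th):
        return peo_objective(th, theta_init)

    for k in range(1, steps + 1):
        grad = numerical_grad(objective, theta)
        for i, g in enumerate(grad):
            m[i] = beta1 * m[i] + (1 - beta1) * g
            v[i] = beta2 * v[i] + (1 - beta2) * g * g
            m_hat = m[i] / (1 - beta1 ** k)
            v_hat = v[i] / (1 - beta2 ** k)
            theta[i] += lr * m_hat / (math.sqrt(v_hat) + adam_eps)  # ascent step
    return theta

if __name__ == "__main__":
    theta0 = [0.1, 0.3, -0.2, 0.4]        # pretend this is the encoded prompt
    theta_opt = optimize_embedding(theta0)
    print("objective before:", peo_objective(theta0, theta0))
    print("objective after: ", peo_objective(theta_opt, theta0))
```

The preservation term is what stops the embedding from drifting to whatever maximizes the aesthetic score alone; lowering the third weight in `rho` lets the toy embedding move further from its starting point.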

Paper Structure

This paper contains 17 sections, 4 equations, 13 figures, 5 tables.

Figures (13)

  • Figure 1: Two images generated with SD-v1-5 (Rombach et al., CVPR 2022). (a) is generated with a simple and uncurated prompt, "photo of a girl", and lacks intricacies. (b) is generated with a carefully designed prompt, "1girl, 8k resolution, photorealistic masterpiece by Aaron Horkey and Jeremy Mann, intricately detailed fluid gouache painting by Jean Baptiste, professional photography, natural lighting, volumetric lighting, maximalist, 8k resolution, concept art, intricately detailed, complex, elegant, expansive, fantastical, cover", and is visually appealing and highly detailed. The prompt for (b) is taken from Fotor.
  • Figure 2: The optimization framework of the proposed PEO method. PEO optimizes the embedding of the given prompt using $\mathcal{L}_{PEO}$ as an objective function which takes into account the aesthetic quality of the generated image, the distance between the text embedding being optimized and features of the generated image in CLIP's space, and the similarity between the optimized and initial text embeddings. While the visualization is in pixel space, we work in latent space.
  • Figure 3: The proposed PEO optimization method in image space, with $t$ as the current step. At $t=10$, the aesthetic quality of the generated image and text-to-image relevance are improved compared to $t=0$. The bottom-right number is the LAION-AesPredv2 value.
  • Figure 4: Qualitative comparison of PEO and baselines with SD-v1-5 and SDXL Turbo as backbones. Our approach surpasses the baselines in visual aesthetic quality, exhibits improved details, and aligns better with the style and main subject indicated in the original prompt.
  • Figure 5: Results of the user study. The annotators favored PEO by at least 11.23% over SD-v1-5 and 9.85% over Promptist.
  • ...and 8 more figures