PEO: Training-Free Aesthetic Quality Enhancement in Pre-Trained Text-to-Image Diffusion Models with Prompt Embedding Optimization
Hovhannes Margaryan, Bo Wan, Tinne Tuytelaars
TL;DR
The paper addresses the challenge of producing high-aesthetic images from simple prompts in pre-trained text-to-image diffusion models. It introduces Prompt Embedding Optimization (PEO), a training-free, backbone-agnostic method that optimizes the prompt embedding $\theta$ by maximizing a tripartite objective $\text{L}_{PEO} = \boldsymbol{\rho}_1 \text{L}_1 + \boldsymbol{\rho}_2 \text{L}_2 + \boldsymbol{\rho}_3 \text{L}_{PPT}$, where $\text{L}_1$ uses LAION-AesPredv2 for visual quality, $\text{L}_2$ enforces text-image alignment via CLIP features, and $\text{L}_{PPT}$ is a Prompt Preservation Term keeping $\theta$ close to the initial embedding $\theta_{init} = \text{E}_T(P)$. The method updates the embedding with gradient-based optimization (e.g., Adam) while keeping the diffusion backbone fixed and integrating with classifier-free guidance. Experiments on SD-v1-5 and SDXL Turbo show that PEO yields superior or comparable aesthetic quality to state-of-the-art baselines and human preferences favor PEO, while preserving alignment with the original prompt. These results imply a practical, training-free route to elevate image quality from simple prompts without costly fine-tuning or complex prompt engineering.
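To make the structure of the weighted objective concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes each of the three terms can be written as a loss to be minimized (a sign convention chosen here for simplicity) and replaces the aesthetic and alignment scorers with hypothetical quadratic stand-ins, so no diffusion backbone, LAION-AesPredv2, or CLIP model is involved. The names `peo_loss_and_grad`, `adam_optimize`, `aes_target`, and `align_target` are illustrative assumptions. Only the embedding is updated, using a hand-written Adam step.

```python
import math

def peo_loss_and_grad(theta, theta_init, aes_target, align_target, rho):
    """Toy weighted sum rho1*L1 + rho2*L2 + rho3*L_PPT and its gradient.

    ASSUMPTION: L1 and L2 are quadratic stand-ins for the aesthetic and
    text-image alignment terms; the real terms require a diffusion model
    and learned scorers. L_PPT is a squared distance to the initial embedding.
    """
    rho1, rho2, rho3 = rho
    loss, grad = 0.0, []
    for t, t0, a, c in zip(theta, theta_init, aes_target, align_target):
        l1 = (t - a) ** 2       # stand-in aesthetic term
        l2 = (t - c) ** 2       # stand-in alignment term
        l_ppt = (t - t0) ** 2   # prompt preservation: stay near theta_init
        loss += rho1 * l1 + rho2 * l2 + rho3 * l_ppt
        grad.append(2.0 * (rho1 * (t - a) + rho2 * (t - c) + rho3 * (t - t0)))
    return loss, grad

def adam_optimize(theta_init, aes_target, align_target, rho,
                  steps=500, lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    """Run Adam on the embedding only; nothing else is trainable here."""
    theta = list(theta_init)
    m = [0.0] * len(theta)  # first-moment estimates
    v = [0.0] * len(theta)  # second-moment estimates
    history = []
    for k in range(1, steps + 1):
        loss, grad = peo_loss_and_grad(theta, theta_init, aes_target,
                                       align_target, rho)
        history.append(loss)
        for i, g in enumerate(grad):
            m[i] = b1 * m[i] + (1.0 - b1) * g
            v[i] = b2 * v[i] + (1.0 - b2) * g * g
            m_hat = m[i] / (1.0 - b1 ** k)  # bias correction
            v_hat = v[i] / (1.0 - b2 ** k)
            theta[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, history

# Hypothetical 2-dimensional "embedding" for illustration only.
theta, history = adam_optimize([0.0, 0.0], [1.0, 2.0], [1.0, 2.0],
                               rho=(1.0, 1.0, 1.0))
print(theta, history[0], history[-1])
```

In this toy setting, raising the third weight pulls the optimized embedding back toward its initial value, which mirrors the role the Prompt Preservation Term plays in keeping the result close to the original prompt.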
Abstract
This paper introduces a novel approach to aesthetic quality improvement in pre-trained text-to-image diffusion models when given a simple prompt. Our method, dubbed Prompt Embedding Optimization (PEO), leverages a pre-trained text-to-image diffusion model as a backbone and optimizes the text embedding of a given simple and uncurated prompt to enhance the visual quality of the generated image. We achieve this with a tripartite objective function that improves the aesthetic fidelity of the generated image, ensures adherence to the optimized text embedding, and keeps divergence from the initial prompt minimal. The latter is accomplished through a prompt preservation term. Additionally, PEO is training-free and backbone-independent. Quantitative and qualitative evaluations confirm the effectiveness of the proposed method, which matches or exceeds the performance of state-of-the-art text-to-image and prompt adaptation methods.
