IPGO: Indirect Prompt Gradient Optimization for Parameter-Efficient Prompt-level Fine-Tuning on Text-to-Image Models
Jianping Ye, Michel Wedel, Kunpeng Zhang
TL;DR
IPGO introduces a parameter-efficient, reward-guided prompt-level fine-tuning strategy for diffusion-based text-to-image generation by injecting trainable prefix and suffix embeddings into prompts. The embeddings are optimized under orthonormality, range, and conformity constraints, and are parameterized with low-rank rotated bases to narrow the search space. IPGO+ adds a parameter-free cross-attention layer to reinforce interactions between inserted embeddings and the original prompt for prompt-batch training. Across COCO, DiffusionDB, and Pick-a-Pic prompts and three reward models, IPGO(+) consistently outperforms state-of-the-art baselines with substantially fewer trainable parameters, demonstrating strong generalization and potential for efficient alignment with aesthetics, semantics, and human preferences.
Abstract
Text-to-image diffusion models excel at generating images from text prompts but often exhibit suboptimal alignment with content semantics, aesthetics, and human preferences. To address these limitations, this study proposes a novel parameter-efficient framework, Indirect Prompt Gradient Optimization (IPGO), for prompt-level diffusion model fine-tuning. IPGO enhances prompt embeddings by injecting continuously differentiable embeddings at the beginning and end of the prompt embeddings, leveraging low-rank structures with the flexibility and nonlinearity of rotations. This approach enables gradient-based optimization of the injected embeddings under range, orthonormality, and conformity constraints, which narrow the search space, promote a stable solution, and keep the injected embeddings aligned with the embeddings of the original prompt. Its extension, IPGO+, adds a parameter-free cross-attention mechanism on the prompt embedding to enforce dependencies between the original prompt and the inserted embeddings. We conduct extensive evaluations through prompt-wise (IPGO) and prompt-batch (IPGO+) training using three reward models, covering image aesthetics, image-text alignment, and human preferences, across three datasets of varying complexity. The results show that IPGO consistently outperforms SOTA baselines, including Stable Diffusion v1.5 with raw prompts, text-embedding-based methods (TextCraftor), training-based methods (DRaFT and DDPO), and training-free methods (DPO-Diffusion, Promptist, and ChatGPT-4o). Specifically, IPGO achieves a win rate exceeding 99% in prompt-wise learning, and IPGO+ achieves performance comparable to, and often better than, current SOTA methods (a 75% win rate) in prompt-batch learning. Moreover, we illustrate IPGO's generalizability and its capability to significantly enhance image quality while requiring minimal data and resources.
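The mechanism summarized above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give the exact parameterization, so every name and numeric choice below is an assumption made for illustration. In particular, the Cayley transform used to build the rotation, the clipping bound used for the range constraint, the mean-matching form of the conformity penalty, and the projection-free attention shown for IPGO+ are all hypothetical stand-ins.

```python
import numpy as np


def cayley_rotation(theta):
    """Map an unconstrained square matrix to an orthogonal matrix (Cayley transform)."""
    skew = theta - theta.T                      # skew-symmetric: skew.T == -skew
    eye = np.eye(theta.shape[0])
    return np.linalg.solve(eye - skew, eye + skew)   # (I - S)^{-1} (I + S)


def low_rank_embeddings(coeffs, basis, theta):
    """Inserted embeddings = coefficients times a rotated low-rank basis."""
    return coeffs @ (cayley_rotation(theta) @ basis)


def inject(prompt_emb, prefix, suffix):
    """Place prefix and suffix embeddings around the unchanged prompt embeddings."""
    return np.concatenate([prefix, prompt_emb, suffix], axis=0)


def orthonormality_penalty(basis):
    """Squared deviation of the basis Gram matrix from the identity."""
    gram = basis @ basis.T
    return float(np.sum((gram - np.eye(basis.shape[0])) ** 2))


def conformity_penalty(inserted, prompt_emb):
    """Hypothetical conformity term: match the mean inserted and prompt embedding."""
    return float(np.sum((inserted.mean(axis=0) - prompt_emb.mean(axis=0)) ** 2))


def parameter_free_cross_attention(inserted, prompt_emb):
    """Attention with no learned projections: inserted tokens attend to the prompt."""
    scores = inserted @ prompt_emb.T / np.sqrt(prompt_emb.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ prompt_emb


rng = np.random.default_rng(0)
seq_len, dim, rank, n_insert = 6, 16, 4, 3      # toy sizes, chosen arbitrarily

prompt_emb = rng.normal(size=(seq_len, dim))    # stand-in for text-encoder output

# Basis with orthonormal rows, taken from a reduced QR factorization.
q, _ = np.linalg.qr(rng.normal(size=(dim, rank)))
basis = q.T                                     # shape (rank, dim)

theta = 0.1 * rng.normal(size=(rank, rank))     # rotation parameters
# Hypothetical range constraint: keep coefficients inside [-1, 1].
prefix_coeffs = np.clip(rng.normal(size=(n_insert, rank)), -1.0, 1.0)
suffix_coeffs = np.clip(rng.normal(size=(n_insert, rank)), -1.0, 1.0)

prefix = low_rank_embeddings(prefix_coeffs, basis, theta)
suffix = low_rank_embeddings(suffix_coeffs, basis, theta)
augmented = inject(prompt_emb, prefix, suffix)  # (seq_len + 2 * n_insert, dim)

rotated_basis = cayley_rotation(theta) @ basis
attended = parameter_free_cross_attention(prefix, prompt_emb)

print("augmented shape:", augmented.shape)
print("orthonormality penalty:", orthonormality_penalty(rotated_basis))
print("conformity penalty:", conformity_penalty(np.vstack([prefix, suffix]), prompt_emb))
```

In this sketch only the coefficients, the basis, and the rotation parameters would be trainable, while the prompt embeddings and the diffusion model stay fixed. The gradient step itself, which would backpropagate a reward through the image generator, is omitted here.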
