IPGO: Indirect Prompt Gradient Optimization for Parameter-Efficient Prompt-level Fine-Tuning on Text-to-Image Models

Jianping Ye, Michel Wedel, Kunpeng Zhang

TL;DR

IPGO introduces a parameter-efficient, reward-guided, prompt-level fine-tuning strategy for diffusion-based text-to-image generation that injects trainable prefix and suffix embeddings into the prompt embeddings. The injected embeddings are optimized under orthonormality, range, and conformity constraints, and are parameterized with low-rank rotated bases to narrow the search space. IPGO+ adds a parameter-free cross-attention layer to reinforce interactions between the inserted embeddings and the original prompt for prompt-batch training. Across COCO, DiffusionDB, and Pick-a-Pic prompts and three reward models, IPGO outperforms state-of-the-art baselines in prompt-wise training (win rate above 99%), and IPGO+ matches or beats them in prompt-batch training (75% win rate), with substantially fewer trainable parameters. The results indicate strong generalization and potential for efficient alignment with aesthetics, semantics, and human preferences.

Abstract

Text-to-image diffusion models excel at generating images from text prompts but often exhibit suboptimal alignment with content semantics, aesthetics, and human preferences. To address these limitations, this study proposes a novel parameter-efficient framework, Indirect Prompt Gradient Optimization (IPGO), for prompt-level diffusion model fine-tuning. IPGO enhances prompt embeddings by injecting continuously differentiable embeddings at the beginning and end of the prompt embeddings, leveraging low-rank structures with the flexibility and nonlinearity of rotations. This approach enables gradient-based optimization of the injected embeddings under range, orthonormality, and conformity constraints, effectively narrowing the search space, promoting a stable solution, and ensuring alignment between the injected embeddings and the embeddings of the original prompt. Its extension, IPGO+, adds a parameter-free cross-attention mechanism on the prompt embedding to enforce dependencies between the original prompt and the inserted embeddings. We conduct extensive evaluations through prompt-wise (IPGO) and prompt-batch (IPGO+) training using three reward models (image aesthetics, image-text alignment, and human preferences) across three datasets of varying complexity. The results show that IPGO consistently outperforms SOTA benchmarks, including Stable Diffusion v1.5 with raw prompts, text-embedding-based methods (TextCraftor), training-based methods (DRaFT and DDPO), and training-free methods (DPO-Diffusion, Promptist, and ChatGPT-4o). Specifically, IPGO achieves a win rate exceeding 99% in prompt-wise learning, and IPGO+ achieves comparable, and often better, performance against current SOTAs (a 75% win rate) in prompt-batch learning. Moreover, we illustrate IPGO's generalizability and its capability to significantly enhance image quality while requiring minimal data and resources.
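To make the injection idea in the abstract concrete, the following is a minimal NumPy sketch of how prefix and suffix embeddings could be built from a low-rank orthonormal basis that is rotated before use. It is an illustration under stated assumptions, not the authors' implementation: the helper names (`orthonormal_basis`, `rotation_from_skew`, `inject`), the Cayley-transform rotation, the `tanh` squashing used as a stand-in for the range constraint, and all sizes are hypothetical. The conformity constraint and the reward-driven gradient updates are omitted; only the forward construction is shown.

```python
import numpy as np

def orthonormal_basis(rank, dim, rng):
    # QR of a random Gaussian matrix yields `rank` orthonormal vectors of length `dim`.
    q, _ = np.linalg.qr(rng.standard_normal((dim, rank)))
    return q.T  # shape (rank, dim); rows are orthonormal

def rotation_from_skew(params, rank):
    # Cayley transform (assumed here for illustration): a skew-symmetric matrix A
    # gives the rotation (I + A)^{-1} (I - A), which is orthogonal by construction.
    a = np.zeros((rank, rank))
    a[np.triu_indices(rank, k=1)] = params
    a = a - a.T
    eye = np.eye(rank)
    return np.linalg.solve(eye + a, eye - a)

def inject(prompt_emb, prefix, suffix):
    # Place the trainable embeddings before and after the frozen prompt embeddings.
    return np.concatenate([prefix, prompt_emb, suffix], axis=0)

rng = np.random.default_rng(0)
dim, rank, n_insert, n_tokens = 768, 4, 2, 10   # illustrative sizes only

prompt_emb = rng.standard_normal((n_tokens, dim))  # stand-in for text-encoder output
basis = orthonormal_basis(rank, dim, rng)

# Hypothetical trainable parameters: rotation angles and low-rank coefficients.
theta = 0.1 * rng.standard_normal(rank * (rank - 1) // 2)
coeff_prefix = np.tanh(rng.standard_normal((n_insert, rank)))  # bounded in (-1, 1)
coeff_suffix = np.tanh(rng.standard_normal((n_insert, rank)))

rotated = rotation_from_skew(theta, rank) @ basis  # rows remain orthonormal
prefix = coeff_prefix @ rotated
suffix = coeff_suffix @ rotated

full_emb = inject(prompt_emb, prefix, suffix)
print(full_emb.shape)
```

In this sketch the trainable quantities are only `theta` and the two coefficient matrices, which is far smaller than the diffusion model itself; in an actual training loop they would be updated by backpropagating a reward through an automatic-differentiation framework.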

Paper Structure

This paper contains 46 sections, 14 equations, 14 figures, 7 tables, 1 algorithm.

Figures (14)

  • Figure 1: IPGO(+) inserts trainable prefix and suffix embeddings into the text embeddings of the prompt in the CLIP text encoder space (the text embedding space), and then propagates reward signals back through backpropagation under three constraints: orthonormality, range, and conformity. IPGO+ further adds an attention layer (indicated by the dashed black box) to the text embeddings prior to image sampling.
  • Figure 2: Example images generated with Stable Diffusion v1.5 using the raw prompt (row 1), IPGO (row 2), DRaFT-1 (row 3) and TextCraftor (row 4), evaluated according to the HPSv2 reward.
  • Figure 3: Illustration of the use of rotation in the optimization, comparing optimization with rotation (left) and without rotation (right). The yellow star is the optimum point. The red dashed lines on the left are circles whose radius equals the length of the current $x_t$. The purple line is the shortest distance between the initial point, the origin, and the optimum point. With rotations, the updates are calculated such that the total path is the shortest.
  • Figure 4: Example images generated by prompt-batch training with IPGO+ on the HPSv2 and aesthetic rewards. The left three columns are the resulting images after training on the COCO Mixed data; the right three columns show images resulting from inserting the trained prefix and suffix in the unseen animal category prompts.
  • Figure 5: The rows show six sets of example images (see the columns of Figure \ref{fig:batch_aes_images}) generated from prompts that use convex combinations of the prefixes/suffixes from human preference and aesthetics rewards. The top row indicates the combination weights, in the format of (human preference weight, aesthetics weight). We find the combined style smoothly changes from purely human-preference-styled to purely aesthetics-styled.
  • ...and 9 more figures
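
The convex combinations mentioned in the Figure 5 caption can be sketched as a weighted average of two learned embeddings. This is a hedged illustration only: the function name `blend`, the array shapes, and the random stand-in embeddings are assumptions, and the caption does not specify how the authors implement the combination.

```python
import numpy as np

def blend(emb_pref, emb_aes, w_pref):
    # Convex combination: weights (w_pref, 1 - w_pref) are non-negative and sum to 1.
    if not 0.0 <= w_pref <= 1.0:
        raise ValueError("w_pref must lie in [0, 1]")
    return w_pref * emb_pref + (1.0 - w_pref) * emb_aes

rng = np.random.default_rng(0)
prefix_pref = rng.standard_normal((2, 768))  # stand-in: prefix trained on a human-preference reward
prefix_aes = rng.standard_normal((2, 768))   # stand-in: prefix trained on an aesthetics reward

# Sweep from purely human-preference-styled to purely aesthetics-styled.
weights = [1.0, 0.75, 0.5, 0.25, 0.0]
blended = [blend(prefix_pref, prefix_aes, w) for w in weights]
```

The same function would apply unchanged to the suffix embeddings; each blended prefix/suffix pair would then be inserted around the prompt embeddings before sampling.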