Prompt Refinement with Image Pivot for Text-to-Image Generation

Jingtao Zhan; Qingyao Ai; Yiqun Liu; Yingwei Pan; Ting Yao; Jiaxin Mao; Shaoping Ma; Tao Mei

TL;DR

This work introduces Prompt Refinement with Image Pivot (PRIP), a pivot-based approach that translates user-friendly prompts into system-friendly prompts for text-to-image generation, using the latent representation of a user-preferred image as an intermediary pivot. PRIP decomposes refinement into two data-rich tasks: predicting the representation of the image a user would prefer from the user prompt, and decoding that image representation into a system prompt. The two components are first trained separately with supervision as a warm-up and then jointly with end-to-end PPO reinforcement learning. The model consists of a T5-based Preference Encoder and an LLM-based Prompt Decoder connected through a projected image pivot; it is trained on image and prompt logs and evaluated across multiple diffusion models with both automated metrics and human judgments. Results show that PRIP outperforms diverse baselines, including non-pivot and synthetic-pair methods, and remains effective on unseen generation systems in a zero-shot setting. The paper also highlights ethical considerations. More broadly, the work decouples user intent from system-language rendering and shows how pivot-based strategies can use abundant auxiliary data to solve data-scarce cross-domain translation tasks.

Abstract

For text-to-image generation, automatically refining user-provided natural language prompts into the keyword-enriched prompts favored by systems is essential for the user experience. Such a prompt refinement process is analogous to translating the prompt from "user languages" into "system languages". However, the scarcity of such parallel corpora makes it difficult to train a prompt refinement model. Inspired by zero-shot machine translation techniques, we introduce Prompt Refinement with Image Pivot (PRIP). PRIP innovatively uses the latent representation of a user-preferred image as an intermediary "pivot" between the user and system languages. It decomposes the refinement process into two data-rich tasks: inferring representations of user-preferred images from user languages and subsequently translating image representations into system languages. Thus, it can leverage abundant data for training. Extensive experiments show that PRIP substantially outperforms a wide range of baselines and effectively transfers to unseen systems in a zero-shot manner.

Paper Structure (33 sections, 8 equations, 4 figures, 8 tables)

Introduction
Related Work
Method
Problem Analysis
Model Architecture
Disentangled Supervised Training
Training User-Pivot Preference
Training Pivot-System Decoding
End-to-End User-Pivot-System Training
Inference Process
Experimental Setup
Evaluation Setup
Baselines
Implementation Details
Experimental Results
...and 18 more sections

Figures (4)

Figure 1: PRIP Model Architecture: a Preference Encoder and a Prompt Decoder. (1) Upon receiving a user prompt, the Preference Encoder first applies a transformer to derive token-level representations. A second transformer then uses cross-attention to infer the image preference and outputs an image representation. (2) The Prompt Decoder applies a linear layer to align the dimensionality of this representation. The aligned representation is inserted into a template and fed to a large language model, which generates the refined system prompt.
Figure 2: Training Preference Encoder: Prompts and preferred images are paired to create the training set. The objective is to minimize the Mean Squared Error between the ground-truth image representations and the predictions from the Preference Encoder.
Figure 3: Training Prompt Decoder: Prompts that can generate impressive images are sampled as the system language. The objective is to predict the system language based on the associated image representation.
Figure 4: End-to-End RL Training: Given a user prompt, PRIP generates a refined prompt, and a Reward Model assigns user-preference scores to the generated images. The difference in scores serves as the reward, and PRIP is updated with the PPO gradient.
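
The objectives named in the captions of Figures 2 and 4 can be illustrated with a minimal sketch. This is not the authors' implementation: the function names are hypothetical, representations are plain lists of floats, and the reward assumes that the "differential in scores" compares the image generated from the refined prompt with the image generated from the original user prompt, which the captions do not state explicitly.

```python
from typing import Sequence


def mse_loss(predicted: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between a predicted image representation and
    the ground-truth representation of the user-preferred image."""
    if len(predicted) != len(target):
        raise ValueError("representations must have the same dimensionality")
    if len(predicted) == 0:
        raise ValueError("representations must be non-empty")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


def reward_differential(refined_score: float, original_score: float) -> float:
    """ASSUMPTION: the reward is the preference score of the image from the
    refined prompt minus the score of the image from the original prompt."""
    return refined_score - original_score


if __name__ == "__main__":
    # Toy two-dimensional representations, for illustration only.
    print(mse_loss([1.0, 2.0], [1.0, 4.0]))
    # Toy preference scores, for illustration only.
    print(reward_differential(0.75, 0.5))
```

Under this assumption, a positive reward means the refined prompt produced an image that the reward model scores higher than the one from the unrefined prompt, which is the behavior the end-to-end stage would reinforce.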
