Prompt Refinement with Image Pivot for Text-to-Image Generation
Jingtao Zhan, Qingyao Ai, Yiqun Liu, Yingwei Pan, Ting Yao, Jiaxin Mao, Shaoping Ma, Tao Mei
TL;DR
This work introduces Prompt Refinement with Image Pivot (PRIP), which translates user-friendly prompts into system-friendly prompts for text-to-image generation by using the latent representation of a user-preferred image as an intermediary pivot. PRIP decomposes refinement into two data-rich tasks: inferring the pivot, the representation of the image a user would prefer, from the user prompt, and decoding that pivot into a system prompt. Training begins with a supervised warm-up and continues with end-to-end PPO reinforcement learning. The model connects a T5-based Preference Encoder to an LLM-based Prompt Decoder through a projected image pivot, is trained on image and prompt logs, and is evaluated across multiple diffusion models with both automated metrics and human judgments. PRIP outperforms diverse baselines, including non-pivot and synthetic-pair methods, and transfers zero-shot to generation systems unseen during training. The work also highlights ethical considerations. By decoupling user intent from system-language rendering, it shows how pivot-based strategies can use abundant auxiliary data to solve data-scarce cross-domain translation tasks.
Abstract
For text-to-image generation, automatically refining user-provided natural language prompts into the keyword-enriched prompts favored by systems is essential for the user experience. Such a prompt refinement process is analogous to translating the prompt from "user languages" into "system languages". However, the scarcity of such parallel corpora makes it difficult to train a prompt refinement model. Inspired by zero-shot machine translation techniques, we introduce Prompt Refinement with Image Pivot (PRIP). PRIP innovatively uses the latent representation of a user-preferred image as an intermediary "pivot" between the user and system languages. It decomposes the refinement process into two data-rich tasks: inferring representations of user-preferred images from user languages and subsequently translating image representations into system languages. Thus, it can leverage abundant data for training. Extensive experiments show that PRIP substantially outperforms a wide range of baselines and effectively transfers to unseen systems in a zero-shot manner.
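The two-stage decomposition described above can be illustrated with a toy sketch. This is not the authors' implementation: the names `PivotRefiner`, `toy_encode`, `toy_decode`, the two-dimensional "latents", and both logs are hypothetical stand-ins, and nearest-neighbour lookup is swapped in for the trained Preference Encoder and Prompt Decoder. The point it aims to show is only that the two stages learn from separate data (user prompt to preferred image, and image to system prompt) and never need a parallel user-to-system corpus.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Vector = List[float]

# Hypothetical stage-1 data: (user prompt, latent of the image the user preferred).
PREFERENCE_LOG: List[Tuple[str, Vector]] = [
    ("a cat sitting on a sofa", [1.0, 0.0]),
    ("a castle at sunset", [0.0, 1.0]),
]

# Hypothetical stage-2 data: (image latent, system prompt that produced the image).
PROMPT_LOG: List[Tuple[Vector, str]] = [
    ([0.9, 0.1], "cat on sofa, studio lighting, highly detailed"),
    ([0.1, 0.9], "castle, sunset, golden hour, matte painting"),
]


def toy_encode(user_prompt: str) -> Vector:
    """Stand-in for the preference encoder: pick the logged latent whose
    user prompt shares the most words with the input."""
    words = set(user_prompt.lower().split())
    best = max(PREFERENCE_LOG, key=lambda pair: len(words & set(pair[0].split())))
    return best[1]


def toy_decode(pivot: Vector) -> str:
    """Stand-in for the prompt decoder: return the system prompt whose
    logged latent is closest (squared Euclidean distance) to the pivot."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(PROMPT_LOG, key=lambda pair: sq_dist(pair[0], pivot))[1]


@dataclass
class PivotRefiner:
    """Composes the two stages through the image-latent pivot."""
    encode_preference: Callable[[str], Vector]  # user language -> pivot
    decode_prompt: Callable[[Vector], str]      # pivot -> system language

    def refine(self, user_prompt: str) -> str:
        pivot = self.encode_preference(user_prompt)
        return self.decode_prompt(pivot)


refiner = PivotRefiner(encode_preference=toy_encode, decode_prompt=toy_decode)
print(refiner.refine("a fluffy cat on a sofa"))
```

Because the stages meet only at the pivot vector, either stand-in could in principle be replaced without touching the other, which is the property that lets a pivot-based design draw on separate, abundant datasets for each stage.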
