Prompt Refinement with Image Pivot for Text-to-Image Generation

Jingtao Zhan, Qingyao Ai, Yiqun Liu, Yingwei Pan, Ting Yao, Jiaxin Mao, Shaoping Ma, Tao Mei

TL;DR

This work introduces Prompt Refinement with Image Pivot (PRIP), a pivot-based approach that translates user-friendly prompts into system-friendly prompts for text-to-image generation, using the latent representation of a user-preferred image as an intermediary pivot. PRIP decomposes refinement into two data-rich tasks: inferring the representation of a user-preferred image from the user prompt, and decoding that image representation into a system prompt. Both components are first warmed up with supervised training and then trained end to end with PPO reinforcement learning. The method connects a T5-based Preference Encoder to an LLM-based Prompt Decoder through a projected image pivot, is trained on image and prompt logs, and is evaluated across multiple diffusion models with both automated metrics and human judgments. PRIP outperforms diverse baselines, including non-pivot and synthetic-pair methods, and remains effective on unseen generation systems in a zero-shot setting. The work also discusses ethical considerations, and it shows how pivot-based strategies can use abundant auxiliary data to address data-scarce cross-domain translation tasks.

Abstract

For text-to-image generation, automatically refining user-provided natural language prompts into the keyword-enriched prompts favored by systems is essential for the user experience. Such a prompt refinement process is analogous to translating the prompt from "user languages" into "system languages". However, the scarcity of such parallel corpora makes it difficult to train a prompt refinement model. Inspired by zero-shot machine translation techniques, we introduce Prompt Refinement with Image Pivot (PRIP). PRIP innovatively uses the latent representation of a user-preferred image as an intermediary "pivot" between the user and system languages. It decomposes the refinement process into two data-rich tasks: inferring representations of user-preferred images from user languages and subsequently translating image representations into system languages. Thus, it can leverage abundant data for training. Extensive experiments show that PRIP substantially outperforms a wide range of baselines and effectively transfers to unseen systems in a zero-shot manner.

Paper Structure (33 sections, 8 equations, 4 figures, 8 tables)


Figures (4)

  • Figure 1: PRIP Model Architecture: a Preference Encoder and a Prompt Decoder. (1) Upon receiving a user prompt, Preference Encoder first applies a transformer to derive a token-level representation. A subsequent transformer leverages cross-attention to deduce the image preference and yields an image representation. (2) Prompt Decoder then employs a linear layer to align the dimensionality. This aligned representation is integrated into a template and input into a large language model, which generates the refined system prompt.
  • Figure 2: Training Preference Encoder: Prompts and preferred images are paired to create the training set. The objective is to minimize the Mean Squared Error between the ground-truth image representations and the predictions from the Preference Encoder.
  • Figure 3: Training Prompt Decoder: Prompts that can generate impressive images are sampled as the system language. The objective is to predict the system language based on the associated image representation.
  • Figure 4: End-to-End RL Training: Given a user prompt, PRIP generates a refined prompt, and a Reward Model scores user preference for the generated images. The difference in scores serves as the reward, and PRIP is updated with the PPO gradient.
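
The quantities named in the captions above can be illustrated with a minimal sketch. It is not the authors' implementation. All function names are hypothetical, vectors are plain Python lists rather than model tensors, and the reward is assumed to be the score of the image generated from the refined prompt minus the score of the image generated from the original user prompt, a pairing the captions do not spell out.

```python
from typing import List, Sequence


def mse_loss(predicted: Sequence[float], target: Sequence[float]) -> float:
    """Mean Squared Error between a predicted and a ground-truth image
    representation (the Preference Encoder objective in Figure 2)."""
    if len(predicted) != len(target):
        raise ValueError("representations must have the same length")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


def project(
    pivot: Sequence[float],
    weight: Sequence[Sequence[float]],
    bias: Sequence[float],
) -> List[float]:
    """Linear layer that aligns the pivot's dimensionality with the
    decoder's input size (Figure 1, step 2). `weight` has one row per
    output dimension; the values used here are illustrative only."""
    return [
        sum(w * x for w, x in zip(row, pivot)) + b
        for row, b in zip(weight, bias)
    ]


def reward_differential(score_refined: float, score_original: float) -> float:
    """Reward as a difference of preference scores (Figure 4).
    ASSUMPTION: refined-prompt score minus original-prompt score."""
    return score_refined - score_original


if __name__ == "__main__":
    # Toy two-dimensional pivot projected to three dimensions.
    pivot = [1.0, 2.0]
    weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    bias = [0.0, 0.0, 0.5]
    print(project(pivot, weight, bias))
    print(mse_loss([1.0, 2.0], [1.0, 4.0]))
    print(reward_differential(0.75, 0.5))
```

In a real system the projection weights would be learned and the scores would come from a trained preference model; the sketch only fixes the arithmetic of each step.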