TIPO: Text to Image with Text Presampling for Prompt Optimization

Shih-Ying Yeh, Sang-Hyun Park, Yi Li, Giyeong Oh, Xuehai Wang, Min Song, Youngjae Yu

TL;DR

This work tackles prompt engineering for text-to-image generation by proposing TIPO, a lightweight, pre-sampling-based framework that transforms user prompts into model-aligned, richly detailed inputs without relying on costly LLMs or RL. By training a multitask language model on vast caption-driven text distributions and applying a three-stage refinement (tag enrichment, NL extension, and NL refinement), TIPO aligns prompts with the training data of target T2I models, improving fidelity, aesthetics, and artifact reduction. Across in-domain and out-of-domain evaluations, TIPO outperforms state-of-the-art baselines in multiple metrics and earns higher human preference, highlighting its practical potential for scalable, automated prompt engineering. The study also provides extensive implementation details, ablations, and a release of code and models to promote adoption and further research into efficient, robust generative prompting.

Abstract

TIPO (Text-to-Image Prompt Optimization) introduces an efficient approach for automatic prompt refinement in text-to-image (T2I) generation. Starting from simple user prompts, TIPO leverages a lightweight pre-trained model to expand these prompts into richer, more detailed versions. Conceptually, TIPO samples refined prompts from a targeted sub-distribution within the broader semantic space, preserving the original intent while significantly improving visual quality, coherence, and detail. Unlike resource-intensive methods based on large language models (LLMs) or reinforcement learning (RL), TIPO provides computational efficiency and scalability, opening new possibilities for effective, automated prompt engineering in T2I tasks. We provide visual results and a human preference report to investigate TIPO's effectiveness. Experimental evaluations on benchmark datasets demonstrate substantial improvements in aesthetic quality, a significant reduction in visual artifacts, and enhanced alignment with target distributions, along with a clear preference among human evaluators. These results highlight the importance of targeted prompt engineering in text-to-image tasks and indicate broader opportunities for automated prompt refinement.
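To make the idea of sampling from a "targeted sub-distribution" concrete, here is a minimal toy sketch in Python. It is not TIPO's method: TIPO uses a trained language model, whereas this sketch swaps in a simple tag-overlap lookup over a tiny, made-up caption list. The names `CAPTIONS`, `tags`, and `presample` are hypothetical and exist only for this illustration. The point is only that extra detail is drawn from captions related to the user's prompt instead of from the whole vocabulary.

```python
import random

# Hypothetical toy corpus standing in for a T2I model's training captions.
CAPTIONS = [
    "astronaut, horse, mars, red desert, dust, cinematic lighting",
    "astronaut, space station, earth in background, stars",
    "horse, meadow, sunrise, fog, grass",
    "cat, sofa, indoor, warm light",
    "mars, rover, red desert, rocks, wide shot",
]


def tags(text):
    """Split a comma-separated prompt or caption into clean tags."""
    return [t.strip() for t in text.split(",") if t.strip()]


def presample(prompt, rng, max_new=4):
    """Extend `prompt` with tags taken only from captions that share a tag with it.

    Toy stand-in for a learned pre-sampler: the "sub-distribution" here is
    simply the set of captions overlapping with the user's tags.
    """
    user = tags(prompt)
    pool = set()
    for caption in CAPTIONS:
        cap_tags = tags(caption)
        if set(cap_tags) & set(user):
            # Keep only tags the user has not already supplied.
            pool.update(t for t in cap_tags if t not in user)
    candidates = sorted(pool)  # sorted for reproducibility with a seeded rng
    extra = rng.sample(candidates, k=min(max_new, len(candidates)))
    # The user's tags always come first, so the original intent is preserved.
    return ", ".join(user + extra)


if __name__ == "__main__":
    print(presample("astronaut, horse, mars", random.Random(0)))
```

Because the pool is built only from overlapping captions, unrelated tags such as "cat" or "sofa" can never be appended, which is the contrast with adding random words. A prompt with no overlapping captions is returned unchanged.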

Paper Structure

This paper contains 79 sections, 11 equations, 32 figures, 12 tables.

Figures (32)

  • Figure 1: Comparison of prompt optimization methods using LLMs. (a) uses instructions for prompting, but its understanding is constrained by the LLM's knowledge base, not the T2I model's. (b) relies on a curated prompt database, enhancing detail but limiting variety by not fully leveraging the T2I model's learned distribution. (c) optimizes using a scorer with RL, requiring multi-turn inference at additional cost. (d) aligns prompts with the T2I model's training distribution, ensuring detailed and varied topic-related prompt generation that fits the target T2I model.
  • Figure 2: Illustration of various pre-sampling methods for generating the T2I prompt "An astronaut rides horse on Mars" + $<\mspace{-5mu}\mathcal{T}\mspace{-5mu}>$. (a) yields a basic image. (b) enhances image detail but requires manual refinement. (c) adds random words, which may introduce irrelevant content (red boxes) that goes beyond the user's intent. (d) TIPO pre-sampling (ours) aligns outputs with the expected intent, maintaining both detail and variety. $<\mspace{-5mu}\mathcal{T}\mspace{-5mu}>$ represents a transformation function for pre-sampling.
  • Figure 3: TIPO’s task flow within a single generation. Starting from any node, each arrow represents a sequential extension step, with other prompts used as metadata if provided. This task design enables efficient and flexible prompt extension across multiple tasks.
  • Figure 4: An example scenario of TIPO workflow. (a) A generated image and prompts. (b) Prompt optimization of TIPO, from simple user input $p_s$ to detailed final output $p_d$. The shading from gray to light sky blue represents an increase in context richness in the prompt.
  • Figure 5: Generated images from 4 types of prompts: (a) simple scenery tag, (b) scenery tag enhanced by TIPO, (c) truncated ($<$ 40 words) long prompt, (d) TIPO-enhanced truncated prompt. TIPO adds detail and maintains variety, yielding coherent images from simple prompts.
  • ...and 27 more figures