TIPO: Text to Image with Text Presampling for Prompt Optimization
Shih-Ying Yeh, Sang-Hyun Park, Yi Li, Giyeong Oh, Xuehai Wang, Min Song, Youngjae Yu
TL;DR
This work tackles prompt engineering for text-to-image generation by proposing TIPO, a lightweight, pre-sampling-based framework that transforms user prompts into model-aligned, richly detailed inputs without relying on costly LLMs or RL. By training a multitask language model on vast caption-driven text distributions and applying a three-stage refinement (tag enrichment, natural-language (NL) extension, and NL refinement), TIPO aligns prompts with the training data of target T2I models, improving fidelity and aesthetics and reducing artifacts. Across in-domain and out-of-domain evaluations, TIPO outperforms state-of-the-art baselines on multiple metrics and earns higher human preference, highlighting its practical potential for scalable, automated prompt engineering. The study also provides extensive implementation details, ablations, and a release of code and models to promote adoption and further research into efficient, robust generative prompting.
Abstract
TIPO (Text-to-Image Prompt Optimization) introduces an efficient approach for automatic prompt refinement in text-to-image (T2I) generation. Starting from simple user prompts, TIPO leverages a lightweight pre-trained model to expand these prompts into richer, detailed versions. Conceptually, TIPO samples refined prompts from a targeted sub-distribution within the broader semantic space, preserving the original intent while significantly improving visual quality, coherence, and detail. Unlike resource-intensive methods based on large language models (LLMs) or reinforcement learning (RL), TIPO provides computational efficiency and scalability, opening new possibilities for effective, automated prompt engineering in T2I tasks. We provide visual results and a human preference report to investigate TIPO's effectiveness. Experimental evaluations on benchmark datasets demonstrate substantial improvements in aesthetic quality, a significant reduction of visual artifacts, enhanced alignment with target distributions, and higher human preference. These results highlight the importance of targeted prompt engineering in text-to-image tasks and indicate broader opportunities for automated prompt refinement.
