Fast Prompt Alignment for Text-to-Image Generation

Khalil Mrini, Hanlin Lu, Linjie Yang, Weilin Huang, Heng Wang

TL;DR

Fast Prompt Alignment (FPA) addresses the slow, iterative nature of text-to-image prompt optimization by distilling the gains of iterative optimization into a single pass. It uses a large language model (LLM) to paraphrase prompts, then either fine-tunes a 7B model for real-time inference or applies in-context learning with a 123B model to produce optimized prompts on the fly. On COCO Captions and PartiPrompts, FPA achieves competitive TIFA and VQA alignment scores at a fraction of the processing time, and a human study shows a strong correlation between human judgments and the automated metrics. The results position FPA as a scalable option for real-time, high-demand text-to-image (T2I) applications, and the code is released to support adoption and further research.

Abstract

Text-to-image generation has advanced rapidly, yet aligning complex textual prompts with generated visuals remains challenging, especially with intricate object relationships and fine-grained details. This paper introduces Fast Prompt Alignment (FPA), a prompt optimization framework that leverages a one-pass approach, enhancing text-to-image alignment efficiency without the iterative overhead typical of current methods like OPT2I. FPA uses large language models (LLMs) for single-iteration prompt paraphrasing, followed by fine-tuning or in-context learning with optimized prompts to enable real-time inference, reducing computational demands while preserving alignment fidelity. Extensive evaluations on the COCO Captions and PartiPrompts datasets demonstrate that FPA achieves competitive text-image alignment scores at a fraction of the processing time, as validated through both automated metrics (TIFA, VQA) and human evaluation. A human study with expert annotators further reveals a strong correlation between human alignment judgments and automated scores, underscoring the robustness of FPA's improvements. The proposed method showcases a scalable, efficient alternative to iterative prompt optimization, enabling broader applicability in real-time, high-demand settings. The codebase is provided to facilitate further research: https://github.com/tiktok/fast_prompt_alignment
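
To make the in-context learning variant more concrete, the following is a minimal sketch of how previously optimized prompts could be packed into a few-shot instruction for an LLM. The function name `build_few_shot_prompt`, the instruction wording, and the `Original:` / `Optimized:` template are all assumptions made for illustration; they are not taken from the paper or its codebase.

```python
from typing import List, Tuple


def build_few_shot_prompt(examples: List[Tuple[str, str]], new_prompt: str) -> str:
    """Assemble a few-shot instruction from (original, optimized) prompt pairs.

    Hypothetical template: the actual instruction text used by FPA may differ.
    """
    lines = [
        "Rewrite each text-to-image prompt so the generated image matches it more closely.",
        "",
    ]
    # Each demonstration shows the LLM one original prompt and its optimized rewrite.
    for original, optimized in examples:
        lines.append(f"Original: {original}")
        lines.append(f"Optimized: {optimized}")
        lines.append("")
    # The new prompt is left open so the LLM completes the optimized version.
    lines.append(f"Original: {new_prompt}")
    lines.append("Optimized:")
    return "\n".join(lines)


if __name__ == "__main__":
    # Toy demonstration pair, invented for this example only.
    demo = [("a cat on a sofa", "a photo of a single cat sitting on a sofa")]
    print(build_few_shot_prompt(demo, "a dog under a table"))
```

The resulting string would be sent to the LLM in a single call, which is what keeps inference to one pass: no image needs to be generated or scored at request time.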

Paper Structure

This paper contains 30 sections, 6 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Example of iterative optimization for a prompt used to generate images. The model we use is Stable Diffusion 3 Medium. We provide Google image search results for the original prompt for comparison.
  • Figure 2: Diagram of the Fast Prompt Alignment (FPA) method for text-to-image generation. The process includes paraphrase generation, image scoring, and both fine-tuning and in-context learning methods for efficient prompt optimization. We generate 4 paraphrases in our case, and we illustrate that with the 4 parallel arrows.
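
The paraphrase-and-score stage described in the Figure 2 caption can be sketched as a best-of-N selection. This is a simplified illustration under stated assumptions: `paraphrase` and `score` are hypothetical callables standing in for the LLM paraphraser and for the image generation plus alignment scoring step (for example, a TIFA- or VQA-based score), and keeping the original prompt as a candidate is a design choice of this sketch, not something the text confirms. The default of 4 paraphrases follows the caption.

```python
from typing import Callable, List, Tuple


def select_best_paraphrase(
    prompt: str,
    paraphrase: Callable[[str, int], List[str]],
    score: Callable[[str, str], float],
    n: int = 4,
    keep_original: bool = True,
) -> Tuple[str, float]:
    """Return the candidate prompt with the highest alignment score.

    paraphrase(prompt, n) -> n rewritten prompts (hypothetical LLM wrapper).
    score(original, candidate) -> alignment of the image generated from
    `candidate` with the `original` prompt (hypothetical scorer).
    """
    candidates = paraphrase(prompt, n)
    if keep_original:
        # Assumption: the unmodified prompt competes with its paraphrases.
        candidates = [prompt] + candidates
    scored = [(candidate, score(prompt, candidate)) for candidate in candidates]
    return max(scored, key=lambda pair: pair[1])


def toy_paraphrase(prompt: str, n: int) -> List[str]:
    """Stand-in paraphraser that only tags the prompt with a variant index."""
    return [f"{prompt} (variant {i})" for i in range(1, n + 1)]


def toy_score(original: str, candidate: str) -> float:
    """Stand-in scorer with made-up values; real scores come from images."""
    if candidate == original:
        return 0.5
    return 0.5 + 0.1 * int(candidate[-2])


if __name__ == "__main__":
    best, best_score = select_best_paraphrase(
        "a red cube on a blue sphere", toy_paraphrase, toy_score
    )
    print(best, best_score)
```

Each (original prompt, best candidate) pair produced this way could then serve either as a fine-tuning example or as an in-context demonstration, which are the two routes the diagram shows.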