Rethinking Prompt Design for Inference-time Scaling in Text-to-Visual Generation

Subin Kim, Sangwoo Mo, Mamshad Nayeem Rizve, Yiran Xu, Difan Liu, Jinwoo Shin, Tobias Hinz

TL;DR

This work addresses misalignment in text-to-visual generation under inference-time scaling by redesigning prompts in response to recurring failures observed across scaled visuals. It introduces Element-level Factual Correction (EFC), a fine-grained, NLI-based verifier that assesses each prompt element against generated visuals, and the PRIS framework, which uses EFC feedback to perform common-failure-aware prompt revisions. Empirical results show gains in prompt adherence for text-to-image (GenAI-Bench) and text-to-video (VBench 2.0) tasks, including a 15% gain on VBench 2.0 and further improvements when combined with existing visual-scaling methods. The study also provides a zero-shot verifier benchmark and shows that prompt redesign guided by cross-sample failures can outperform brute-force visual scaling, highlighting the practical value of jointly scaling prompts and visuals at inference time.

Abstract

Achieving precise alignment between user intent and generated visuals remains a central challenge in text-to-visual generation, as a single attempt often fails to produce the desired output. To handle this, prior approaches mainly scale the visual generation process (e.g., increasing sampling steps or seeds), but this quickly leads to a quality plateau. This limitation arises because the prompt, crucial for guiding generation, is kept fixed. To address this, we propose Prompt Redesign for Inference-time Scaling, coined PRIS, a framework that adaptively revises the prompt during inference in response to the scaled visual generations. The core idea of PRIS is to review the generated visuals, identify recurring failure patterns across visuals, and redesign the prompt accordingly before regenerating the visuals with the revised prompt. To provide precise alignment feedback for prompt revision, we introduce a new verifier, element-level factual correction, which evaluates the alignment between prompt attributes and generated visuals at a fine-grained level, achieving more accurate and interpretable assessments than holistic measures. Extensive experiments on both text-to-image and text-to-video benchmarks demonstrate the effectiveness of our approach, including a 15% gain on VBench 2.0. These results highlight that jointly scaling prompts and visuals is key to fully leveraging scaling laws at inference time. Visualizations are available at the website: https://subin-kim-cv.github.io/PRIS.
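
To make "recurring failure patterns across visuals" concrete, the following is a minimal sketch of how per-element verdicts from several generated samples could be aggregated. It is an illustration, not the authors' implementation: the function name `recurring_failures`, the `min_rate` threshold, and the example elements are all assumptions, and the verdicts are hard-coded where a real verifier would supply them.

```python
from collections import Counter


def recurring_failures(verdicts, min_rate=0.5):
    """Return prompt elements that fail in at least `min_rate` of the samples.

    `verdicts` holds one dict per generated visual, mapping each prompt
    element to True (satisfied) or False (violated). Both the data layout
    and the threshold are assumptions made for this sketch.
    """
    if not verdicts:
        return []
    fail_counts = Counter(
        element
        for sample in verdicts
        for element, ok in sample.items()
        if not ok
    )
    n = len(verdicts)
    # Most frequent failures first; ties are broken alphabetically.
    ranked = sorted(fail_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [element for element, count in ranked if count / n >= min_rate]


# Hypothetical verdicts for three samples of one prompt.
verdicts = [
    {"red car": True, "two dogs": False, "snowy street": False},
    {"red car": True, "two dogs": False, "snowy street": True},
    {"red car": False, "two dogs": False, "snowy street": True},
]
print(recurring_failures(verdicts))
```

Here "two dogs" is violated in every sample, so it is the element a revised prompt would emphasize, while the one-off misses on the other elements fall below the threshold.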

Paper Structure

This paper contains 28 sections, 20 figures, and 10 tables.

Figures (20)

  • Figure 1: Our prompt redesign scales with compute, while fixed prompts plateau. Given a user-provided complex text prompt, scaling visuals alone with a fixed prompt at inference time often leads to early performance plateaus, especially for unseen rewards (see orange line and boxes). It also repeatedly produces outputs that exhibit common failures and cover only parts of the prompt, even as compute increases to sample more visuals. In contrast, scaling visuals alongside our redesigned prompts yields progressively improved generations and substantially higher prompt-adherence scores as compute increases for both given and unseen rewards (see blue line and boxes).
  • Figure 2: Overview of Prompt Redesign for Inference-time Scaling (PRIS), which leverages diagnostic feedback from our verifier EFC to revise prompts during inference based on generated visuals. EFC decomposes prompts into semantic elements and verifies each element for fine-grained text-visual alignment (left). Guided by the EFC, PRIS proceeds as follows (right): Step 1 reviews initial generations with EFC; Step 2 selects top-$k$ successful samples and identifies recurring failures; Step 3 redesigns the prompt to emphasize common failures; and Step 4 regenerates visuals with the revised prompt and top-$k$ seeds. The process can be iterated by returning from Step 4 to Step 2.
  • Figure 3: Qualitative comparisons of T2I generation. $^{*}$ denotes results with standard prompt expansion.
  • Figure 4: Quantitative results of integrating PRIS with T2I visual scaling methods on GenAI-Bench. BoN refers to "Best-of-N" selection using fixed prompts. Bold shows the best.
  • Figure 5: Qualitative examples with increasing inference-time compute. PRIS generates progressively taller trees while satisfying all attributes, whereas BoN consistently misses some.
  • ...and 15 more figures
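
The four steps in the Figure 2 caption can be sketched as a short loop. This is a toy illustration under stated assumptions, not the paper's code: `pris`, `toy_generate`, `toy_verify`, and `toy_revise` are hypothetical names, the "visual" is just a set of depicted elements, the majority-failure rule is an assumed reading of "recurring failures", and a real system would use a generative model, the EFC verifier, and a language model for prompt revision.

```python
ELEMENTS = ["red car", "two dogs", "snow"]


def toy_generate(prompt, seed):
    # Toy "visual": the set of elements it depicts. An element appears if
    # the prompt emphasizes it in parentheses or the seed's bit i is set.
    return {
        e for i, e in enumerate(ELEMENTS)
        if f"({e})" in prompt or (seed >> i) & 1
    }


def toy_verify(visual, element):
    # Stand-in for an element-level check of one element against one visual.
    return element in visual


def toy_revise(prompt, failures):
    # Stand-in for prompt redesign: emphasize the recurring failures.
    return prompt + " Make sure to show: " + ", ".join(f"({e})" for e in failures)


def pris(prompt, elements, seeds, generate, verify, revise, k=2, rounds=3):
    """Toy PRIS-style loop; returns (best_score, best_seed, final_prompt)."""
    best_score, best_seed = -1.0, None
    for _ in range(rounds):
        # Step 1: generate one visual per seed and check every element.
        reviews = []
        for seed in seeds:
            visual = generate(prompt, seed)
            checks = {e: verify(visual, e) for e in elements}
            score = sum(checks.values()) / len(elements)
            reviews.append((score, seed, checks))
        # Step 2: rank samples and find elements that most samples miss
        # (the "more than half" rule is an assumption of this sketch).
        reviews.sort(key=lambda r: r[0], reverse=True)
        if reviews[0][0] > best_score:
            best_score, best_seed = reviews[0][0], reviews[0][1]
        failures = [
            e for e in elements
            if sum(not checks[e] for _, _, checks in reviews) > len(reviews) / 2
        ]
        if not failures:
            break
        # Step 3: redesign the prompt around the recurring failures.
        prompt = revise(prompt, failures)
        # Step 4: regenerate from the top-k seeds with the revised prompt.
        seeds = [seed for _, seed, _ in reviews[:k]]
    return best_score, best_seed, prompt


score, seed, final_prompt = pris(
    "a red car and two dogs in snow", ELEMENTS, [0, 1, 2, 3],
    toy_generate, toy_verify, toy_revise,
)
print(score, seed, final_prompt)
```

In this toy setup no seed ever depicts "snow" under the original prompt, so sampling more seeds alone cannot fix it; the revision step is what closes the gap, which mirrors the fixed-prompt plateau described in Figure 1.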