GenPilot: A Multi-Agent System for Test-Time Prompt Optimization in Image Generation

Wen Ye, Zhaocheng Liu, Yuwei Gui, Tingyu Yuan, Yunyue Su, Bowen Fang, Chaoyang Zhao, Qiang Liu, Liang Wang

TL;DR

GenPilot addresses semantic misalignment and compositionality gaps in text-to-image generation by reframing prompt optimization as a test-time search in the input space. It introduces a two-stage, plug-and-play multi-agent framework that combines error analysis (via VQA and caption-based signals) with memory-driven, clustering-guided refinement, all without model training. The system leverages a multimodal LLM as the scoring and reasoning backbone to iteratively improve prompts, achieving substantial improvements on challenging benchmarks (up to 16.9% on DPG-bench and 5.7% on GenEval) and demonstrating strong generalization across diverse diffusion backbones. By providing a structured error taxonomy (with 35 patterns) and a modular design, GenPilot offers interpretable guidance for prompt controllability and lays groundwork for future test-time prompt optimization research.

Abstract

Text-to-image synthesis has made remarkable progress, yet accurately interpreting complex and lengthy prompts remains challenging, often resulting in semantic inconsistencies and missing details. Existing solutions, such as fine-tuning, are model-specific and require training, while prior automatic prompt optimization (APO) approaches typically lack systematic error analysis and refinement strategies, resulting in limited reliability and effectiveness. Meanwhile, test-time scaling methods keep the prompt fixed and scale only the noise or the number of samples, limiting their interpretability and adaptability. To address these limitations, we introduce a flexible and efficient test-time prompt optimization strategy that operates directly on the input text. We propose a plug-and-play multi-agent system called GenPilot, integrating error analysis, clustering-based adaptive exploration, fine-grained verification, and a memory module for iterative optimization. Our approach is model-agnostic, interpretable, and well-suited for handling long and complex prompts. We also summarize common error patterns and refinement strategies, offering practical experience and encouraging further exploration. Experiments on DPG-bench and GenEval show improvements of up to 16.9% and 5.7%, respectively, demonstrating that our method improves text-image consistency and the structural coherence of generated images, and confirming the effectiveness of our test-time prompt optimization strategy. The code is available at https://github.com/27yw/GenPilot.
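
The iterative optimization described in the abstract can be pictured as a search loop over the prompt text. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: `optimize_prompt`, `toy_score`, `toy_refine`, and `REQUIRED` are hypothetical names, the scorer is a keyword-matching stand-in for generating an image and rating it with a multimodal LLM, and the clustering step is left out for brevity.

```python
from typing import Callable, List, Tuple

Memory = List[Tuple[str, float]]  # (prompt, score) pairs seen so far


def optimize_prompt(
    prompt: str,
    refine: Callable[[str, Memory], List[str]],
    score: Callable[[str], float],
    rounds: int = 3,
) -> Tuple[str, float]:
    """Toy test-time search over prompt text (hypothetical interface)."""
    memory: Memory = [(prompt, score(prompt))]
    best_prompt, best_score = memory[0]
    for _ in range(rounds):
        # Propose rewrites of the current best prompt, informed by memory.
        candidates = refine(best_prompt, memory)
        # Score every candidate and remember the outcome.
        memory.extend((c, score(c)) for c in candidates)
        # Keep the highest-scoring prompt seen so far.
        top_prompt, top_score = max(memory, key=lambda item: item[1])
        if top_score > best_score:
            best_prompt, best_score = top_prompt, top_score
    return best_prompt, best_score


# Stand-ins (assumptions): a real system would render an image for each
# prompt and ask a multimodal LLM to rate it and to propose rewrites.
REQUIRED = ("red cube", "left of", "blue sphere")


def toy_score(prompt: str) -> float:
    """Fraction of required phrases that appear in the prompt."""
    return sum(phrase in prompt for phrase in REQUIRED) / len(REQUIRED)


def toy_refine(prompt: str, memory: Memory) -> List[str]:
    """Propose one candidate per missing phrase (memory unused in this toy)."""
    missing = [phrase for phrase in REQUIRED if phrase not in prompt]
    return [f"{prompt}, {phrase}" for phrase in missing]


best_prompt, best_score = optimize_prompt("a red cube", toy_refine, toy_score)
print(best_prompt, best_score)
```

The generator is never touched in this loop: only the input text changes between rounds, which is what makes the approach plug-and-play across image models.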

Paper Structure

This paper contains 43 sections, 10 equations, 21 figures, 8 tables.

Figures (21)

  • Figure 1: Visualized examples from DALL-E 3 (Betker et al., 2023) with GenPilot processing complicated and lengthy prompts. Compared to prompt engineering (PE), generative models with GenPilot achieve accurate results, addressing the semantic gap and even the challenging task of excluding certain objects.
  • Figure 2: Overview of our proposed multi-agent system for test-time prompt optimization. GenPilot uses a multimodal large language model (MLLM) as the agent. In stage 1, we first decompose the prompt, then apply the error integration strategy based on image caption and VQA results, and map the errors back to the original prompt. In stage 2, we introduce test-time scaling by formulating the task as a search problem over the input text space. The test-time prompt optimization proceeds iteratively with a refinement agent, an MLLM scorer, a clustering algorithm (MacQueen, 1967), and the memory module to sample the current optimal prompt.
  • Figure 3: Qualitative comparison of different methods on the challenging DPG-bench dataset across different generative models. The left columns display two generations from SDv1.4 and one from DALL-E 3. The right columns present the results from SDv2.1, SD3, and FLUX.1 schnell. For the SD series, we select the best from BeautifulPrompt and MagicPrompt, along with the PE methods, for comparison. GenPilot consistently generates error-free images across all scenarios, demonstrating its superiority in synthesizing high-quality and accurate images.
  • Figure 4: Qualitative examples on GenEval. The left columns compare FLUX.1 schnell, FLUX.1 schnell with PE enhancement, and FLUX.1 schnell with GenPilot. The right columns show the results of PixArt-$\alpha$, PixArt-$\alpha$ with PE enhancement, and PixArt-$\alpha$ with GenPilot. GenPilot handles both positional relations and unrealistic prompts well, highlighting its potential and generalization for improving image quality.
  • Figure 5: Visualization of the clustering result for one case, with the number of clusters set to 5.
  • ...and 16 more figures
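
The clustering step mentioned in the Figure 2 and Figure 5 captions can be illustrated with a small example. This is a sketch under assumptions rather than the paper's procedure: it assumes the clustering algorithm is k-means (as the MacQueen citation suggests) and uses the batch, Lloyd-style variant; it represents each candidate prompt by a made-up 2-D embedding; and it picks the cluster with the highest mean score, which is only one plausible way to use the clusters. The names `kmeans` and `best_cluster` are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[Vector], k: int, iterations: int = 20) -> List[int]:
    """Batch (Lloyd-style) k-means; returns one cluster label per point."""
    # Deterministic init for the sketch: the first k points are the centroids.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


def best_cluster(labels: List[int], scores: List[float]) -> int:
    """Return the label of the cluster with the highest mean score."""
    def mean_score(j: int) -> float:
        values = [s for lab, s in zip(labels, scores) if lab == j]
        return sum(values) / len(values)
    return max(set(labels), key=mean_score)


# Made-up embeddings and scorer ratings for four candidate prompts.
embeddings = [(0.0, 0.0), (5.0, 5.0), (0.1, 0.2), (5.1, 4.9)]
scores = [0.2, 0.8, 0.3, 0.9]

labels = kmeans(embeddings, k=2)
chosen = best_cluster(labels, scores)
print(labels, chosen)
```

Grouping candidates before selecting lets a search concentrate later refinements on a region of prompt space that scores well on average, rather than on a single outlier.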