A Unified Evaluation-Instructed Framework for Query-Dependent Prompt Optimization

Ke Chen, Yifeng Wang, Hassan Almosapeeh, Haohan Wang

TL;DR

This paper addresses the fragmented nature of prompt quality evaluation by proposing a unified, metric-grounded evaluation framework that operates without executing prompts. It constructs a diverse prompt corpus, selects informative multi-dimensional metrics, and trains an execution-free evaluator that predicts prompt quality and downstream performance. The evaluator then informs a query-dependent optimization process, yielding stable, interpretable improvements across eight datasets and three backbone models. The approach demonstrates robust generalization to unseen domains and offers a portable, model-agnostic pipeline for practical prompt optimization in complex, multi-agent environments.

Abstract

Most prompt-optimization methods refine a single static template, making them ineffective in complex and dynamic user scenarios. Existing query-dependent approaches rely on unstable textual feedback or black-box reward models, providing weak and uninterpretable optimization signals. More fundamentally, prompt quality itself lacks a unified, systematic definition, resulting in fragmented and unreliable evaluation signals. Our approach first establishes a performance-oriented, systematic, and comprehensive prompt evaluation framework. We then develop and fine-tune an execution-free evaluator that predicts multi-dimensional quality scores directly from text. The evaluator in turn instructs a metric-aware optimizer that diagnoses failure modes and rewrites prompts in an interpretable, query-dependent manner. Our evaluator achieves the strongest accuracy in predicting prompt performance, and the evaluation-instructed optimization consistently surpasses both static-template and query-dependent baselines across eight datasets and three backbone models. Overall, we propose a unified, metric-grounded perspective on prompt quality and demonstrate that our evaluation-instructed optimization pipeline delivers stable, interpretable, and model-agnostic improvements across diverse tasks.
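
The evaluate, diagnose, rewrite loop described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the metric names, the weights, the heuristic `toy_evaluator` (a stand-in for the fine-tuned evaluator), and the rule-based `REWRITES` table (a stand-in for the metric-aware optimizer) are hypothetical and are not taken from the paper.

```python
from typing import Callable, Dict

# Hypothetical metrics and weights; this section does not list the paper's actual ones.
METRIC_WEIGHTS: Dict[str, float] = {
    "clarity": 0.5,
    "specificity": 0.3,
    "format_guidance": 0.2,
}


def toy_evaluator(prompt: str) -> Dict[str, float]:
    """Stand-in for an execution-free evaluator: scores in [0, 1] from text alone."""
    words = prompt.split()
    return {
        "clarity": 1.0 if len(words) <= 40 else 0.5,
        "specificity": min(1.0, len(words) / 10),
        "format_guidance": 1.0 if "answer" in prompt.lower() else 0.0,
    }


def overall_quality(scores: Dict[str, float]) -> float:
    """Weighted sum of the per-metric scores."""
    return sum(METRIC_WEIGHTS[name] * value for name, value in scores.items())


def weakest_metric(scores: Dict[str, float]) -> str:
    """Diagnose the failure mode as the lowest-scoring metric."""
    return min(scores, key=scores.get)


# Hypothetical rewrite rules, one per metric, standing in for an LLM-based rewriter.
REWRITES: Dict[str, Callable[[str], str]] = {
    "clarity": lambda p: "Follow these instructions step by step. " + p,
    "specificity": lambda p: p + " State every assumption and intermediate step.",
    "format_guidance": lambda p: p + " End with a line of the form 'Answer: <result>'.",
}


def optimize(prompt: str, max_rounds: int = 3, target: float = 0.9) -> str:
    """Score the prompt, repair its weakest metric, and repeat until the target is met."""
    for _ in range(max_rounds):
        scores = toy_evaluator(prompt)
        if overall_quality(scores) >= target:
            break
        prompt = REWRITES[weakest_metric(scores)](prompt)
    return prompt


if __name__ == "__main__":
    query_prompt = "Solve 12 * 7."
    print(weakest_metric(toy_evaluator(query_prompt)))
    print(optimize(query_prompt))
```

The point of the sketch is the control flow: the prompt is never executed against a backbone model during optimization, and each rewrite is tied to a named metric, which is what makes the change interpretable.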

Paper Structure

This paper contains 31 sections, 10 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Overview of our pipeline.
  • Figure 2: Feature importance of prompt evaluation metrics (XGBoost).
  • Figure 3: Learned metric weights of the evaluator.
  • Figure 4: Improvement of our optimized prompts over the LLM-only baseline across three backbone models (LLaMA-3, LLaMA-3.1, and GPT-4o). Positive gains are shown in green and negative changes in red, with color intensity reflecting model size.