A Toolbox for Improving Evolutionary Prompt Search

Daniel Grießhaber, Maximilian Kimmich, Johannes Maucher, Ngoc Thang Vu

TL;DR

This paper tackles the high cost and fragility of evolutionary prompt optimization for LLMs with a modular framework that decomposes the process into initialization, evolution, evaluation, and selection. It uses an LLM as both the evolutionary operator and the judge, adds a human in the loop, and adopts chain-of-instructions (CoI) prompting to improve control and feedback granularity. Efficient evaluation strategies, including moment-based and parent-based early stopping and strategic data ordering, reduce computational overhead while preserving performance. Empirical results across diverse NLP tasks show that CoI prompting, combined with an LLM judge and human feedback, yields consistent improvements and better resource efficiency, and that the approach is robust across multiple LLMs. The authors also release their code to facilitate further research and application.

Abstract

Evolutionary prompt optimization has demonstrated effectiveness in refining prompts for LLMs. However, existing approaches lack robust operators and efficient evaluation mechanisms. In this work, we propose several key improvements to evolutionary prompt optimization that can partially generalize to prompt optimization in general: 1) decomposing evolution into distinct steps to enhance the evolution and its control, 2) introducing an LLM-based judge to verify the evolutions, 3) integrating human feedback to refine the evolutionary operator, and 4) developing more efficient evaluation strategies that maintain performance while reducing computational overhead. Our approach improves both optimization quality and efficiency. We release our code, enabling prompt optimization on new tasks and facilitating further research in this area.
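
To make the decomposition into initialization, evolution, evaluation, and selection more concrete, the following is a minimal Python sketch of such a loop. It is not the authors' released code. The names `evaluate`, `search`, `evolve`, and `is_correct` are hypothetical, the operator is a plain callable standing in for an LLM, and the early-stopping rule is only one plausible reading of "parent-based early stopping": evaluation of a child stops once it can no longer reach its parent's accuracy, even if every remaining sample were answered correctly. The judge, human feedback, and data ordering are left out.

```python
import random
from typing import Callable, List, Sequence, Tuple


def evaluate(prompt: str,
             samples: Sequence[str],
             is_correct: Callable[[str, str], bool],
             parent_score: float = 0.0) -> Tuple[float, int]:
    """Return (accuracy, samples_used) for a prompt on a non-empty sample set.

    Assumed early-stopping rule: stop as soon as the best accuracy still
    attainable falls below the parent's accuracy.
    """
    total = len(samples)
    correct = 0
    for used, sample in enumerate(samples, start=1):
        correct += is_correct(prompt, sample)
        remaining = total - used
        # Upper bound on the final accuracy if all remaining samples succeed.
        if (correct + remaining) / total < parent_score:
            return correct / total, used
    return correct / total, total


def search(initial: List[str],
           evolve: Callable[[str, str], str],
           samples: Sequence[str],
           is_correct: Callable[[str, str], bool],
           generations: int = 3,
           seed: int = 0) -> Tuple[str, float]:
    """Run a toy evolutionary prompt search and return (best_prompt, score)."""
    rng = random.Random(seed)
    # Initialization: score every starting prompt on the full sample set.
    population = [(p, evaluate(p, samples, is_correct)[0]) for p in initial]
    for _ in range(generations):
        next_population = []
        for parent, parent_score in population:
            # Evolution: the operator (an LLM in practice, a stub here)
            # proposes a child from the parent and a randomly chosen donor.
            donor = rng.choice(population)[0]
            child = evolve(parent, donor)
            # Evaluation: score the child, stopping early if it cannot
            # reach the parent's score.
            child_score, _ = evaluate(child, samples, is_correct, parent_score)
            # Selection: keep the child only if it strictly beats its parent.
            if child_score > parent_score:
                next_population.append((child, child_score))
            else:
                next_population.append((parent, parent_score))
        population = next_population
    return max(population, key=lambda pair: pair[1])
```

In this sketch an early-stopped child always scores below its parent, so it is always rejected in selection and the partial score never enters the population. With a real LLM operator, `evolve` would wrap a model call and `is_correct` would compare the model's answer against a gold label.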

Paper Structure

This paper contains 36 sections, 2 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: An overview of the individual components ascribed to our proposed method (blue) compared to only using a one-step instruction for the operator (green).
  • Figure 2: Box plot of the effectiveness of various LLMs, based on performance metrics across our evaluation set. The models are listed in section \ref{sec:experimental-setup-other-models}. The y-axis shows the relative improvement in performance when CoI and the judge are used. Mean performance improves consistently across all tested models, with Gemma benefiting the most.
  • Figure 3: An example of the first step of evolution for DE: the expected response mentions mutations for all spotted differences (marked in red) and omits the similarities as well as the last statement, which is evidently wrong (struck through in red). Demonstration samples for in-context learning are omitted for clarity.