GAAPO: Genetic Algorithmic Applied to Prompt Optimization
Xavier Sécheresse, Jacques-Yves Guilbert--Ly, Antoine Villedieu de Torcy
TL;DR
GAAPO addresses automated prompt optimization for LLMs by combining a genetic algorithm with multiple prompt-generation strategies in a single evolutionary framework. The approach is evaluated on ETHOS, MMLU-Pro, and GPQA, where it shows superior validation performance and competitive generalization compared with baselines such as APO, OPRO, and Mutator. The study also analyzes the effects of population size, selection methods, and model-specific prompt generators. Key contributions include a modular architecture that integrates forced and random evolution strategies, an evaluation scheme that uses bandit and successive halving (SH) selection to reduce computational cost, and cross-model analyses that highlight trade-offs between performance and generalization. The work offers practical insights into automatic prompt optimization and presents GAAPO as a flexible, extensible platform for LLM prompting across tasks and models.
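To make the cost-reduction idea concrete, here is a minimal sketch of successive halving applied to prompt evaluation. It is an illustration under stated assumptions, not the paper's implementation: the function `successive_halving`, the callback `is_correct`, the `initial_budget` parameter, and the toy `quality` table are all hypothetical, and the bandit variant is not shown. Each round scores the surviving prompts on a small batch of examples, drops the worse half, and doubles the batch, so weak prompts are discarded before they consume a full evaluation.

```python
from typing import Callable, List, Sequence


def successive_halving(
    candidates: Sequence[str],
    examples: Sequence[int],
    is_correct: Callable[[str, int], bool],  # hypothetical: did the prompt solve this example?
    initial_budget: int = 4,
) -> str:
    """Return the prompt that survives repeated halving of the candidate pool."""
    if not candidates or not examples:
        raise ValueError("need at least one candidate and one example")
    survivors: List[str] = list(candidates)
    budget = initial_budget
    while len(survivors) > 1:
        # Evaluate only on the first `budget` examples in this round.
        batch = examples[: min(budget, len(examples))]
        accuracy = {
            p: sum(is_correct(p, ex) for ex in batch) / len(batch) for p in survivors
        }
        # Keep the better half (at least one prompt), then double the budget.
        survivors.sort(key=accuracy.get, reverse=True)
        survivors = survivors[: max(1, len(survivors) // 2)]
        budget *= 2
    return survivors[0]


# Toy setup (an assumption for illustration): prompt "A" solves 9 of every
# 10 examples, "B" 5, "C" 3, and "D" 1.
quality = {"A": 9, "B": 5, "C": 3, "D": 1}
examples = list(range(32))
winner = successive_halving(
    list(quality), examples, lambda p, ex: (ex % 10) < quality[p]
)
```

In this toy run the two weakest prompts are eliminated after only four examples each, and only the two finalists are scored on the larger batch.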
Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks, with their performance heavily dependent on the quality of input prompts. While prompt engineering has proven effective, it typically relies on manual adjustments, making it time-consuming and potentially suboptimal. This paper introduces GAAPO (Genetic Algorithm Applied to Prompt Optimization), a novel hybrid optimization framework that leverages genetic algorithm principles to evolve prompts through successive generations. Unlike traditional genetic approaches that rely solely on mutation and crossover operations, GAAPO integrates multiple specialized prompt generation strategies within its evolutionary framework. Through extensive experimentation on diverse datasets including ETHOS, MMLU-Pro, and GPQA, our analysis reveals several important points for the future development of automatic prompt optimization methods, including the importance of the trade-off between population size and number of generations, the effect of selection methods on the stability of results, and the ability of different LLMs, especially reasoning models, to automatically generate prompts from similar queries. Furthermore, we provide insights into the relative effectiveness of different prompt generation strategies and their evolution across optimization phases. These findings contribute to both the theoretical understanding of prompt optimization and practical applications in improving LLM performance.
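The evolutionary loop described in the abstract can be sketched roughly as follows. This is a toy illustration, not GAAPO's actual code: the strategy functions `mutate` and `crossover`, the `evolve` driver, its parameters, and the keyword-counting `toy_score` are all assumptions introduced here. In a real setting the score would be validation accuracy of an LLM using the prompt, and the strategies would themselves call an LLM to rewrite prompts.

```python
import random
from typing import Callable, List, Sequence

# A strategy takes the selected parent prompts and returns one new candidate.
Strategy = Callable[[List[str], random.Random], str]


def mutate(parents: List[str], rng: random.Random) -> str:
    """Hypothetical strategy: append a random instruction to one parent."""
    extras = ["Think step by step.", "Answer concisely.", "Justify the label."]
    return f"{rng.choice(parents)} {rng.choice(extras)}"


def crossover(parents: List[str], rng: random.Random) -> str:
    """Hypothetical strategy: first half of one parent plus second half of another."""
    a = rng.choice(parents).split()
    b = rng.choice(parents).split()
    return " ".join(a[: len(a) // 2] + b[len(b) // 2:])


def evolve(
    seeds: Sequence[str],
    score: Callable[[str], float],
    strategies: Sequence[Strategy],
    generations: int = 5,
    population_size: int = 8,
    n_parents: int = 3,
    seed: int = 0,
) -> str:
    rng = random.Random(seed)
    population = list(seeds)
    for _ in range(generations):
        # Selection: keep the best-scoring prompts as parents (elitism).
        parents = sorted(population, key=score, reverse=True)[:n_parents]
        # Generation: each child comes from a randomly chosen strategy.
        children = [
            rng.choice(strategies)(parents, rng)
            for _ in range(population_size - len(parents))
        ]
        population = parents + children
    return max(population, key=score)


def toy_score(prompt: str) -> int:
    """Stand-in for validation accuracy: counts useful keywords in the prompt."""
    keywords = ("step", "concisely", "justify")
    return sum(k in prompt.lower() for k in keywords)


seeds = ["Classify the text as hate speech or not.", "Is the following text hateful?"]
best = evolve(seeds, toy_score, [mutate, crossover])
```

Because the parents are carried over unchanged each generation, the best score in the population never decreases. The `population_size` and `generations` arguments expose the trade-off the abstract mentions: for a fixed evaluation budget, a larger population leaves fewer generations, and vice versa.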
