Diverse Prompts: Illuminating the Prompt Space of Large Language Models with MAP-Elites

Gabriel Machado Santos, Rita Maria da Silva Julia, Marcelo Zanchetta do Nascimento

TL;DR

This work addresses how prompt structure affects LLM task performance and introduces a CFG-based representation combined with MAP-Elites to systematically map the prompt space for quality and diversity. The approach yields diverse, high-performing prompts across seven BigBench Lite tasks and four sub-10B LLMs, revealing task-dependent structure-performance relationships. The findings offer practical guidance for adaptive prompt design and demonstrate the potential of quality-diversity search to enhance in-context learning. Overall, the framework provides a scalable, principled method for exploring and exploiting prompt architectures in real-world NLP settings.

Abstract

Prompt engineering is essential for optimizing large language models (LLMs), yet the link between prompt structures and task performance remains underexplored. This work introduces an evolutionary approach that combines context-free grammar (CFG) with the MAP-Elites algorithm to systematically explore the prompt space. Our method prioritizes quality and diversity, generating high-performing and structurally varied prompts while analyzing their alignment with diverse tasks by varying traits such as the number of examples (shots) and reasoning depth. By systematically mapping the phenotypic space, we reveal how structural variations influence LLM performance, offering actionable insights for task-specific and adaptable prompt design. Evaluated on seven BigBench Lite tasks across multiple LLMs, our results underscore the critical interplay of quality and diversity, advancing the effectiveness and versatility of LLMs.
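
To make the search loop described in the abstract concrete, the following is a minimal sketch of MAP-Elites over a toy prompt grammar. It is not the authors' implementation. The value ranges in `SHOTS`, `DEPTHS`, and `ROLES`, the `render` template, and every function name are hypothetical. `toy_fitness` is an arbitrary placeholder for the step where a real system would send the rendered prompt to an LLM and score its answers on a task. The archive is indexed by number of examples and reasoning depth, the two axes used in Figure 1.

```python
import random

# Hypothetical structural choices; the ranges are assumptions for illustration.
SHOTS = (0, 1, 2, 3)    # number of in-context examples
DEPTHS = (0, 1, 2)      # reasoning-depth levels
ROLES = (False, True)   # whether a role/context preamble is included


def random_genotype(rng):
    """Sample one prompt structure as a (shots, depth, role) tuple."""
    return (rng.choice(SHOTS), rng.choice(DEPTHS), rng.choice(ROLES))


def mutate(genotype, rng):
    """Resample one structural choice (the child may equal the parent)."""
    options = (SHOTS, DEPTHS, ROLES)
    index = rng.randrange(len(options))
    child = list(genotype)
    child[index] = rng.choice(options[index])
    return tuple(child)


def render(genotype, question):
    """Expand a genotype into prompt text using a made-up template."""
    shots, depth, role = genotype
    parts = []
    if role:
        parts.append("You are a careful reasoner.")
    parts.extend(f"Example {i + 1}: <demo {i + 1}>" for i in range(shots))
    parts.extend(["Think step by step."] * depth)
    parts.append(f"Question: {question}")
    return "\n".join(parts)


def descriptor(genotype):
    """Map a genotype to its archive cell: (number of examples, depth)."""
    shots, depth, _ = genotype
    return (shots, depth)


def toy_fitness(genotype):
    """Arbitrary placeholder score in [0, 1]; NOT a measured LLM accuracy."""
    shots, depth, role = genotype
    return ((3 * shots + 5 * depth + 2 * int(role)) % 7) / 6


def map_elites(fitness, iterations, n_init, seed):
    """Keep the best-scoring genotype found so far in each descriptor cell."""
    rng = random.Random(seed)
    archive = {}  # cell -> (genotype, score)
    for step in range(iterations):
        if step < n_init or not archive:
            candidate = random_genotype(rng)  # random initialisation phase
        else:
            parent, _ = archive[rng.choice(sorted(archive))]
            candidate = mutate(parent, rng)   # vary an existing elite
        score = fitness(candidate)
        cell = descriptor(candidate)
        if cell not in archive or score > archive[cell][1]:
            archive[cell] = (candidate, score)
    return archive


archive = map_elites(toy_fitness, iterations=200, n_init=20, seed=0)
for cell, (genotype, score) in sorted(archive.items()):
    print(cell, genotype, round(score, 2))
```

Because each cell keeps its own elite, the result is a map of the best structure found per combination of example count and reasoning depth rather than a single best prompt, which is what allows structure and performance to be compared across the feature space.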

Paper Structure

This paper contains 19 sections, 4 equations, 2 figures, 3 tables, 1 algorithm.

Figures (2)

  • Figure 1: Feature space coverage for the Logical Deduction (LD3) dataset using MAP-Elites (left) and Random Search (right). The x-axis represents the number of examples, the y-axis denotes reasoning depth, and points indicate individuals, colored by performance (0.0–1.0) and shaped by role context inclusion. High performance was achieved only with zero-shot prompts, emphasizing the effectiveness of simpler designs.
  • Figure 2: Heatmap showing correlations between prompt features and LLM performance across datasets. Significant correlations ($p < 0.05$) are marked with $^{1}$.