Evolutionary Pre-Prompt Optimization for Mathematical Reasoning

Mathurin Videau, Alessandro Leite, Marc Schoenauer, Olivier Teytaud

TL;DR

This paper explores the optimization of example selection for designing effective CoT pre-prompts and shows that the choice of the optimization algorithm, typically in favor of comparison-based methods such as evolutionary computation, significantly enhances efficacy and feasibility.

Abstract

Recent advancements have highlighted that large language models (LLMs), when given a small set of task-specific examples, demonstrate remarkable proficiency, a capability that extends to complex reasoning tasks. In particular, the combination of few-shot learning with the chain-of-thought (CoT) approach has been pivotal in steering models towards more logically consistent conclusions [Wei et al. 2022b]. This paper explores the optimization of example selection for designing effective CoT pre-prompts and shows that the choice of the optimization algorithm, typically in favor of comparison-based methods such as evolutionary computation, significantly enhances efficacy and feasibility. Specifically, thanks to an optimization that limits exploitation and overfitting, Evolutionary Pre-Prompt Optimization (EPPO) improves on the naive few-shot approach by more than 10 absolute points in exact match scores on benchmark datasets such as GSM8k and MathQA. These gains are consistent across various contexts and are further amplified when integrated with self-consistency (SC).
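
The idea of comparison-based example selection can be illustrated with a minimal sketch. It is not the paper's implementation: the (1+1)-style loop, the function names (`mutate`, `one_plus_one_selection`, `toy_score`), and the toy scoring function are all assumptions made for illustration. In a real setting, `score` would be the exact match rate of an LLM on a training set when prompted with the selected examples.

```python
import random


def mutate(selection, pool_size, rng):
    """Swap one selected example index for an index not currently selected.

    Requires pool_size > len(selection) so that a replacement exists.
    """
    child = list(selection)
    position = rng.randrange(len(child))
    unused = [i for i in range(pool_size) if i not in child]
    child[position] = rng.choice(unused)
    return child


def one_plus_one_selection(pool_size, shots, budget, score, seed=0):
    """Hypothetical (1+1)-style search over few-shot example subsets.

    Each iteration uses only one comparison: is the child at least as
    good as the best pre-prompt so far? `budget` counts these comparisons.
    """
    rng = random.Random(seed)
    best = rng.sample(range(pool_size), shots)
    best_score = score(best)
    for _ in range(budget):
        child = mutate(best, pool_size, rng)
        child_score = score(child)
        if child_score >= best_score:
            best, best_score = child, child_score
    return best, best_score


def toy_score(selection):
    """Stand-in for an exact match rate (assumption: no LLM is called here)."""
    return sum(selection) / 100


if __name__ == "__main__":
    best, best_score = one_plus_one_selection(
        pool_size=20, shots=4, budget=50, score=toy_score, seed=1
    )
    print(sorted(best), best_score)
```

Because the search only ever learns whether a candidate beats the current best, it extracts little information from the training set per iteration, which is the intuition behind the limited-overfitting argument.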

Paper Structure

This paper contains 30 sections, 2 theorems, 21 equations, 10 figures, 8 tables, 1 algorithm.

Key Result

lemma 1

Assume Hypothesis H1 holds. Let $\mathcal{R}=\{r_1,\dots,r_M\}$ be a finite set of pre-prompts and let $r$ be any (possibly randomized, data-dependent) element taking values in $\mathcal{R}$. Then for any $\varepsilon>0$,

Figures (10)

  • Figure 1: Overview of the proposed CoT optimization process. See the EPPO section and Algorithm 1.
  • Figure 2: Comparison of 2-shot, 4-shot, and 8-shot pre-prompts: training score over the EPPO run for LLaMA2-70B. Each boxplot represents the loss values observed during the whole run. The numbers represent the percentage of exact matches on the training set. Typically, for each number of shots, the bottom part (low EM) corresponds to the beginning of the run, similar to random search, and the performance of these initial few-shot pre-prompts increases greatly with the number of shots.
  • Figure 3: Results for Llama 70B: exact matches on the test set, without Self-Consistency. Legends are the Nevergrad algorithm names, prefixed with "s#", where # is the few-shot size $s$ ($s=8$ is the default unless otherwise mentioned). For all algorithms, $\kappa=2$, so the x-axis is exactly the number of binary comparisons between pre-prompts, i.e., budget $b$ in Algorithm 1. Figure 4 presents the results of the transfer to SVAMP. Left: Downsampled GSM8k. For $s\in\{12,16\}$, we observe clear overfitting: test performance decreases as the budget increases, consistent with the mathematical analysis. We also observe no overfitting for Random Search, which grows steadily with $s \in \{4,8,12,16\}$. Note that the longest run used 16 GPUs for $\simeq$ 30 hours. Right: Full GSM8k. As only one bit of information is used per iteration (comparison with the best so far), we observe no overfitting until budget 150. Note that the longest run here used 160 GPUs for $\simeq$ 48 hours.
  • Figure 4: Transfer of pre-prompts optimized on full GSM8k (see caption of Figure 3, Right) to SVAMP. We observe a good transfer in this context.
  • Figure 5: Comparison of the number of steps in CoT, Long CoT, and EPPO for GSM8k (4-shot) and MathQA (8-shot) for LLaMA2-70B. The variance is higher for EPPO, with few-shot resulting in more steps in the output.
  • ...and 5 more figures

Theorems & Definitions (4)

  • lemma 1: Union bound for selection over a finite candidate set
  • proof
  • theorem 1: Generalization risk bounds for EPPO under limited feedback
  • proof