Understanding the Importance of Evolutionary Search in Automated Heuristic Design with Large Language Models

Rui Zhang, Fei Liu, Xi Lin, Zhenkun Wang, Zhichao Lu, Qingfu Zhang

TL;DR

The paper investigates whether large language models (LLMs) alone can design effective heuristics for automated heuristic design (AHD), or whether coupling LLMs with evolutionary program search (EPS) is essential. It introduces a large-scale benchmark covering four AHD problems, four LLM-based EPS methods plus a simple ($1+1$)-EPS baseline, nine LLMs, and five independent runs, to analyze the necessity of search and the progress of LLM-based EPS. The key findings are that standalone LLMs, even with large query budgets or high model capacity, underperform search-augmented approaches, and that the performance of EPS methods depends on both the problem and the LLM, with no universally best method. The work highlights substantial search costs and variability across tasks, and advocates for more diverse benchmarks and open-source reproducibility to guide future EPS algorithm development and LLM usage in AHD applications.

Abstract

Automated heuristic design (AHD) has gained considerable attention for its potential to automate the development of effective heuristics. The recent advent of large language models (LLMs) has opened a new avenue for AHD, with initial efforts focusing on framing AHD as an evolutionary program search (EPS) problem. However, inconsistent benchmark settings, inadequate baselines, and a lack of detailed component analysis have left two questions inadequately justified: whether integrating LLMs with search strategies is necessary, and how much progress existing LLM-based EPS methods have truly achieved. This work seeks to answer these research questions by conducting a large-scale benchmark comprising four LLM-based EPS methods and four AHD problems across nine LLMs and five independent runs. Our extensive experiments yield meaningful insights, providing empirical grounding for the importance of evolutionary search in LLM-based AHD approaches, while also contributing to future EPS algorithm development. To foster accessibility and reproducibility, we have fully open-sourced our benchmark and corresponding results.
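
As a rough illustration of what a ($1+1$)-style search loop looks like, the sketch below keeps a single parent, asks a proposal function for one offspring per step, and keeps the offspring only if it scores no worse. This is a minimal sketch under stated assumptions, not the benchmark's implementation: the names `one_plus_one_eps`, `toy_propose`, and `toy_evaluate` are hypothetical, the "heuristic" is a single float rather than a program, and `toy_propose` is a random perturbation standing in for an LLM query. Lower scores are assumed to be better.

```python
import random
from typing import Callable, Tuple

# Fixed seed so the toy run is repeatable (illustrative choice only).
rng = random.Random(0)


def one_plus_one_eps(
    initial: float,
    propose: Callable[[float], float],
    evaluate: Callable[[float], float],
    budget: int,
) -> Tuple[float, float]:
    """(1+1)-style loop: one parent, one offspring per step, elitist replacement.

    `propose` stands in for an LLM call that rewrites the parent heuristic;
    `budget` is the number of proposals (queries) allowed.
    """
    parent, parent_score = initial, evaluate(initial)
    for _ in range(budget):
        child = propose(parent)
        child_score = evaluate(child)
        # Keep the offspring only if it is at least as good (lower is better).
        if child_score <= parent_score:
            parent, parent_score = child, child_score
    return parent, parent_score


def toy_propose(x: float) -> float:
    """Hypothetical stand-in for an LLM edit: a small random perturbation."""
    return x + rng.uniform(-1.0, 1.0)


def toy_evaluate(x: float) -> float:
    """Hypothetical objective with its minimum at x = 3."""
    return (x - 3.0) ** 2


if __name__ == "__main__":
    best, score = one_plus_one_eps(0.0, toy_propose, toy_evaluate, budget=200)
    print(f"best candidate: {best:.3f}, score: {score:.5f}")
```

Because the replacement rule is elitist, the returned score can never be worse than the score of the initial candidate, whatever the proposal function does.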

Paper Structure (23 sections, 1 equation, 9 figures, 9 tables, 1 algorithm)

This paper contains 23 sections, 1 equation, 9 figures, 9 tables, 1 algorithm.

Figures (9)

  • Figure 1: An illustration of the LLM-based EPS paradigm, with respect to the GP-based paradigm (top section), for automated heuristic design.
  • Figure 2: Box plot comparison on the performance of the top-{(a) $5$‰, (b) $1\%$} heuristics generated by GPT-3.5 under various query budgets. The performance is measured as the relative distance to the best-known optimum ($\Delta_{\rm d}$) aggregated over four AHD problems and five independent runs. Lower $\Delta_{\rm d}$ indicates better performance. The performance of the simple baseline ($1+1$)-EPS with GPT-3.5 under a small query budget of 500 is also provided as a reference.
  • Figure 3: Box plot comparison on the performance of the top-$5$‰ heuristics generated by LLMs with varying capacities under a query budget of 10,000. We group LLMs into two categories, distinguished by background shading: (1) LLMs specialized for coding tasks and (2) general-purpose LLMs. Within each group, the LLMs are arranged in ascending order of model size. The color scale of the boxes corresponds to the scores on HumanEval (Chen et al., 2021). The performance is measured as the relative distance to the best-known optimum ($\Delta_{\rm d}$) aggregated over four AHD problems and five independent runs. Lower $\Delta_{\rm d}$ indicates better performance. The performance of the simple baseline ($1+1$)-EPS with CodeLlama-7B is also provided as a reference.
  • Figure 4: Convergence curve comparison on the performance of the top-1 heuristics achieved by various EPS methods. The mean relative distances to the best-known optimum ($\Delta_{\rm d}$) averaged over five independent runs are denoted with markers, while the standard deviations of $\Delta_{\rm d}$ are shown with the shaded regions.
  • Figure 5: Radar plot comparison on the performance of the top-1 heuristics achieved by various EPS methods with different choices of LLMs. The radius of each vertex is calculated by the mean relative distances to the best-known optimum ($\Delta_{\rm d}$) averaged over five independent runs; hence, a smaller radius/enclosed area indicates better performance.
  • ...and 4 more figures
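
The captions above report performance as a relative distance to the best-known optimum, $\Delta_{\rm d}$, and compare only the top fraction of generated heuristics. The sketch below shows one plausible way to compute such quantities. It is an assumption, not the paper's definition: it takes $\Delta_{\rm d} = |f - f^{*}| / |f^{*}|$, where $f$ is a heuristic's score and $f^{*}$ the best-known value, and the function names `relative_distance` and `top_fraction` are hypothetical.

```python
from typing import List


def relative_distance(score: float, best_known: float) -> float:
    """Assumed gap |score - best_known| / |best_known|; 0.0 means the optimum is matched."""
    if best_known == 0:
        raise ValueError("best_known must be non-zero for a relative gap")
    return abs(score - best_known) / abs(best_known)


def top_fraction(gaps: List[float], fraction: float) -> List[float]:
    """Return the best (smallest-gap) `fraction` of heuristics, keeping at least one."""
    k = max(1, int(len(gaps) * fraction))
    return sorted(gaps)[:k]


if __name__ == "__main__":
    # Made-up scores for illustration only; these are not results from the paper.
    scores = [105.0, 110.0, 101.0, 120.0]
    gaps = [relative_distance(s, best_known=100.0) for s in scores]
    print(top_fraction(gaps, fraction=0.5))
```

Under this assumed definition, a lower value is better, which matches the reading of $\Delta_{\rm d}$ given in the captions.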