Beyond the Hype: Benchmarking LLM-Evolved Heuristics for Bin Packing

Kevin Sim, Quentin Renau, Emma Hart

TL;DR

The paper tackles the problem of evaluating LLM-driven heuristic design for combinatorial optimisation, focusing on online bin packing. It benchmarks five hand-designed heuristics against five LLM-generated ones across a large, diverse set of 12 datasets (6,064 instances) using three metrics, and employs Instance Space Analysis to visualise where in the instance space each heuristic performs well. The study finds that a hand-designed Best-Fit heuristic generally dominates in overall performance, while several LLM-generated heuristics are strong in specific regions but fail to generalise across distributions; tuning offers limited gains and some robustness, whereas certain LLM methods are highly specialised. The authors advocate rigorous, diverse benchmarking to assess the real value of LLM-evolved heuristics, given the substantial cost of generating them and their potentially limited generalisation.
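For readers unfamiliar with the hand-designed baseline, the following is a minimal sketch of the textbook online Best-Fit rule: each arriving item goes into the open bin that would be left with the least free space, and a new bin is opened only if no open bin can hold it. The function name `best_fit` and the example item sizes are illustrative assumptions, not code from the paper.

```python
def best_fit(items, capacity):
    """Pack items one at a time (online) using the Best-Fit rule.

    Returns the remaining free space of every bin that was opened.
    Illustrative sketch only; not the implementation used in the paper.
    """
    remaining = []  # free space left in each open bin
    for size in items:
        if size > capacity:
            raise ValueError("item larger than bin capacity")
        best = None  # index of the tightest bin that still fits the item
        for i, free in enumerate(remaining):
            if size <= free and (best is None or free < remaining[best]):
                best = i
        if best is None:
            remaining.append(capacity - size)  # open a new bin
        else:
            remaining[best] -= size
    return remaining


# Hypothetical instance: six items, bin capacity 10.
bins = best_fit([4, 8, 1, 4, 2, 1], capacity=10)
print(len(bins))  # number of bins used
```

The linear scan keeps the sketch short; a sorted structure over free space would make each placement faster on large instances.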

Abstract

Coupling Large Language Models (LLMs) with Evolutionary Algorithms has recently shown significant promise as a technique to design new heuristics that outperform existing methods, particularly in the field of combinatorial optimisation. An escalating arms race is both rapidly producing new heuristics and improving the efficiency of the processes evolving them. However, driven by the desire to quickly demonstrate the superiority of new approaches, evaluation of the new heuristics produced for a specific domain is often cursory: testing on very few datasets in which instances all belong to a specific class from the domain, and on few instances per class. Taking bin-packing as an example, to the best of our knowledge we conduct the first rigorous benchmarking study of new LLM-generated heuristics, comparing them to well-known existing heuristics across a large suite of benchmark instances using three performance metrics. For each heuristic, we then evolve new instances won by the heuristic and perform an instance space analysis to understand where in the feature space each heuristic performs well. We show that most of the LLM heuristics do not generalise well when evaluated across a broad range of benchmarks in contrast to existing simple heuristics, and suggest that any gains from generating very specialist heuristics that only work in small areas of the instance space need to be weighed carefully against the considerable cost of generating these heuristics.

Paper Structure (11 sections, 4 figures, 4 tables)

Figures (4)

  • Figure 1: Code of LLM-evolved heuristics trained. FS1, FS2, FSW are adapted from code provided in romera2024mathematical, $EoC$ and $EoH$ from code provided in liu2024evolution.
  • Figure 2: (a-c): Heatmaps showing heuristic performance per dataset for 3 metrics: average excess bins; Falkenauer fitness; percentage of instances won. Red boxes highlight the best values by column. (d): Histogram showing the summed excess bins per heuristic over all datasets.
  • Figure 3: Boxplots showing aggregated heuristic performance over all datasets for 2 metrics: average excess bins; percentage of instances won.
  • Figure 4: Perspectives on the ISA projection of the instance space: (a) instances won by heuristics generated by an LLM vs hand-designed, (b) LLM heuristics trained on Weibull or OR distributions, (c) instances won by LLM heuristics originally evolved on datasets from ORLib with BF for comparison, and (d) instances won by LLM heuristics originally evolved on datasets with Weibull distributions, with BF for comparison.
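
The three metrics named in the Figure 2 caption can be sketched as follows. This excerpt does not give the paper's exact definitions, so several choices here are assumptions: excess bins are measured against the simple lower bound ceil(sum of item sizes / capacity), Falkenauer fitness uses the commonly cited exponent k = 2, and an instance counts as "won" by every heuristic that ties for the fewest bins. All function names are hypothetical.

```python
import math


def excess_bins(bins_used, items, capacity):
    """Bins used above the lower bound ceil(sum(items) / capacity) (assumed baseline)."""
    return bins_used - math.ceil(sum(items) / capacity)


def falkenauer_fitness(bin_loads, capacity, k=2):
    """Mean of (load / capacity) ** k over the bins used; 1.0 means every bin is full."""
    return sum((load / capacity) ** k for load in bin_loads) / len(bin_loads)


def percent_won(results):
    """Percentage of instances on which each heuristic uses the fewest bins.

    `results` holds one dict per instance, mapping heuristic name -> bins used.
    Ties are credited to every tied heuristic (an assumption).
    """
    wins = {}
    for per_heuristic in results:
        best = min(per_heuristic.values())
        for name, bins_used in per_heuristic.items():
            wins[name] = wins.get(name, 0) + (bins_used == best)
    return {name: 100.0 * w / len(results) for name, w in wins.items()}


# Hypothetical results for two instances and two heuristics.
print(excess_bins(3, [6, 6, 6], capacity=10))
print(falkenauer_fitness([5, 5], capacity=10))
print(percent_won([{"BF": 2, "FF": 3}, {"BF": 4, "FF": 4}]))
```

Lower is better for excess bins, while higher is better for the other two metrics, which is worth keeping in mind when reading the heatmaps.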