An In-depth Study of LLM Contributions to the Bin Packing Problem

Julien Herrmann, Guillaume Pallez

TL;DR

The paper reassesses claims that LLM-driven genetic frameworks yield genuine mathematical discoveries in online bin packing. It analyzes the interpretability of LLM-produced heuristics, showing that c12 is relatively readable while c14 and EoH rely on opaque, empirically determined behaviors. The authors propose a simple two-parameter abstraction ab-FirstFit/ab-WorstFit/ab-BestFit that generalizes across distributions and often outperforms the LLM-derived rules, arguing that the observed gains arise from empirical optimization rather than conceptual breakthroughs. They conclude that, for bin packing, LLMs expand the search space but do not yet deliver robust, transferable mathematical insights; rigorous validation and contextualization are essential for claims of discovery.

Abstract

Recent studies have suggested that Large Language Models (LLMs) could provide interesting ideas contributing to mathematical discovery. This claim was motivated by reports that LLM-based genetic algorithms produced heuristics offering new insights into the online bin packing problem under uniform and Weibull distributions. In this work, we reassess this claim through a detailed analysis of the heuristics produced by LLMs, examining both their behavior and interpretability. Despite being human-readable, these heuristics remain largely opaque even to domain experts. Building on this analysis, we propose a new class of algorithms tailored to these specific bin packing instances. The derived algorithms are significantly simpler, more efficient, more interpretable, and more generalizable, suggesting that the considered instances are themselves relatively simple. We then discuss the limitations of the claim regarding LLMs' contribution to this problem, which appears to rest on the mistaken assumption that the instances had previously been studied. Our findings instead emphasize the need for rigorous validation and contextualization when assessing the scientific value of LLM-generated outputs.

Paper Structure

This paper contains 32 sections, 2 equations, 13 figures.

Figures (13)

  • Figure 1: Priority provided by c12. The $y$-axis uses a nonlinear scale to highlight the differences in priority.
  • Figure 2: Boxplots showing the heuristics' performance relative to BestFit over 1000 instances of the $\textsc{Uniform}(20,100)$ distribution with 500 items and bin capacity 150.
  • Figure 3: Average performance of c12 relative to BestFit over 1000 instances of the $\textsc{Uniform}(20,100)$ distribution, as the number of items increases (bin capacity = 150).
  • Figure 4: Plotting $f: s,b\mapsto \texttt{score}(s,b)$, where score is the function from line 2 of c14. In these plots we use: $c=100$, $s_1=20$, $s_2=50$. There is an inflexion point at $\frac{c}{1+\frac{1}{s}+\frac{1}{s^2}}$.
  • Figure 5: Comparing the item-by-item behavior of c14 and WorstFit on the $\textsc{Weibull}(3.0,45)$ distribution with 50k items and bin capacity 100.
  • ...and 8 more figures
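
The captions above compare heuristics against the classical BestFit and WorstFit baselines on random instances. As a rough illustration of that setup, here is a minimal sketch of the three textbook online rules (FirstFit, BestFit, WorstFit) run on one instance shaped like the one in the Figure 2 caption. It does not reproduce c12, c14, EoH, or the authors' ab-* algorithms, whose definitions are not given in this excerpt. The helper names (`pack`, `uniform_instance`) are hypothetical, and drawing integer sizes from the inclusive range 20 to 100 is an assumption about what $\textsc{Uniform}(20,100)$ means.

```python
import random


def pack(items, capacity, choose):
    """Online packing: each item is placed on arrival and never moved.

    `choose(feasible, free)` picks one index among the bins that can
    still hold the item; a new bin is opened when none can.
    Returns the number of bins used.
    """
    free = []  # remaining capacity of each open bin
    for size in items:
        feasible = [i for i, room in enumerate(free) if room >= size]
        if feasible:
            free[choose(feasible, free)] -= size
        else:
            free.append(capacity - size)
    return len(free)


def first_fit(feasible, free):
    # Earliest-opened bin that fits.
    return feasible[0]


def best_fit(feasible, free):
    # Bin left with the least remaining room after placement.
    return min(feasible, key=lambda i: free[i])


def worst_fit(feasible, free):
    # Bin left with the most remaining room after placement.
    return max(feasible, key=lambda i: free[i])


def uniform_instance(n_items, low, high, rng):
    # Assumption: integer item sizes, both bounds inclusive.
    return [rng.randint(low, high) for _ in range(n_items)]


if __name__ == "__main__":
    rng = random.Random(0)
    items = uniform_instance(500, 20, 100, rng)
    baseline = pack(items, 150, best_fit)
    for name, rule in [("FirstFit", first_fit), ("WorstFit", worst_fit)]:
        used = pack(items, 150, rule)
        print(f"{name}: {used} bins, ratio to BestFit = {used / baseline:.3f}")
```

Passing the selection rule as a function keeps the packing loop identical across heuristics, so any difference in bin count comes only from the choice among feasible bins.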