An In-depth Study of LLM Contributions to the Bin Packing Problem
Julien Herrmann, Guillaume Pallez
TL;DR
The paper reassesses claims that LLM-driven genetic frameworks yield genuine mathematical discoveries in online bin packing. It analyzes the interpretability of LLM-produced heuristics, showing that c12 is relatively readable while c14 and EoH rely on opaque, empirically determined behaviors. The authors propose a simple two-parameter abstraction (ab-FirstFit, ab-WorstFit, ab-BestFit) that generalizes across distributions and often outperforms the LLM-derived rules, and they argue that the observed gains arise from empirical optimization rather than conceptual breakthroughs. They conclude that, for bin packing, LLMs expand the search space but do not yet deliver robust, transferable mathematical insights, and that rigorous validation and contextualization are essential for claims of discovery.
Abstract
Recent studies have suggested that Large Language Models (LLMs) could provide interesting ideas contributing to mathematical discovery. This claim was motivated by reports that LLM-based genetic algorithms produced heuristics offering new insights into the online bin packing problem under uniform and Weibull distributions. In this work, we reassess this claim through a detailed analysis of the heuristics produced by LLMs, examining both their behavior and interpretability. Despite being human-readable, these heuristics remain largely opaque even to domain experts. Building on this analysis, we propose a new class of algorithms tailored to these specific bin packing instances. The derived algorithms are significantly simpler, more efficient, more interpretable, and more generalizable, suggesting that the considered instances are themselves relatively simple. We then discuss the limitations of the claim regarding LLMs' contribution to this problem, which appears to rest on the mistaken assumption that the instances had previously been studied. Our findings instead emphasize the need for rigorous validation and contextualization when assessing the scientific value of LLM-generated outputs.
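For readers unfamiliar with the setting, the following is a minimal sketch of online bin packing with the three classical placement rules (First Fit, Best Fit, Worst Fit) that the names ab-FirstFit, ab-BestFit, and ab-WorstFit presumably build on. It is not the authors' two-parameter algorithm, whose parameters are not specified in this section, and it does not reproduce any LLM-generated heuristic. The function names, the integer item sizes, and the capacity of 10 are illustrative assumptions.

```python
from typing import Callable, List, Optional

# A placement rule receives the current bin loads, the incoming item size,
# and the bin capacity. It returns the index of the chosen bin, or None
# when a new bin must be opened. (Hypothetical interface for illustration.)
Rule = Callable[[List[int], int, int], Optional[int]]


def first_fit(bins: List[int], item: int, capacity: int) -> Optional[int]:
    """Pick the earliest opened bin that can hold the item."""
    for i, load in enumerate(bins):
        if load + item <= capacity:
            return i
    return None


def best_fit(bins: List[int], item: int, capacity: int) -> Optional[int]:
    """Pick the fullest bin that can still hold the item."""
    best = None
    for i, load in enumerate(bins):
        if load + item <= capacity and (best is None or load > bins[best]):
            best = i
    return best


def worst_fit(bins: List[int], item: int, capacity: int) -> Optional[int]:
    """Pick the emptiest bin that can still hold the item."""
    worst = None
    for i, load in enumerate(bins):
        if load + item <= capacity and (worst is None or load < bins[worst]):
            worst = i
    return worst


def pack_online(items: List[int], capacity: int, rule: Rule) -> List[int]:
    """Place items one at a time, never moving an item once placed."""
    bins: List[int] = []
    for item in items:
        i = rule(bins, item, capacity)
        if i is None:
            bins.append(item)  # open a new bin for this item
        else:
            bins[i] += item
    return bins


if __name__ == "__main__":
    items = [6, 5, 4, 3, 2]  # illustrative instance, not from the paper
    for rule in (first_fit, best_fit, worst_fit):
        loads = pack_online(items, 10, rule)
        print(rule.__name__, len(loads), loads)
```

On this small instance, First Fit and Best Fit both end with two full bins, while Worst Fit opens a third. The gap illustrates why the choice of placement rule matters even before any distribution-specific tuning.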
