HeuriGym: An Agentic Benchmark for LLM-Crafted Heuristics in Combinatorial Optimization

Hongzheng Chen, Yingheng Wang, Yaohui Cai, Hins Hu, Jiajie Li, Shirley Huang, Chenhui Deng, Rongjian Liang, Shufeng Kong, Haoxing Ren, Samitha Samaranayake, Carla P. Gomes, Zhiru Zhang

TL;DR

HeuriGym introduces an agentic benchmark to evaluate LLM-crafted heuristics for combinatorial optimization by enabling iterative, tool-augmented problem solving and execution feedback. It defines the Quality-Yield Index (QYI) to jointly quantify solution quality and success rate across nine real-world CO tasks, revealing persistent gaps between state-of-the-art LLMs and expert baselines. The framework combines formal problem descriptions, self-contained prompt design, and an end-to-end evaluation loop with problem-specific verifiers, offering a practical, open-source testbed for advancing LLM-driven algorithm design. Key findings show iterative refinement boosts pass rates, but top models still achieve only moderate QYI (around 0.6), underscoring the need for longer context, self-verification, and more effective search strategies in AI agents for scientific and engineering problems.
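The propose, execute, and refine cycle described above can be sketched as a small loop. This is only an illustration under stated assumptions, not the benchmark's actual code: `refine_loop`, `propose`, and `verify` are hypothetical names, the LLM call is replaced by a stub, and stopping at the first passing program is a simplification (a real run may keep iterating to improve quality).

```python
from typing import Callable, List, Optional, Tuple

# (passed, message) as returned by a problem-specific verifier (hypothetical shape).
Feedback = Tuple[bool, str]


def refine_loop(
    propose: Callable[[Optional[str], Optional[str]], str],
    verify: Callable[[str], Feedback],
    max_iters: int = 3,
) -> Tuple[Optional[str], List[Feedback]]:
    """Ask for a heuristic, check it, and feed the verifier message back."""
    program: Optional[str] = None
    message: Optional[str] = None
    history: List[Feedback] = []
    for _ in range(max_iters):
        # The proposer sees the previous program and the last feedback message.
        program = propose(program, message)
        passed, message = verify(program)
        history.append((passed, message))
        if passed:
            # Assumption: stop at the first program that passes verification.
            return program, history
    return None, history


# Toy stand-ins for the LLM and the verifier, for illustration only.
def toy_propose(prev: Optional[str], feedback: Optional[str]) -> str:
    return "heuristic_v1" if prev is None else "heuristic_v2"


def toy_verify(program: str) -> Feedback:
    if program == "heuristic_v2":
        return True, "all constraints satisfied"
    return False, "constraint violated: dependency order"


final_program, history = refine_loop(toy_propose, toy_verify)
```

In this toy run the first proposal fails verification, its error message is passed back to the proposer, and the second proposal passes.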

Abstract

While Large Language Models (LLMs) have demonstrated significant advancements in reasoning and agent-based problem-solving, current evaluation methodologies fail to adequately assess their capabilities: existing benchmarks either rely on closed-ended questions prone to saturation and memorization, or subjective comparisons that lack consistency and rigor. In this work, we introduce HeuriGym, an agentic framework designed for evaluating heuristic algorithms generated by LLMs for combinatorial optimization problems, characterized by clearly defined objectives and expansive solution spaces. HeuriGym empowers LLMs to propose heuristics, receive evaluative feedback via code execution, and iteratively refine their solutions. We evaluate nine state-of-the-art models on nine problems across domains such as computer systems, logistics, and biology, exposing persistent limitations in tool use, planning, and adaptive reasoning. To quantify performance, we propose the Quality-Yield Index (QYI), a metric that captures both solution pass rate and quality. Even top models like GPT-o4-mini-high and Gemini-2.5-Pro attain QYI scores of only 0.6, well below the expert baseline of 1. Our open-source benchmark aims to guide the development of LLMs toward more effective and realistic problem-solving in scientific and engineering domains.
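To make the idea of a metric that captures both pass rate and quality concrete, here is a minimal sketch. The abstract does not give the formula, so everything below is an assumption: yield is taken as the fraction of instances with a valid solution, quality as the mean of the capped ratio of expert cost to model cost over those valid instances (minimization, positive costs), and the two are combined with a harmonic mean. The function name `qyi` and its arguments are hypothetical.

```python
from typing import Optional, Sequence


def qyi(costs: Sequence[Optional[float]], expert_costs: Sequence[float]) -> float:
    """Hypothetical Quality-Yield Index for a minimization problem.

    costs[i] is the model's cost on instance i, or None if the solution
    failed verification. expert_costs[i] is the expert baseline cost.
    All costs are assumed to be positive.
    """
    if len(costs) != len(expert_costs):
        raise ValueError("costs and expert_costs must have the same length")
    total = len(costs)
    # Capped quality ratio per valid instance: 1.0 means matching the expert.
    ratios = [
        min(1.0, expert / cost)
        for cost, expert in zip(costs, expert_costs)
        if cost is not None
    ]
    if total == 0 or not ratios:
        return 0.0
    yield_rate = len(ratios) / total      # fraction of instances that passed
    quality = sum(ratios) / len(ratios)   # average quality on passed instances
    # Assumption: harmonic mean, so a low value on either axis drags the score down.
    return 2 * quality * yield_rate / (quality + yield_rate)


# Four instances: one matches the expert, two are twice as costly, one fails.
score = qyi([10.0, 20.0, None, 40.0], [10.0, 10.0, 10.0, 20.0])
# An expert-matching run on every instance scores exactly 1.
expert_score = qyi([10.0], [10.0])
```

Under these assumptions a model that matches the expert on every instance scores 1, consistent with the expert baseline quoted above, while failing instances or worse costs both lower the score.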

Paper Structure

This paper contains 46 sections, 3 equations, 5 figures, 26 tables.

Figures (5)

  • Figure 1: Overview of the HeuriGym agentic framework for heuristic program generation, execution, and verification. We use operator scheduling as an example for the problem description.
  • Figure 2: Quality-Yield Index and estimated API cost of different models.
  • Figure 3: Error classifications.
  • Figure 4: Quality-Yield tradeoff.
  • Figure 5: One iterative example of GPT-o4-mini-high on the technology mapping problem.