ALE-Bench: A Benchmark for Long-Horizon Objective-Driven Algorithm Engineering

Yuki Imajuku, Kohki Horie, Yoichi Iwata, Kensho Aoki, Naohiro Takahashi, Takuya Akiba

TL;DR

ALE-Bench addresses the need for long-horizon, score-based optimization benchmarks by collecting AtCoder Heuristic Contest tasks and providing an interactive Python-based evaluation framework with reproducible environments. It lets AI agents iteratively refine solutions using test-run feedback and visualizations, much as human participants do. Experiments show that frontier LLMs can reach novice-to-intermediate human performance on some problems but struggle with consistency across problems and with long-horizon problem solving. The work also introduces ALE-Agent, a scaffolded agent that uses domain knowledge and diversity-driven search to improve results across problems, which illustrates the benchmark's utility for developing AI-assisted optimization.

Abstract

How well do AI systems perform in algorithm engineering for hard optimization problems in domains such as package-delivery routing, crew scheduling, factory production planning, and power-grid balancing? We introduce ALE-Bench, a new benchmark for evaluating AI systems on score-based algorithmic programming contests. Drawing on real tasks from the AtCoder Heuristic Contests, ALE-Bench presents optimization problems that are computationally hard and admit no known exact solution. Unlike short-duration, pass/fail coding benchmarks, ALE-Bench encourages iterative solution refinement over long time horizons. Our software framework supports interactive agent architectures that leverage test-run feedback and visualizations. Our evaluation of frontier LLMs revealed that while they demonstrate high performance on specific problems, a notable gap remains compared to humans in terms of consistency across problems and long-horizon problem-solving capabilities. This highlights the need for this benchmark to foster future AI advancements.

Paper Structure

This paper contains 55 sections, 2 equations, 11 figures, 13 tables.

Figures (11)

  • Figure 1: Overview of ALE-Bench. (Left) ALE-Bench collects past AtCoder Heuristic Contest tasks, hard optimization problems such as routing and scheduling with no known optimum, and ranks submitted programs by score. (Right) ALE-Bench covers evaluation from bare LLMs to scaffolded agents. An agent receives a task and submits code. It can optionally invoke test runs and visualization utilities during this process to iteratively refine its solution like a human participant.
  • Figure 2: Long-horizon score ascent in AHC. Scores at specific ranks at each time point over the two-week AHC014 contest show continual improvement. Line colors mark the color tiers, e.g., perf=2800 (6th) and perf=1200 (379th).
  • Figure 3: Rating and average performance distributions. Cumulative rating and average performance distribution for users with at least 5 participations as of May 1, 2025. Background colors indicate rating tiers.
  • Figure 4: Trends in public score and code file size in the iterative-refinement setting. The plot shows the progression of generated code file sizes alongside the corresponding public evaluation scores over a four-hour period. Points farther to the right represent later time points.
  • Figure 5: Investigation of contamination. For each model, a scatter plot is shown with contest end dates on the x-axis and performance on the y-axis. The red vertical line indicates the knowledge cutoff date of the model. For DeepSeek-R1, the release date is used instead because no official knowledge cutoff is available.
  • ...and 6 more figures