Idea First, Code Later: Disentangling Problem Solving from Code Generation in Evaluating LLMs for Competitive Programming

Sama Hadhoud, Alaa Elsetohy, Frederikus Hudi, Jan Christian Blaise Cruz, Steven Halim, Alham Fikri Aji

TL;DR

The paper addresses the problem of evaluating large language models on competitive programming by disentangling problem solving from code generation. It proposes an editorial-centric pipeline where an intermediate natural-language editorial captures algorithmic reasoning, and code is judged separately, enabling explicit measurement of problem-solving ability and implementation fidelity. Through a dataset of 83 ICPC-style problems and 19 models, the study shows gold editorials yield large gains by isolating implementation gaps, while model-generated editorials reveal persistent problem-solving bottlenecks and occasional hallucinations. It further demonstrates that editorials transfer across models, enabling writer–coder pairings that improve performance, and argues for benchmarks that separately assess reasoning and implementation to guide future research. Overall, the work provides a principled framework for diagnosing and improving LLM-based CP solutions and suggests editorial-level evaluation as a scalable proxy for reasoning quality. The findings have practical significance for designing robust CP benchmarks and for modular, cross-model collaboration in AI-assisted problem solving.

Abstract

Large Language Models (LLMs) increasingly succeed on competitive programming problems, yet existing evaluations conflate algorithmic reasoning with code-level implementation. We argue that competitive programming is fundamentally a problem-solving task and propose centering natural-language editorials in both solution generation and evaluation. Generating an editorial prior to code improves solve rates for some LLMs, with substantially larger gains when using expertly written gold editorials. However, even with gold editorials, models continue to struggle with implementation, while the gap between generated and gold editorials reveals a persistent problem-solving bottleneck in specifying correct and complete algorithms. Beyond pass/fail metrics, we diagnose reasoning errors by comparing model-generated editorials to gold standards using expert annotations and validate an LLM-as-a-judge protocol for scalable evaluation. We introduce a dataset of 83 ICPC-style problems with gold editorials and full test suites, and evaluate 19 LLMs, arguing that future benchmarks should explicitly separate problem solving from implementation.
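The three evaluation settings (no editorial, generated editorial, gold editorial) can be sketched as a small dispatch function. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Model` callable type, the prompt wording, and the names `build_code_prompt` and `solve` are all hypothetical. Only the setting labels w/oEd, w/GenEd, and w/GoldEd come from the paper.

```python
from typing import Callable, Optional

# Hypothetical interface: a model is any function mapping a prompt to a response.
Model = Callable[[str], str]


def build_code_prompt(problem: str, editorial: Optional[str] = None) -> str:
    """Assemble the coding prompt; the wording here is illustrative only."""
    parts = ["Problem statement:\n" + problem]
    if editorial is not None:
        parts.append("Editorial:\n" + editorial)
    parts.append("Write a complete program that solves the problem.")
    return "\n\n".join(parts)


def solve(
    problem: str,
    coder: Model,
    setting: str,
    writer: Optional[Model] = None,
    gold_editorial: Optional[str] = None,
) -> dict:
    """Run one problem under one of the three editorial settings."""
    if setting == "w/oEd":
        # Baseline: problem -> code, no intermediate editorial.
        editorial = None
    elif setting == "w/GenEd":
        # Problem -> generated editorial -> code. The writer may differ from
        # the coder (a writer-coder pairing); by default they are the same.
        writer = writer or coder
        editorial = writer(
            "Write an editorial describing the algorithm for this problem.\n\n"
            + problem
        )
    elif setting == "w/GoldEd":
        # Problem plus expert-written editorial -> code.
        if gold_editorial is None:
            raise ValueError("w/GoldEd requires a gold editorial")
        editorial = gold_editorial
    else:
        raise ValueError(f"unknown setting: {setting}")

    code = coder(build_code_prompt(problem, editorial))
    return {"setting": setting, "editorial": editorial, "code": code}
```

Returning the editorial alongside the code keeps the two artifacts separate, so the editorial can be assessed for problem-solving quality while the code is judged independently against the test suite.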

Paper Structure (80 sections, 6 equations, 31 figures, 10 tables)

Figures (31)

  • Figure 1: Overview of our evaluation pipeline and editorial annotation scheme. Left: three settings, w/oEd (problem $\rightarrow$ code, baseline), w/GenEd (problem $\rightarrow$ generated editorial $\rightarrow$ code), and w/GoldEd (problem plus gold editorial $\rightarrow$ code). Right: the LLM-generated editorial annotation rubric used to diagnose reasoning quality, covering Problem Understanding (PU-W, PU-M, PU-X, PU-D), Algorithm Description (ALG-TAG vs. Golden-ALG-TAG), and Algorithm Correctness (ALG-COR, correctness type, error type, and severity).
  • Figure 2: Mean virtual rank percentile under w/oEd, w/GenEd, and w/GoldEd (higher is better). Gold editorials yield large and consistent improvements (up to $\sim$0.4), yet even under gold guidance only a handful of models exceed $\sim$0.7, and only a small number attain high rank percentiles (above $\sim$0.8).
  • Figure 3: Aggregate failure verdict distribution across all editorial settings: Wrong Answer (WA), Time Limit Exceeded (TLE), Runtime Error (RTE), Compile Error (CE), and Memory Limit Exceeded (MLE). Remaining failures are dominated by WA, while TLE becomes more salient for some stronger models (notably Claude).
  • Figure 4: Six-way editorial correctness breakdown labeled by the LLM-as-a-judge. Frontier models produce more judge-Correct plans, but wrong algorithm—i.e., incorrect problem solving—remains the dominant error across many models.
  • Figure 5: Downstream verdict distribution (PASS/WA/TLE/RTE/CE/MLE) conditioned on editorial correctness labels. Editorial correctness labels meaningfully stratify downstream outcomes.
  • ...and 26 more figures
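
The conditioning described in the Figure 5 caption (downstream verdicts grouped by editorial correctness label) amounts to a per-label normalized tally. The sketch below assumes each result is a hypothetical `(editorial_label, verdict)` pair; the function name and the sample records are made up for illustration and are not data from the paper.

```python
from collections import Counter, defaultdict


def verdicts_by_label(records):
    """Return, for each editorial label, the fraction of each verdict."""
    counts = defaultdict(Counter)
    for label, verdict in records:
        counts[label][verdict] += 1
    return {
        label: {v: n / sum(c.values()) for v, n in c.items()}
        for label, c in counts.items()
    }


# Made-up records, for illustration only.
records = [
    ("Correct", "PASS"),
    ("Correct", "PASS"),
    ("Correct", "WA"),
    ("Wrong algorithm", "WA"),
]
dist = verdicts_by_label(records)
```

Each inner dictionary sums to 1, so labels with different numbers of editorials can be compared directly.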