Idea First, Code Later: Disentangling Problem Solving from Code Generation in Evaluating LLMs for Competitive Programming
Sama Hadhoud, Alaa Elsetohy, Frederikus Hudi, Jan Christian Blaise Cruz, Steven Halim, Alham Fikri Aji
TL;DR
The paper evaluates large language models on competitive programming (CP) by disentangling problem solving from code generation. It proposes an editorial-centric pipeline in which an intermediate natural-language editorial captures the algorithmic reasoning and the resulting code is judged separately, so that problem-solving ability and implementation fidelity can be measured explicitly. Using a dataset of 83 ICPC-style problems and 19 models, the study shows that gold editorials yield large gains and isolate the remaining implementation gaps, while model-generated editorials reveal persistent problem-solving bottlenecks and occasional hallucinations. It further shows that editorials transfer across models, enabling writer–coder pairings that improve performance, and argues that benchmarks should assess reasoning and implementation separately. Overall, the work offers a framework for diagnosing and improving LLM-based CP solutions and suggests editorial-level evaluation as a scalable proxy for reasoning quality. The findings bear on the design of CP benchmarks and on modular, cross-model collaboration in AI-assisted problem solving.
Abstract
Large Language Models (LLMs) increasingly succeed on competitive programming problems, yet existing evaluations conflate algorithmic reasoning with code-level implementation. We argue that competitive programming is fundamentally a problem-solving task and propose centering natural-language editorials in both solution generation and evaluation. Generating an editorial prior to code improves solve rates for some LLMs, with substantially larger gains when using expertly written gold editorials. However, even with gold editorials, models continue to struggle with implementation, while the gap between generated and gold editorials reveals a persistent problem-solving bottleneck in specifying correct and complete algorithms. Beyond pass/fail metrics, we diagnose reasoning errors by comparing model-generated editorials to gold standards using expert annotations and validate an LLM-as-a-judge protocol for scalable evaluation. We introduce a dataset of 83 ICPC-style problems with gold editorials and full test suites, and evaluate 19 LLMs, arguing that future benchmarks should explicitly separate problem solving from implementation.
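To make the separation concrete, the following minimal Python sketch shows one way the three evaluation conditions described above (no editorial, model-generated editorial, gold editorial) could be wired together. It is an illustration under stated assumptions, not the authors' implementation: the names `Problem`, `write_editorial`, `write_code`, and `solve_rate`, the prompt wording, and the representation of a model as a plain prompt-to-text callable are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

# Assumption: a "model" is any callable mapping a prompt string to generated text.
Model = Callable[[str], str]
# Assumption: a judge runs submitted code against the full test suite and
# returns True only if every test passes.
Judge = Callable[[str], bool]


@dataclass
class Problem:
    statement: str
    gold_editorial: str
    judge: Judge


def write_editorial(writer: Model, problem: Problem) -> str:
    """Stage 1 (problem solving): describe the algorithm in natural language."""
    prompt = "Write an editorial for this problem.\n\n" + problem.statement
    return writer(prompt)


def write_code(coder: Model, problem: Problem, editorial: Optional[str]) -> str:
    """Stage 2 (implementation): produce code, optionally guided by an editorial."""
    prompt = "Implement a solution.\n\n" + problem.statement
    if editorial is not None:
        prompt += "\n\nEditorial:\n" + editorial
    return coder(prompt)


def solve_rate(
    problems: Sequence[Problem],
    coder: Model,
    condition: str,
    writer: Optional[Model] = None,
) -> float:
    """Fraction of problems solved under one condition.

    condition is "none", "generated", or "gold". For "generated", the
    editorial comes from `writer`, which defaults to the coder itself;
    passing a different model gives a writer-coder pairing.
    """
    if not problems:
        raise ValueError("problems must be non-empty")
    solved = 0
    for problem in problems:
        if condition == "none":
            editorial = None
        elif condition == "generated":
            editorial = write_editorial(writer or coder, problem)
        elif condition == "gold":
            editorial = problem.gold_editorial
        else:
            raise ValueError(f"unknown condition: {condition!r}")
        code = write_code(coder, problem, editorial)
        solved += int(problem.judge(code))
    return solved / len(problems)
```

Under this framing, the difference between the "gold" and "generated" solve rates of one coder reflects the problem-solving bottleneck, while failures that remain under "gold" point to implementation errors.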
