ProBench: Benchmarking Large Language Models in Competitive Programming

Lei Yang, Renren Jin, Ling Shi, Jianxiang Peng, Yue Chen, Deyi Xiong

TL;DR

ProBench introduces a benchmarking framework that measures large language models on competitive programming tasks by collecting real ICPC-like problems from Codeforces, Luogu, and Nowcoder and evaluating model solutions through online submissions to the original platforms. It standardizes problem difficulty into Easy/Medium/Hard and normalizes algorithm tags into seven categories, enabling detailed, cross-platform analysis of reasoning depth and code robustness. Empirical results show that reasoning-oriented models, even with fewer parameters (e.g., $20.93$ pass@1 for QwQ-32B-Preview), can outperform larger non-reasoning models, underscoring the importance of reasoning training for programming tasks and providing insights into error types and CoT behavior. The work demonstrates the practicality and fairness of online evaluation for robust code testing and highlights actionable directions for advancing reasoning capabilities in future LLMs with strong programming performance.

Abstract

With the emergence of reasoning language models such as OpenAI-o3 and DeepSeek-R1, large language models (LLMs) have entered a new phase of development. However, existing coding benchmarks are increasingly inadequate for assessing the code reasoning capability of advanced LLMs. To bridge this gap in high-level code reasoning assessment, we propose ProBench, a benchmark for LLMs in competitive programming that draws inspiration from the International Collegiate Programming Contest. ProBench collects a comprehensive set of competitive programming problems published on the Codeforces, Luogu, and Nowcoder platforms from July to December 2024, and obtains real test results through online submissions to ensure the fairness and accuracy of the evaluation. We establish a unified problem attribute system that includes difficulty grading and algorithm tagging. With the carefully collected and annotated data in ProBench, we systematically assess nine recent LLMs in competitive programming across multiple dimensions, including chain-of-thought analysis, error type diagnosis, and reasoning depth evaluation. Experimental results show that QwQ-32B-Preview achieves the best score of 20.93, followed by DeepSeek-V3 with a score of 16.38, suggesting that models trained on specialized reasoning tasks significantly outperform general-purpose models in programming, even when the general-purpose models are larger. Further analysis also reveals key areas for improving programming capability, e.g., algorithm adaptability and reasoning sufficiency, providing important insights for the future development of reasoning models.
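The scores reported for ProBench are pass@1 values. As a minimal sketch of how such a score can be computed, assuming one generated solution per problem and a boolean accept/reject verdict from the online judge, the snippet below turns a set of verdicts into a percentage. The function name `pass_at_1` and the problem identifiers are hypothetical and do not come from the paper.

```python
from typing import Mapping


def pass_at_1(verdicts: Mapping[str, bool]) -> float:
    """Percentage of problems whose single submitted solution was accepted.

    Assumption: exactly one submission per problem, so pass@1 reduces to
    accepted problems divided by total problems.
    """
    if not verdicts:
        raise ValueError("no verdicts to score")
    accepted = sum(1 for ok in verdicts.values() if ok)
    return 100.0 * accepted / len(verdicts)


# Hypothetical verdicts: problem id -> accepted by the online judge?
example = {"P1": True, "P2": False, "P3": False, "P4": True, "P5": False}
print(pass_at_1(example))  # 40.0
```

Under this reading, a score of 20.93 would mean that roughly one in five problems was solved on the first submission.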

Paper Structure

This paper contains 22 sections, 8 figures, 6 tables.

Figures (8)

  • Figure 1: The pass@1 results of all evaluated models on ProBench. Model names in blue are reasoning models while the others are non-reasoning models.
  • Figure 2: CoT length, measured in characters, for each model, with models ranked by inference capability.
  • Figure 3: Ratio of the sum of error intervals in the code generated by each model. The interval $[1,4)$ counts the failed code instances that fall within the $[1,4)$ range of test cases.
  • Figure 4: Distribution of error types in the generated code, with the proportion of reasoning errors increasing from the innermost to the outermost layers.
  • Figure 5: Performance across different data structures and algorithms. As the rotation proceeds clockwise, the difficulty of reasoning gradually increases.
  • ...and 3 more figures