LiveOIBench: Can Large Language Models Outperform Human Contestants in Informatics Olympiads?

Kaijian Zou, Aaron Xiong, Yunxiang Zhang, Frederick Zhang, Yueqi Ren, Jirong Yang, Ayoung Lee, Shitanshu Bhushan, Lu Wang

TL;DR

LiveOIBench addresses key gaps in coding benchmarks by aggregating 403 tasks from 72 IOI-style contests across 14 Informatics Olympiads (2023–2025), paired with expert private tests and official human rankings, all evaluated offline for reproducibility. Benchmark results across 34 models show GPT-5 attaining about the 82nd percentile but not matching elite humans, while open-weight models improve with increased reasoning budgets, narrowing some gaps. Algorithmic difficulty analyses reveal weaknesses in dynamic programming and related data-structuring problems, and reasoning-trace studies show stronger models allocate tokens toward planning and analysis rather than blind exploration. The work demonstrates robust methods for contamination control and offers a pathway for future improvements via targeted reasoning enhancements and inference-time scaling.

Abstract

Competitive programming problems increasingly serve as valuable benchmarks to evaluate the coding capabilities of large language models (LLMs) due to their complexity and ease of verification. Yet current coding benchmarks face limitations such as a lack of exceptionally challenging problems, insufficient test case coverage, and reliance on online platform APIs that limits accessibility. To address these issues, we introduce LiveOIBench, a comprehensive benchmark featuring 403 expert-curated Olympiad-level competitive programming problems, each with an average of 60 expert-designed test cases. The problems are sourced directly from 72 official contests of 14 Informatics Olympiads in different regions conducted between 2023 and 2025. LiveOIBench distinguishes itself through four key features: (1) meticulously curated high-quality tasks with detailed subtask rubrics and extensive private test cases; (2) direct integration of elite contestant performance data to enable informative comparison against top-performing humans; (3) planned continuous, contamination-free updates from newly released Olympiad problems; and (4) a self-contained evaluation system facilitating offline and easy-to-reproduce assessments. Benchmarking 34 popular general-purpose and reasoning LLMs, we find that GPT-5 achieves a notable 81.76th percentile, a strong result that nonetheless falls short of top human contestants, who usually place above the 90th percentile. In contrast, among open-weight reasoning models, GPT-OSS-120B reaches only the 60th percentile, underscoring significant capability disparities from frontier closed models. Detailed analyses indicate that robust reasoning models prioritize precise problem analysis over excessive exploration, suggesting future models should emphasize structured analysis and minimize unnecessary exploration. All data, code, and leaderboard results are publicly available on our website.

Paper Structure

This paper contains 51 sections, 1 equation, 18 figures, 14 tables.

Figures (18)

  • Figure 1: LiveOIBench. Average human percentile across all contests versus average completion tokens per problem. The dashed boxes highlight the lower performance range of non-thinking LLMs. OpenAI models lie on the token-efficiency frontier, achieving higher human percentile with fewer tokens. Despite improvements, all evaluated models remain below the Gold medal threshold (top 10% human performance), indicating substantial room for progress.
  • Figure 2: Parallel Scaling displays the Pass@k performance, illustrating how the success rate improves as more solutions (k) are sampled per problem. GPT-5 shows the highest sample efficiency and overall performance ceiling.
  • Figure 3: Reasoning Trace Analyses. We categorize eight reasoning behaviors and divide them into five groups: Analysis (Algorithm/Proof analysis and Complexity Analysis), Planning (Problem Restatement and Subgoal Setting), Exploration (Backtracking and Dead‑end recognition), Implementation (Pseudo implementation), Verification (Test Case Verification).
  • Figure 4: Submission status distribution for six selected models. The models are sorted by performance from left to right. Solutions by stronger reasoning models show substantially fewer time-limit, memory-limit, and compilation-error failures.
  • Figure 5: No significant positive correlation is observed between GPT-OSS-120B's familiarity with task statements and solutions (normalized via min-max scaling) and its performance, indicating that higher familiarity does not necessarily translate to better outcomes.
  • ...and 13 more figures
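
The Pass@k metric named in Figure 2 can be illustrated with a short sketch. The text above does not say how the paper computes it, so the code below assumes the widely used unbiased estimator, 1 - C(n-c, k) / C(n, k), where n solutions are sampled for a problem and c of them pass. The function names `pass_at_k` and `mean_pass_at_k` and the sample counts are hypothetical and chosen only for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples passes.

    Assumption (not confirmed by the source): the standard unbiased
    estimator 1 - C(n - c, k) / C(n, k), with n samples drawn per
    problem and c of them correct.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failing samples: every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) pairs per problem."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical counts: 8 samples per problem, with 2, 0, and 8 passing.
    per_problem = [(8, 2), (8, 0), (8, 8)]
    for k in (1, 4, 8):
        print(f"pass@{k} = {mean_pass_at_k(per_problem, k):.3f}")
```

Under this estimator, a problem with no passing samples contributes 0 for every k, so sampling more solutions raises the average only on problems the model can solve at least occasionally.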