Humanity's Last Code Exam: Can Advanced LLMs Conquer Human's Hardest Code Competition?
Xiangyang Li, Xiaopeng Li, Kuicai Dong, Quanhu Zhang, Rongju Ruan, Xinyi Dai, Xiaoshuang Liu, Shengchun Xu, Yasheng Wang, Ruiming Tang
TL;DR
HLCE (Humanity's Last Code Exam) presents a 235-problem benchmark drawn from IOI and ICPC World Finals to rigorously test advanced LLMs on difficult algorithmic coding tasks, including interactive variants. The authors implement a harmonized online-offline evaluation framework, introduce a self-recognition task, and examine test-time scaling laws across 12 SOTA models, revealing substantial gaps to human medalists and indicating that current upper bounds are far from saturated. Key findings show that even the strongest reasoning LLMs achieve only around 11–16% pass@1 on HLCE, while some models can reach medal-level performance in historical competitions but still lag human top-tier performers, especially at IOI. The work also demonstrates that test-time computation improves performance and that measures to prevent data leakage are essential, proposing HLCE as a milestone for advancing high-performance reasoning and human-AI collaborative programming. The code and data are publicly released to accelerate future research toward code LLMs capable of rivaling elite human competitors in competitive programming.
Abstract
Code generation is a core capability of large language models (LLMs), yet mainstream benchmarks (e.g., APPS and LiveCodeBench) contain questions of medium-level difficulty and pose no challenge to advanced LLMs. To better reflect advanced reasoning and code generation ability, we introduce Humanity's Last Code Exam (HLCE), comprising the 235 most challenging problems from the International Collegiate Programming Contest (ICPC) World Finals and the International Olympiad in Informatics (IOI), spanning 2010 to 2024. As part of HLCE, we design a harmonized online-offline sandbox that guarantees fully reproducible evaluation. Through our comprehensive evaluation, we observe that even the strongest reasoning LLMs, o4-mini (high) and Gemini-2.5 Pro, achieve pass@1 rates of only 15.9% and 11.4%, respectively. Meanwhile, we propose a novel "self-recognition" task to measure LLMs' awareness of their own capabilities. Results indicate that LLMs' self-recognition abilities are not proportionally correlated with their code generation performance. Finally, our empirical validation of test-time scaling laws reveals that current advanced LLMs have substantial room for improvement on complex programming tasks. We expect HLCE to become a milestone challenge for code generation and to catalyze advances in high-performance reasoning and human-AI collaborative programming. Our code and dataset are publicly available (https://github.com/Humanity-s-Last-Code-Exam/HLCE).
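For readers unfamiliar with the pass@1 figures quoted above, the following is a minimal sketch of the unbiased pass@k estimator commonly used in code-generation benchmarks. Whether HLCE computes its scores exactly this way is an assumption, not something stated in the abstract; the function names `pass_at_k` and `benchmark_pass_at_k` and the `(n, c)` result format are hypothetical, chosen only for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n sampled solutions, c of them correct.

    Uses the standard unbiased form 1 - C(n - c, k) / C(n, k): the probability
    that at least one of k solutions drawn without replacement is correct.
    """
    if not 0 <= c <= n:
        raise ValueError("need 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average per-problem pass@k over (n, c) pairs (hypothetical input format)."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k = 1 the estimator reduces to c / n per problem, so a benchmark-level pass@1 is simply the mean fraction of correct samples across problems.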
