Scaling Agentic Verifier for Competitive Coding
Zeyao Ma, Jing Zhang, Xiaokang Zhang, Jiaxi Yang, Zongmeng Zhang, Jiajun Zhang, Yuheng Jing, Lei Zhang, Hao Zheng, Wenting Zhao, Junyang Lin, Binyuan Hui
TL;DR
This work targets single-attempt correctness in competitive programming by using an execution-based verifier that actively searches for highly discriminative test inputs. It introduces Agentic Verifier, a multi-turn input-generation agent trained via a scalable pipeline of data synthesis, rejection fine-tuning, and agentic reinforcement learning to reveal behavioral discrepancies among candidate solutions. Across the competitive programming benchmarks evaluated, the approach yields consistent Best@K gains, with particularly strong improvements under larger test-time budgets and on harder problems, and shows clear test-time scaling behavior. The work further argues that benchmark verifiers are imperfect and that the verifier can complement ground-truth tests by exposing counterexamples that reveal false positives, improving reliability in large-scale code evaluation.
Abstract
Large language models (LLMs) have demonstrated strong coding capabilities but still struggle to solve competitive programming problems correctly in a single attempt. Execution-based re-ranking offers a promising test-time scaling strategy, yet existing methods are constrained by either difficult test case generation or inefficient random input sampling. To address this limitation, we propose Agentic Verifier, an execution-based agent that actively reasons about program behaviors and searches for highly discriminative test inputs that expose behavioral discrepancies among candidate solutions. Through multi-turn interaction with code execution environments, the verifier iteratively refines the candidate input generator and produces targeted counterexamples rather than blindly sampling inputs. We train the verifier to acquire this discriminative input generation capability via a scalable pipeline combining large-scale data synthesis, rejection fine-tuning, and agentic reinforcement learning. Extensive experiments across five competitive programming benchmarks demonstrate consistent improvements over strong execution-based baselines, achieving up to +10-15% absolute gains in Best@K accuracy. Further analysis reveals clear test-time scaling behavior and highlights the verifier's broader potential beyond reranking.
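To make the idea of a discriminative input concrete, the following is a minimal sketch of execution-based re-ranking, not the authors' implementation. It assumes candidates are plain Python callables and that the final pick is made by counting agreement with the majority output on inputs that split the candidates. The names `run_safely`, `is_discriminative`, and `select_best` are hypothetical, and the majority-agreement scoring rule is an illustrative assumption rather than something the abstract specifies.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence

# Assumption: each candidate solution is a callable whose output is hashable.
Candidate = Callable[[int], Hashable]


def run_safely(candidate: Candidate, test_input: int) -> Hashable:
    """Execute one candidate; any exception is treated as a distinct 'error' output."""
    try:
        return candidate(test_input)
    except Exception as exc:
        return ("error", type(exc).__name__)


def is_discriminative(candidates: Sequence[Candidate], test_input: int) -> bool:
    """An input is discriminative if at least two candidates disagree on it."""
    outputs = {run_safely(c, test_input) for c in candidates}
    return len(outputs) > 1


def select_best(candidates: Sequence[Candidate], test_inputs: Sequence[int]) -> int:
    """Return the index of the candidate that most often matches the majority
    output on discriminative inputs (illustrative scoring rule, an assumption)."""
    scores = [0] * len(candidates)
    for x in test_inputs:
        if not is_discriminative(candidates, x):
            continue  # inputs on which all candidates agree carry no ranking signal
        outputs = [run_safely(c, x) for c in candidates]
        majority_output, _ = Counter(outputs).most_common(1)[0]
        for i, out in enumerate(outputs):
            if out == majority_output:
                scores[i] += 1
    # Ties resolve to the lowest index.
    return max(range(len(candidates)), key=scores.__getitem__)
```

In this sketch, an input on which every candidate agrees contributes nothing to the ranking, which is why blind random sampling can waste execution budget and why targeted input search matters. Majority agreement can still favor a wrong answer when most candidates share the same bug, so this rule is only a stand-in for whatever selection criterion the full method uses.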
