Scaling Agentic Verifier for Competitive Coding

Zeyao Ma, Jing Zhang, Xiaokang Zhang, Jiaxi Yang, Zongmeng Zhang, Jiajun Zhang, Yuheng Jing, Lei Zhang, Hao Zheng, Wenting Zhao, Junyang Lin, Binyuan Hui

TL;DR

This work targets single-attempt correctness in competitive programming by using an execution-based verifier that actively searches for highly discriminative test inputs. It introduces Agentic Verifier, a multi-turn input-generation agent that reveals behavioral discrepancies among candidate solutions and is trained via a scalable pipeline of data synthesis, rejection fine-tuning, and agentic reinforcement learning. Across five competitive programming benchmarks, the approach yields consistent Best@$K$ gains over strong execution-based baselines, with particularly strong improvements under larger test-time budgets and on harder problems, and it shows clear test-time scaling behavior. The work further argues that benchmark test suites are imperfect: the verifier can complement ground-truth tests by generating counterexamples that expose false positives, improving the reliability of large-scale code evaluation.

Abstract

Large language models (LLMs) have demonstrated strong coding capabilities but still struggle to solve competitive programming problems correctly in a single attempt. Execution-based re-ranking offers a promising test-time scaling strategy, yet existing methods are constrained by either difficult test case generation or inefficient random input sampling. To address this limitation, we propose Agentic Verifier, an execution-based agent that actively reasons about program behaviors and searches for highly discriminative test inputs that expose behavioral discrepancies among candidate solutions. Through multi-turn interaction with code execution environments, the verifier iteratively refines the candidate input generator and produces targeted counterexamples rather than blindly sampling inputs. We train the verifier to acquire this discriminative input generation capability via a scalable pipeline combining large-scale data synthesis, rejection fine-tuning, and agentic reinforcement learning. Extensive experiments across five competitive programming benchmarks demonstrate consistent improvements over strong execution-based baselines, achieving up to +10-15% absolute gains in Best@K accuracy. Further analysis reveals clear test-time scaling behavior and highlights the verifier's broader potential beyond reranking.
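
The execution-based re-ranking described in the abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, that each candidate solution is a Python callable mapping input text to output text, and that selection is a majority vote over output signatures. The excerpt does not specify the paper's exact voting rule, and the names `Solution`, `run_safely`, and `vote_by_execution`, as well as the toy candidates, are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

# Assumption: a candidate is modeled as a function from input text to output text.
Solution = Callable[[str], str]


def run_safely(solution: Solution, test_input: str) -> str:
    """Run one candidate; a crash is recorded as its own observable output."""
    try:
        return solution(test_input)
    except Exception as exc:
        return f"<error:{type(exc).__name__}>"


def vote_by_execution(candidates: Sequence[Solution], test_inputs: Sequence[str]) -> int:
    """Return the index of a candidate from the largest behavioral cluster.

    Candidates that produce identical outputs on every test input share a
    cluster. No expected outputs are needed, only inputs.
    """
    clusters: Dict[Tuple[str, ...], List[int]] = defaultdict(list)
    for idx, solution in enumerate(candidates):
        signature = tuple(run_safely(solution, x) for x in test_inputs)
        clusters[signature].append(idx)
    # Largest cluster wins; ties go to the cluster holding the earliest candidate.
    best = max(clusters.values(), key=lambda members: (len(members), -members[0]))
    return best[0]


# Toy task (hypothetical): print the sum of the integers on one line.
def sum_builtin(text: str) -> str:
    return str(sum(int(tok) for tok in text.split()))


def sum_loop(text: str) -> str:
    total = 0
    for tok in text.split():
        total += int(tok)
    return str(total)


def buggy_sum(text: str) -> str:
    # Bug: silently drops non-positive numbers.
    return str(sum(int(tok) for tok in text.split() if int(tok) > 0))


candidates = [buggy_sum, sum_builtin, sum_loop]
weak_choice = vote_by_execution(candidates, ["1 2 3"])
strong_choice = vote_by_execution(candidates, ["1 2 3", "-1 2"])
```

With only the input `"1 2 3"`, all three candidates agree, so the vote cannot separate them and falls back to the first candidate, which is the buggy one. Adding the input `"-1 2"` splits the buggy candidate into its own cluster and the vote selects a correct solution. This is the sense in which a single discriminative input can matter more than many uninformative ones.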

Paper Structure (22 sections, 7 equations, 9 figures, 3 tables)

Figures (9)

  • Figure 1: Best-of-$N$ performance under input-only execution-based voting as the number of test inputs increases. We compare randomly generated inputs, ground-truth test inputs from the benchmark suites, and inputs generated by our trained agentic verifier. Ground-truth inputs consistently outperform random inputs by a large margin, indicating the inefficiency of naive random input scaling. Verifier-generated inputs significantly improve over random generation and even surpass ground-truth test cases on USACO, demonstrating stronger discriminative power.
  • Figure 2: Overview of the agentic verifier training and inference pipeline. The verifier is trained on a multi-turn input generation task through large-scale data synthesis, rejection fine-tuning on successful interaction trajectories, and agentic reinforcement learning with hard negative solution pairs. At test time, the trained verifier generates discriminative test inputs for execution-based voting to select the best candidate solution.
  • Figure 3: Training dynamics of agentic reinforcement learning on a held-out test set. We report the reward over training steps (left), along with the rates of invalid inputs (middle) and distinguishing inputs (right), computed from test-set rollouts. Reinforcement learning steadily improves discriminative input generation while reducing invalid outputs, indicating stable and effective training.
  • Figure 4: Scaling effect of execution-based methods under the Best@$64$ setting as the number of test inputs increases exponentially. Our agentic verifier exhibits stronger and more stable performance gains compared with representative baselines.
  • Figure 5: An example of a benchmark false positive, where two solutions pass the original test suite but are distinguished by verifier-generated counterexample inputs.
  • ...and 4 more figures
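
The captions for Figures 3 and 5 refer to invalid inputs and distinguishing inputs. A minimal sketch of how a generated input might be sorted into those categories for a pair of candidate solutions is given below. It assumes a problem-specific validator is available and again models solutions as Python callables. The function names, the toy problem, and its constraints are hypothetical and are not taken from the paper.

```python
from collections import Counter
from typing import Callable, Iterable

# Assumptions: solutions map input text to output text, and a validator
# checks the problem's input constraints.
Solution = Callable[[str], str]
Validator = Callable[[str], bool]


def classify_input(test_input: str, validator: Validator,
                   solution_a: Solution, solution_b: Solution) -> str:
    """Label an input as 'invalid', 'distinguishing', or 'non-distinguishing'."""
    if not validator(test_input):
        return "invalid"
    if solution_a(test_input) != solution_b(test_input):
        return "distinguishing"
    return "non-distinguishing"


def label_counts(test_inputs: Iterable[str], validator: Validator,
                 solution_a: Solution, solution_b: Solution) -> Counter:
    """Count how many inputs fall under each label."""
    return Counter(
        classify_input(x, validator, solution_a, solution_b) for x in test_inputs
    )


# Toy problem (hypothetical): line 1 holds n, line 2 holds n integers in
# [-1000, 1000]; print the maximum.
def is_valid(text: str) -> bool:
    lines = text.strip().split("\n")
    if len(lines) != 2:
        return False
    try:
        n = int(lines[0])
        values = [int(tok) for tok in lines[1].split()]
    except ValueError:
        return False
    return n >= 1 and n == len(values) and all(-1000 <= v <= 1000 for v in values)


def max_correct(text: str) -> str:
    values = [int(tok) for tok in text.strip().split("\n")[1].split()]
    return str(max(values))


def max_buggy(text: str) -> str:
    # Bug: starting from 0 gives wrong answers when every value is negative.
    best = 0
    for tok in text.strip().split("\n")[1].split():
        best = max(best, int(tok))
    return str(best)


generated = ["3\n1 5 2", "2\n-4 -7", "2\n1"]
counts = label_counts(generated, is_valid, max_correct, max_buggy)
```

Here the all-negative input `"2\n-4 -7"` is the counterexample: both solutions agree on `"3\n1 5 2"`, so a test suite containing only inputs of that kind would accept the buggy solution as a false positive. The input `"2\n1"` violates the declared length and is rejected before any solution runs.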