CodeHacker: Automated Test Case Generation for Detecting Vulnerabilities in Competitive Programming Solutions

Jingwei Shi; Xinxiang Yin; Jing Huang; Jinman Zhao; Shengyu Tao

TL;DR

CodeHacker is an automated agent framework that generates targeted adversarial test cases to expose latent vulnerabilities in program submissions. The generated cases also serve as training data that boosts the performance of RL-trained models on benchmarks such as LiveCodeBench.

Abstract

The evaluation of Large Language Models (LLMs) for code generation relies heavily on the quality and robustness of test cases. However, existing benchmarks often lack coverage of subtle corner cases, allowing incorrect solutions to pass. To bridge this gap, we propose CodeHacker, an automated agent framework dedicated to generating targeted adversarial test cases that expose latent vulnerabilities in program submissions. Mimicking the hack mechanism in competitive programming, CodeHacker employs a multi-strategy approach, including stress testing, anti-hash attacks, and logic-specific targeting, to break specific code submissions. To ensure the validity and reliability of these attacks, we introduce a Calibration Phase, where the agent iteratively refines its own Validator and Checker via self-generated adversarial probes before evaluating contestant code. Experiments demonstrate that CodeHacker significantly improves the True Negative Rate (TNR) of existing datasets, effectively filtering out incorrect solutions that were previously accepted. Furthermore, the generated adversarial cases prove to be superior training data, boosting the performance of RL-trained models on benchmarks like LiveCodeBench.
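To make the stress-testing strategy named in the abstract concrete, here is a minimal sketch of the general technique: run a candidate solution and a trusted brute-force reference on many small random inputs and report the first input on which they disagree. This is not the CodeHacker implementation. The task (maximum subarray sum), the function names, and the injected bug are all hypothetical choices made for illustration.

```python
import random


def reference_max_subarray(a):
    """Trusted O(n^2) brute force: try every non-empty subarray."""
    best = a[0]
    for i in range(len(a)):
        running = 0
        for j in range(i, len(a)):
            running += a[j]
            best = max(best, running)
    return best


def buggy_max_subarray(a):
    """Hypothetical submission: Kadane's algorithm with a wrong initial value.

    Starting from 0 silently allows the empty subarray, so the answer is
    wrong whenever every element is negative.
    """
    best = 0
    current = 0
    for x in a:
        current = max(0, current + x)
        best = max(best, current)
    return best


def stress_test(candidate, reference, trials=1000, seed=0):
    """Return the first random input where the two functions disagree, else None."""
    rng = random.Random(seed)  # fixed seed keeps the search reproducible
    for _ in range(trials):
        n = rng.randint(1, 6)  # small inputs make counterexamples easy to read
        a = [rng.randint(-5, 5) for _ in range(n)]
        if candidate(a) != reference(a):
            return a
    return None


hack = stress_test(buggy_max_subarray, reference_max_subarray)
print("counterexample:", hack)
```

Small input sizes are a deliberate choice here: a disagreement found on a short array is much easier to inspect than one found on a large input. A weak test set containing no all-negative arrays would accept the buggy submission, which is the kind of gap the abstract describes.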

Paper Structure (77 sections, 16 equations, 10 figures, 9 tables, 2 algorithms)

Introduction
Related Work
Code Benchmark.
Adversarial Data in Competitive Programming.
RLHF and RLVR.
Method
Problem Formulation
LLM agent as a hacking policy.
Hack success and success rate.
CodeHacker Agent
Phase I: Evaluation Tool Calibration
Validator Refinement
Checker Refinement
Anti-Hallucination Pipeline for Checker Update.
Expert Intervention for Complex Verification Logic.
...and 62 more sections

Figures (10)

Figure 1: The overall architecture of the CodeHacker framework. Phase I (Evaluation Tool Calibration): The agent iteratively refines the judging infrastructure to ensure reliability. This process begins by refining the Validator to strictly enforce input constraints, followed by refining the Checker to eliminate false verdicts. Phase II (Adversarial Case Generation): Utilizing the calibrated tools, the Code Analyst guides three distinct generation strategies (Stress, LLM-based, and Anti-hash) to synthesize adversarial test cases that expose specific vulnerabilities in the contestant's submission.
Figure 2: Pass@5 performance comparison on LiveCodeBench. Models trained with adversarial data show consistent improvements. The model trained on our augmented subset achieves the highest accuracy across all difficulty levels.
Figure 3: Weak Checker in CodeContest+ for problem Codeforces 25_B
Figure 4: Our Checker for problem Codeforces 25_B
Figure 5: Wrong Validator in CodeContest+ for problem Codeforces 309_C
...and 5 more figures
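The Figure 1 caption distinguishes a Validator, which enforces input constraints, from a Checker, which decides whether an output is acceptable. The sketch below illustrates those two roles on a hypothetical task: split a string of n digits into groups of two or three digits joined by hyphens. The task statement, the bounds, and both function names are assumptions made for illustration and are not taken from the paper or from any dataset.

```python
def validate_input(raw: str, max_n: int = 100) -> bool:
    """Validator: accept only inputs that satisfy the stated format exactly.

    Assumed format: line 1 is n with no leading zeros (2 <= n <= max_n),
    line 2 is exactly n ASCII digits, and the file ends with one newline.
    """
    lines = raw.split("\n")
    if len(lines) != 3 or lines[2] != "":
        return False
    n_line, digits = lines[0], lines[1]
    if not (n_line.isascii() and n_line.isdigit()):
        return False
    if n_line != str(int(n_line)):  # rejects leading zeros such as "06"
        return False
    n = int(n_line)
    if not 2 <= n <= max_n:
        return False
    return len(digits) == n and digits.isascii() and digits.isdigit()


def check_output(digits: str, output: str) -> bool:
    """Checker: accept any valid split, not just one reference answer."""
    groups = output.strip().split("-")
    if any(len(group) not in (2, 3) for group in groups):
        return False
    # The groups, read in order, must reproduce the original digit string.
    return "".join(groups) == digits


print(validate_input("6\n549871\n"))        # well-formed input
print(check_output("549871", "54-98-71"))   # one valid split
print(check_output("549871", "549-871"))    # another valid split
print(check_output("549871", "5-49871"))    # invalid group sizes
```

A checker that compared the output against a single stored answer would reject one of the two valid splits above, which is one way a checker can produce a false verdict on a task with multiple correct answers.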
