CodeHacker: Automated Test Case Generation for Detecting Vulnerabilities in Competitive Programming Solutions

Jingwei Shi, Xinxiang Yin, Jing Huang, Jinman Zhao, Shengyu Tao

TL;DR

CodeHacker is an automated agent framework that generates targeted adversarial test cases to expose latent vulnerabilities in program submissions. Used as training data, the generated cases also boost the performance of RL-trained models on benchmarks like LiveCodeBench.

Abstract

The evaluation of Large Language Models (LLMs) for code generation relies heavily on the quality and robustness of test cases. However, existing benchmarks often lack coverage for subtle corner cases, allowing incorrect solutions to pass. To bridge this gap, we propose CodeHacker, an automated agent framework dedicated to generating targeted adversarial test cases that expose latent vulnerabilities in program submissions. Mimicking the hack mechanism in competitive programming, CodeHacker employs a multi-strategy approach, including stress testing, anti-hash attacks, and logic-specific targeting, to break specific code submissions. To ensure the validity and reliability of these attacks, we introduce a Calibration Phase, where the agent iteratively refines its own Validator and Checker via self-generated adversarial probes before evaluating contestant code. Experiments demonstrate that CodeHacker significantly improves the True Negative Rate (TNR) of existing datasets, effectively filtering out incorrect solutions that were previously accepted. Furthermore, generated adversarial cases prove to be superior training data, boosting the performance of RL-trained models on benchmarks like LiveCodeBench.
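
To make the stress-testing strategy named in the abstract concrete, the following is a minimal sketch of the general technique, not the authors' implementation. It assumes a hypothetical maximum-subarray problem; the names `brute_force_max_subarray`, `buggy_max_subarray`, and `stress_test` are illustrative and do not come from the paper. Small random inputs are fed to both a trusted brute-force reference and the submission under attack, and the first input on which they disagree is returned as a hack.

```python
import random
from typing import Callable, List, Optional


def brute_force_max_subarray(a: List[int]) -> int:
    """Trusted O(n^2) reference: try every non-empty subarray."""
    best = a[0]
    for i in range(len(a)):
        total = 0
        for j in range(i, len(a)):
            total += a[j]
            best = max(best, total)
    return best


def buggy_max_subarray(a: List[int]) -> int:
    """Hypothetical contestant code: Kadane's algorithm seeded with 0,
    which wrongly returns 0 when every element is negative."""
    best = cur = 0
    for x in a:
        cur = max(0, cur + x)
        best = max(best, cur)
    return best


def stress_test(
    candidate: Callable[[List[int]], int],
    reference: Callable[[List[int]], int],
    trials: int = 1000,
    seed: int = 0,
) -> Optional[List[int]]:
    """Return the first random input where candidate and reference disagree,
    or None if no disagreement is found within `trials` attempts."""
    rng = random.Random(seed)  # fixed seed keeps the search reproducible
    for _ in range(trials):
        n = rng.randint(1, 6)  # small inputs keep the brute force cheap
        a = [rng.randint(-5, 5) for _ in range(n)]
        if candidate(a) != reference(a):
            return a
    return None


hack = stress_test(buggy_max_subarray, brute_force_max_subarray)
print("adversarial input:", hack)
```

Small value ranges are a deliberate choice in this sketch: they make rare corner cases (here, an all-negative array) likely to appear within a few hundred trials, and short inputs are easier to inspect once a hack is found.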

Paper Structure (77 sections, 16 equations, 10 figures, 9 tables, 2 algorithms)

Figures (10)

  • Figure 1: The overall architecture of the CodeHacker framework. Phase I (Evaluation Tool Calibration): The agent iteratively refines the judging infrastructure to ensure reliability. This process begins by refining the Validator to strictly enforce input constraints, followed by refining the Checker to eliminate false verdicts. Phase II (Adversarial Case Generation): Utilizing the calibrated tools, the Code Analyst guides three distinct generation strategies (Stress, LLM-based, and Anti-hash) to synthesize adversarial test cases that expose specific vulnerabilities in the contestant's submission.
  • Figure 2: Pass@5 performance comparison on LiveCodeBench. Models trained with adversarial data show consistent improvements. The model trained on our augmented subset achieves the highest accuracy across all difficulty levels.
  • Figure 3: Weak Checker in CodeContest+ for problem Codeforces 25_B
  • Figure 4: Our Checker for problem Codeforces 25_B
  • Figure 5: Wrong Validator in CodeContest+ for problem Codeforces 309_C
  • ...and 5 more figures
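
The Validator and Checker roles that the Figure 1 caption describes can be illustrated with a small sketch. It assumes a hypothetical problem, not taken from the paper: the input is a line with n (2 <= n <= 100) followed by a string of n digits, and any split of that string into groups of two or three digits joined by `-` is a correct answer. The function names `validate_input`, `weak_checker`, and `check_output` are illustrative. The weak checker compares against a single reference answer and therefore produces false verdicts on other valid splits; the stricter checker verifies the answer's properties directly.

```python
def is_ascii_digits(s: str) -> bool:
    """True only for non-empty strings made of the characters 0-9."""
    return s.isascii() and s.isdigit()


def validate_input(raw: str) -> bool:
    """Validator: accept the input only if it obeys every assumed constraint
    (line 1 is n with 2 <= n <= 100, line 2 is exactly n digits)."""
    lines = raw.split("\n")
    if lines and lines[-1] == "":  # tolerate exactly one trailing newline
        lines.pop()
    if len(lines) != 2:
        return False
    n_str, digits = lines
    if not is_ascii_digits(n_str) or len(n_str) > 3:
        return False  # not a number, or far too large to be valid
    if n_str != str(int(n_str)):
        return False  # reject leading zeros such as "06"
    n = int(n_str)
    return 2 <= n <= 100 and len(digits) == n and is_ascii_digits(digits)


def weak_checker(reference: str, answer: str) -> bool:
    """Weak checker: exact match against one reference answer.
    It rejects other valid splits, which is a false verdict."""
    return answer.strip() == reference


def check_output(digits: str, answer: str) -> bool:
    """Stricter checker: accept any split into groups of 2 or 3 digits
    whose concatenation reproduces the original digit string."""
    groups = answer.strip().split("-")
    for g in groups:
        if len(g) not in (2, 3) or not is_ascii_digits(g):
            return False
    return "".join(groups) == digits


print(validate_input("6\n549871\n"))          # well-formed input
print(weak_checker("54-98-71", "549-871"))    # valid answer, wrongly rejected
print(check_output("549871", "549-871"))      # valid answer, accepted
```

In this sketch the validator guards the generator side (no adversarial input may violate the constraints), while the checker guards the verdict side (no correct answer may be rejected and no malformed answer accepted).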