Codehacks: A Dataset of Adversarial Tests for Competitive Programming Problems Obtained from Codeforces

Max Hort, Leon Moonen

TL;DR

The paper addresses the reliability gap in test-based evaluation of code synthesized by large language models by introducing Codehacks, a large-scale dataset of failure-inducing hacks derived from Codeforces. It collects 288,617 successful hacks across 5,578 problems, with 2,196 matched submissions and problem descriptions, enabling data-driven generation and evaluation of edge-case tests. The authors discuss related datasets, program-synthesis models, and test-generation methods to contextualize Codehacks as a resource for training and benchmarking adversarial and fault-inducing tests. The work has practical impact for improving code-synthesis robustness and test-generation pipelines, offering resources and directions for future adversarial testing and dataset expansion.

Abstract

Software is used in critical applications in our day-to-day life and it is important to ensure its correctness. One popular approach to assess correctness is to evaluate software on tests. If a test fails, it indicates a fault in the software under test; if all tests pass, one may assume that the software is correct. However, the reliability of these results depends on the test suite considered, and there is a risk of false negatives (i.e., software that passes all available tests but contains bugs because some cases are not tested). Therefore, it is important to consider error-inducing test cases when evaluating software. To support data-driven creation of such a test suite, which is especially of interest for testing software synthesized from large language models, we curate a dataset (Codehacks) of programming problems together with corresponding error-inducing test cases (i.e., "hacks"). This dataset is collected from the wild, in particular, from the Codeforces online judge platform. The dataset comprises 288,617 hacks for 5,578 programming problems, each with a natural language description, as well as the source code for 2,196 submitted solutions to these problems that can be broken with their corresponding hacks.

Keywords: competitive programming, language model, dataset
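To make the notion of a hack concrete, the following is a minimal Python sketch of a submission that passes a weak test suite yet fails on one error-inducing input. The toy problem (report the maximum of a list), the function names `buggy_max`, `reference_max`, and `failing_tests`, and all test data are hypothetical illustrations; none of them are taken from Codehacks or Codeforces.

```python
def buggy_max(values):
    """Hypothetical submission: returns the maximum of a non-empty list."""
    # Bug: starting from 0 gives a wrong answer when every value is negative.
    best = 0
    for v in values:
        if v > best:
            best = v
    return best


def reference_max(values):
    """Hypothetical reference solution used as the oracle."""
    return max(values)


def failing_tests(solution, tests):
    """Return the inputs on which `solution` disagrees with the reference."""
    return [t for t in tests if solution(t) != reference_max(t)]


# A weak suite without negative-only inputs: the bug goes unnoticed.
weak_suite = [[1, 2, 3], [7], [0, 4, 2]]

# A "hack": a single input chosen to expose the bug.
hack = [-5, -2]

print(failing_tests(buggy_max, weak_suite))
print(failing_tests(buggy_max, weak_suite + [hack]))
```

The first call reports no failures, which is the false negative described above; adding the hack to the suite reveals the fault.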

Paper Structure

This paper contains 11 sections and 4 figures.

Figures (4)

  • Figure 1: Example of a hacked Codeforces submission with corresponding problem description and hacking attempt.
  • Figure 2: Structure of the collected dataset.
  • Figure 3: Verdicts of the submitted hacks.
  • Figure 4: Distribution of hacks per problem tag (y-axis) and difficulty (x-axis).