
GenX: Mastering Code and Test Generation with Execution Feedback

Nan Wang, Yafei Liu, Chen Chen, Haonan Lu

TL;DR

GenX addresses code and test generation when ground-truth data is limited: it trains a code-generation model and a test-generation model jointly and uses execution feedback to improve both. It introduces a two-stage data-augmentation framework (execution-guided test generation followed by rejection-sampling-based code generation) coupled with a dual-critic scoring algorithm that operates on a binary pass/fail matrix $P$ to yield $code\_scores$ and $test\_scores$. Experiments on the APPS dataset show that training with the test-augmented data (APPS+) improves key metrics, and that ranking with the scoring function selects correct code and tests better than CodeT in several settings. The approach demonstrates scalable, execution-informed co-generation of code and tests, with practical implications for AI-assisted programming and software engineering tools.
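To make the pass/fail matrix concrete, the following is a minimal sketch, assuming $P[i][j] = 1$ when code candidate $i$ produces the expected output on test $j$. The mutual-weighting loop is an illustrative stand-in and not the paper's scoring algorithm; the function names `build_pass_matrix` and `mutual_scores`, the update rule, and the toy candidates are all assumptions made for this example.

```python
def build_pass_matrix(codes, tests):
    """P[i][j] = 1 if code i returns the expected output on test j, else 0."""
    matrix = []
    for fn in codes:
        row = []
        for arg, expected in tests:
            try:
                row.append(1 if fn(arg) == expected else 0)
            except Exception:
                # A crash counts as a failed test.
                row.append(0)
        matrix.append(row)
    return matrix


def mutual_scores(P, iterations=10):
    """Illustrative (assumed) scoring: a code scores higher when it passes
    highly scored tests, and a test scores higher when highly scored codes
    pass it. Scores are rescaled to a maximum of 1 after every update."""
    n_codes, n_tests = len(P), len(P[0])
    code_scores = [1.0] * n_codes
    test_scores = [1.0] * n_tests
    for _ in range(iterations):
        code_scores = [
            sum(P[i][j] * test_scores[j] for j in range(n_tests))
            for i in range(n_codes)
        ]
        norm = max(code_scores) or 1.0
        code_scores = [s / norm for s in code_scores]
        test_scores = [
            sum(P[i][j] * code_scores[i] for i in range(n_codes))
            for j in range(n_tests)
        ]
        norm = max(test_scores) or 1.0
        test_scores = [s / norm for s in test_scores]
    return code_scores, test_scores


# Toy task: absolute value. The third candidate is wrong for negative
# inputs, and the last test carries a wrong expected output.
codes = [lambda x: abs(x), lambda x: x if x >= 0 else -x, lambda x: x]
tests = [(3, 3), (-2, 2), (0, 0), (-1, -1)]

P = build_pass_matrix(codes, tests)
code_scores, test_scores = mutual_scores(P)
```

In this toy case the two correct candidates tie for the top code score, and the test with the wrong expected output receives the lowest test score, which is the kind of joint ranking signal the matrix is meant to expose when neither ground-truth code nor ground-truth tests are available.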

Abstract

Recent advancements in language modeling have enabled the translation of natural language into code, and the use of execution feedback to improve code generation. However, these methods often rely heavily on pre-existing test cases, which may not always be available or comprehensive. In this work, we propose a novel approach that concurrently trains a code generation model and a test generation model, utilizing execution feedback to refine and enhance the performance of both. We introduce two strategies for test and code data augmentation and a new scoring function for code and test ranking. We experiment on the APPS dataset and demonstrate that our approach can effectively generate and augment test cases, filter and synthesize correct code solutions, and rank the quality of generated code and tests. The results demonstrate that our models, when iteratively trained with an increasing number of test cases and code solutions, outperform those trained on the original dataset.

Paper Structure

This paper contains 26 sections, 3 figures, 7 tables, 2 algorithms.

Figures (3)

  • Figure 1: The roles of execution in different situations, where W represents a code solution and T represents a test case. When a ground-truth code solution is available, we can correct the wrong outputs of the generated tests. When ground-truth test cases are available, we can use them to filter out incorrect generated code solutions and synthesize more correct ones. When only generated code and tests exist, we can execute them against each other to rank their quality.
  • Figure 2: The experimental results of test generation and code generation when replaying historical samples during the second iteration. Score is defined as the product of pass rate and pass num.
  • Figure 3: The experimental results under three iterations and four different versions (v1 to v4) of data for test generation and code generation. Score is defined as the product of pass rate and pass num.
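
The Score used in Figures 2 and 3 can be sketched as follows. The captions only state that Score is the product of pass rate and pass num; this sketch assumes that pass num is the count of generated samples that pass execution and that pass rate is that count divided by the total number of generated samples. The function name `score` and the boolean-list input are hypothetical.

```python
def score(results):
    """Score = pass rate * pass num, under the assumed definitions:
    pass num  = number of samples that pass execution,
    pass rate = pass num / total number of samples."""
    pass_num = sum(1 for ok in results if ok)
    pass_rate = pass_num / len(results) if results else 0.0
    return pass_rate * pass_num


# Four generated samples, three of which pass: 0.75 * 3 = 2.25
example = score([True, True, False, True])
```

Under these assumptions, the product rewards both the fraction of valid samples and their absolute count, so a model that emits few but correct samples does not outrank one that emits many correct samples at a similar rate.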