GenX: Mastering Code and Test Generation with Execution Feedback
Nan Wang, Yafei Liu, Chen Chen, Haonan Lu
TL;DR
GenX addresses code and test generation when ground-truth data is limited: it trains a code-generation model and a test-generation model jointly and uses execution feedback to improve both. It introduces a two-stage data-augmentation framework (execution-guided test generation, followed by rejection-sampling-based code generation) and a dual-critic scoring algorithm that operates on a binary pass/fail matrix $P$ to yield $code\_scores$ and $test\_scores$. Experiments on the APPS dataset show that augmenting the data with generated tests (APPS+) improves key metrics, and that ranking with the scoring function selects correct code and tests better than CodeT in several settings. The approach demonstrates scalable, execution-informed co-generation of code and tests, with practical implications for AI-assisted programming and software engineering tools.
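To make the role of the pass/fail matrix $P$ concrete, here is a minimal sketch of how scores for code and tests could be derived from it. The summary above does not spell out the dual-critic update rule, so this example swaps in a simple HITS-style mutual-reinforcement heuristic: a code candidate scores higher when it passes well-scored tests, and a test scores higher when well-scored candidates pass it. The function name `mutual_scores`, the helper `_normalize`, and the `iterations` parameter are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple


def _normalize(values: List[float]) -> List[float]:
    """Scale values so the largest is 1.0; leave all-zero or empty lists unchanged."""
    peak = max(values, default=0.0)
    return [v / peak for v in values] if peak > 0 else values


def mutual_scores(
    P: List[List[int]], iterations: int = 20
) -> Tuple[List[float], List[float]]:
    """Illustrative scoring over a binary pass/fail matrix.

    P[i][j] is 1 if code candidate i passes test j, else 0.
    Assumption: this HITS-style iteration is a stand-in for illustration,
    not the dual-critic algorithm from the paper.
    """
    n_code = len(P)
    n_test = len(P[0]) if P else 0
    code_scores = [1.0] * n_code
    test_scores = [1.0] * n_test
    for _ in range(iterations):
        # Credit each candidate for the tests it passes, weighted by test score.
        code_scores = _normalize(
            [sum(P[i][j] * test_scores[j] for j in range(n_test)) for i in range(n_code)]
        )
        # Credit each test for the candidates that pass it, weighted by code score.
        test_scores = _normalize(
            [sum(P[i][j] * code_scores[i] for i in range(n_code)) for j in range(n_test)]
        )
    return code_scores, test_scores


# Candidates 0 and 1 agree on tests 0 and 1; candidate 2 only passes test 2.
P = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
]
code_scores, test_scores = mutual_scores(P)
best_code = max(range(len(code_scores)), key=code_scores.__getitem__)
```

In this toy matrix the two mutually consistent candidates and the tests they pass end up ranked above the outlier candidate and its test. Consensus-based heuristics like this one can be misled when many candidates share the same bug, so the example should be read only as an illustration of scoring from $P$.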
Abstract
Recent advancements in language modeling have enabled the translation of natural language into code, and the use of execution feedback to improve code generation. However, these methods often rely heavily on pre-existing test cases, which may not always be available or comprehensive. In this work, we propose a novel approach that concurrently trains a code generation model and a test generation model, utilizing execution feedback to refine and enhance the performance of both. We introduce two strategies for test and code data augmentation and a new scoring function for code and test ranking. We experiment on the APPS dataset and demonstrate that our approach can effectively generate and augment test cases, filter and synthesize correct code solutions, and rank the quality of generated code and tests. The results demonstrate that our models, when iteratively trained with an increasing number of test cases and code solutions, outperform those trained on the original dataset.
