AutoTest: Evolutionary Code Solution Selection with Test Cases
Zhihua Duan, Jialin Wang
TL;DR
AutoTest tackles the problem of selecting correct code solutions among multiple candidates by integrating automated test-case generation with code execution and an evolutionary genetic algorithm for ranking. It leverages large language models to generate both solutions and tests, then uses execution feedback to form a consensus and apply GA-based selection with parameters $\alpha$ and $\beta$ to identify the best solution. The method yields notable gains on the HumanEval benchmark, including roughly a $10$ percentage-point improvement in pass@1 over baselines and competitive pass@k performance, such as $85.0\%$ on pass@10 for code-davinci-002. This approach reduces reliance on any single model and demonstrates a viable pathway for robust solution selection in code generation tasks with practical impact on automated programming assistants and evaluation benchmarks.
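For readers unfamiliar with the metric, pass@k on HumanEval is commonly computed with the unbiased estimator $1 - \binom{n-c}{k} / \binom{n}{k}$, where $n$ samples are drawn per problem and $c$ of them pass the hidden tests. The sketch below assumes that AutoTest's reported numbers follow this standard convention, which the summary above does not state explicitly. The function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: 1 - C(n-c, k) / C(n, k).

    n: number of sampled solutions, c: number that are correct, k: budget.
    Assumption: this is the standard HumanEval estimator, not a formula
    taken from the AutoTest text itself.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The benchmark score is the mean of this quantity over all problems.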
Abstract
As code generation techniques mature, selecting the correct solution from multiple candidates has become a crucial task. This study proposes AutoTest, a technique that combines automated test case generation with code execution and uses an evolutionary genetic algorithm to optimize the selection process. First, AutoTest uses large pre-trained language models such as codegen-16B, code-davinci-002, and incoder-6B to produce code solutions and their corresponding test cases. Next, it executes the code solutions against the test cases and forms a consensus set from the results. Fine-grained ranking is then performed with the selection, mutation, and crossover operators of an evolutionary genetic algorithm, tuned through the alpha and beta parameters. Finally, the best code solution is chosen. On the HumanEval benchmark, which consists of 164 programming problems, AutoTest improves the pass@1 score by approximately 10% over the baseline method.
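The consensus step described above can be illustrated with a minimal sketch. It assumes that a consensus set groups candidate solutions passing exactly the same generated tests, and it substitutes a plain sort for the genetic algorithm. The scoring rule, including how `alpha` and `beta` enter it, is a hypothetical stand-in and not the paper's formula. All function names are illustrative.

```python
from collections import defaultdict
from typing import Callable, Dict, FrozenSet, List, Tuple

Solution = Callable[..., object]
Test = Callable[[Solution], bool]


def _passes(solution: Solution, test: Test) -> bool:
    """Run one test against one solution; any exception counts as a failure."""
    try:
        return bool(test(solution))
    except Exception:
        return False


def build_consensus_sets(
    solutions: Dict[str, Solution], tests: List[Test]
) -> Dict[FrozenSet[int], List[str]]:
    """Group solution names by the exact set of test indices they pass."""
    groups: Dict[FrozenSet[int], List[str]] = defaultdict(list)
    for name, fn in solutions.items():
        passed = frozenset(i for i, t in enumerate(tests) if _passes(fn, t))
        groups[passed].append(name)
    return dict(groups)


def rank_consensus_sets(
    solutions: Dict[str, Solution],
    tests: List[Test],
    alpha: float = 1.0,
    beta: float = 1.0,
) -> List[Tuple[float, List[str]]]:
    """Rank consensus sets, best first.

    Hypothetical score: (#agreeing solutions ** alpha) * (#passed tests ** beta).
    A plain sort replaces the evolutionary search described in the text.
    """
    ranked = []
    for passed, names in build_consensus_sets(solutions, tests).items():
        score = (len(names) ** alpha) * (len(passed) ** beta)
        ranked.append((score, sorted(names)))
    ranked.sort(key=lambda item: item[0], reverse=True)
    return ranked
```

A final answer could then be drawn from the top-ranked set, for example `rank_consensus_sets(solutions, tests)[0][1][0]`.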
