Sifting through the Chaff: On Utilizing Execution Feedback for Ranking the Generated Code Candidates

Zhihong Sun, Yao Wan, Jia Li, Hongyu Zhang, Zhi Jin, Ge Li, Chen Lyu

TL;DR

RankEF addresses the code-ranking bottleneck by learning from execution feedback in a dual-task setup that combines code classification with execution-feedback generation. Through three multi-task strategies and a CodeT5+-based encoder, RankEF learns error causality and improves ranking without executing code during inference. Evaluated on APPS, MBPP, and HumanEval across multiple base models, RankEF consistently outperforms non-execution baselines like CodeRanker and demonstrates better transferability between datasets. The approach offers a practical, safer, and scalable solution for ranking generated code in real-world developer workflows.

Abstract

Large Language Models (LLMs), such as GPT-4, StarCoder, and CodeLlama, are transforming the way developers approach programming by automatically generating code based on given natural language descriptions. Despite advancements, generating syntactically and semantically correct code remains challenging, especially for complex programming tasks. Existing approaches typically generate multiple candidate solutions using LLMs to increase the likelihood of producing correct code. However, selecting the correct code from these candidates, a process known as code ranking, remains a major challenge. Current research on code ranking can be categorized into execution-based and non-execution-based methods. Execution-based methods, although effective, encounter notable limitations, such as scarcity of quality unit tests and security risks. Non-execution-based methods like CodeRanker, which rely solely on classification labels to train a code ranker, struggle to capture subtle errors and provide detailed error insights. Recognizing the strengths and limitations of both approaches, we propose a new method. The key insight of our work is that an effective code ranker is expected to truly comprehend the underlying causes of erroneous code, as relying solely on classification labels is insufficient. Inspired by this, this paper puts forward RankEF, an innovative approach for code ranking that leverages execution feedback. RankEF employs multi-task learning to integrate code classification with execution feedback generation. This approach enables the model to understand the reasons behind incorrect code, distinguishing between correct and incorrect solutions without the need to execute the code during the ranking phase. Experiments on three code generation benchmarks demonstrate that RankEF significantly outperforms the state-of-the-art CodeRanker.
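To make the ranking phase concrete, here is a minimal sketch of non-execution ranking: each candidate gets a predicted probability of being correct, and candidates are sorted by that score without ever being run. The names `rank_candidates`, `score_fn`, and `toy_score` are hypothetical and introduced only for illustration. In RankEF the score would come from the trained ranker; here a toy string check stands in for it.

```python
from typing import Callable, List, Tuple


def rank_candidates(
    task: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],
) -> List[Tuple[str, float]]:
    """Sort candidates by predicted probability of correctness, highest first.

    No candidate is executed; only the scorer's prediction is used.
    """
    scored = [(code, score_fn(task, code)) for code in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def toy_score(task: str, code: str) -> float:
    """Hypothetical stand-in for a trained ranker (NOT the RankEF model)."""
    return 0.9 if "return a + b" in code else 0.1


task = "Write a function add(a, b) that returns the sum of a and b."
candidates = [
    "def add(a, b):\n    return a - b",  # subtle error: wrong operator
    "def add(a, b):\n    return a + b",
]

ranked = rank_candidates(task, candidates, toy_score)
best_code, best_score = ranked[0]
```

In practice the top-ranked candidate (or the top k) would be returned to the developer, which is why the quality of the scorer, rather than the availability of unit tests, determines the result.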

Paper Structure

This paper contains 25 sections, 5 equations, 5 figures, 8 tables, and 1 algorithm.

Figures (5)

  • Figure 1: An example after being ranked by CodeRanker.
  • Figure 2: The architecture of RankEF. Phase A. Dataset construction tailored for RankEF, where CLS Inputs and GEN Inputs represent the inputs for the classification task and the inputs for the execution feedback generation task, respectively. Phase B. Diverse multi-task training strategies for RankEF: ① Hard Parameter Sharing. ② Soft Parameter Sharing. ③ Intermediate Fine-Tuning. Phase C. Ranking process with RankEF.
  • Figure 3: An example of templated execution feedback.
  • Figure 4: Results of RankEF's generalized capabilities on datasets with different models (CodeRanker* is trained on the same model data).
  • Figure 5: Two examples of RankEF ranking successes and failures, respectively.
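The multi-task idea named in the Figure 2 caption (classification plus execution-feedback generation) can be illustrated with a joint objective of the kind a shared-parameter model might optimize. This is a sketch under stated assumptions, not the paper's formulation: the binary correct/incorrect label, the weighted-sum form, and the `gen_weight` parameter are all simplifications introduced here, and the probabilities are passed in directly rather than produced by a model.

```python
import math
from typing import Sequence

_EPS = 1e-12  # guards against log(0)


def classification_loss(p_correct: float, is_correct: bool) -> float:
    """Binary cross-entropy for a single candidate (assumed binary label)."""
    p = min(max(p_correct, _EPS), 1.0 - _EPS)
    return -math.log(p) if is_correct else -math.log(1.0 - p)


def generation_loss(token_probs: Sequence[float]) -> float:
    """Mean negative log-likelihood of the reference feedback tokens.

    token_probs[i] is the probability the model assigned to the i-th
    token of the reference execution feedback.
    """
    if not token_probs:
        raise ValueError("token_probs must be non-empty")
    return -sum(math.log(max(p, _EPS)) for p in token_probs) / len(token_probs)


def joint_loss(
    p_correct: float,
    is_correct: bool,
    token_probs: Sequence[float],
    gen_weight: float = 1.0,  # hypothetical weighting, not from the paper
) -> float:
    """Weighted sum of the classification and feedback-generation losses."""
    return classification_loss(p_correct, is_correct) + gen_weight * generation_loss(
        token_probs
    )


# A candidate judged 50% likely correct, with two feedback tokens at p = 0.5 each.
example = joint_loss(0.5, True, [0.5, 0.5])
```

Because both terms would backpropagate into the same shared parameters, the feedback-generation term pushes the model to represent why a candidate fails, not only whether it fails.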