DOCE: Finding the Sweet Spot for Execution-Based Code Generation

Haau-Sing Li, Patrick Fernandes, Iryna Gurevych, André F. T. Martins

TL;DR

The paper proposes Decoding Objectives for Code Execution (DOCE), a comprehensive framework whose core components are candidate generation, $n$-best reranking, minimum Bayes risk (MBR) decoding, and self-debugging. The findings highlight the importance of execution-based methods and the gap between execution-based and execution-free methods.

Abstract

Recently, a diverse set of decoding and reranking procedures has been shown to be effective for LLM-based code generation. However, a comprehensive framework that links and experimentally compares these methods is missing. We address this by proposing Decoding Objectives for Code Execution, a comprehensive framework that includes candidate generation, $n$-best reranking, minimum Bayes risk (MBR) decoding, and self-debugging as the core components. We then study the contributions of these components through execution-based evaluation metrics. Our findings highlight the importance of execution-based methods and the gap between execution-based and execution-free methods. Furthermore, we assess the impact of filtering based on trial unit tests, a simple and effective strategy that has often been overlooked in prior work. We also propose self-debugging on multiple candidates, obtaining state-of-the-art performance on reranking for code generation. We expect our framework to provide a solid guideline for future research on code generation.

Paper Structure (45 sections, 7 equations, 24 figures, 6 tables)


Figures (24)

  • Figure 1: The Decoding Objectives for Code Execution (DOCE) Framework. First, multiple candidates are generated through sampling. Each candidate is then assigned a score using an $n$-best reranker or MBR, and the candidate with the highest score is returned. Self-Debug can be applied to multiple candidates before scoring, as we propose, or only to the highest-scoring candidate, as proposed by Chen et al. (2024).
  • Figure 2: Performance of reranking and oracle performance over different numbers of generated candidates using CodeLlama-7B-Instruct with temperature 1.6 for HumanEval+ and MBPP+, and 1.2 for LiveCodeBench. Results are averaged across at least 2 runs for LiveCodeBench and 4 runs for the rest.
  • Figure 3: Performance of reranking and oracle over sampling temperatures using CodeLlama-7B-Instruct with 50 generated candidates over 4 runs.
  • Figure 4: Performance of MBR-Exec with fewer unit tests.
  • Figure 5: Improvement in Pass@k of CodeLlama-7B-Instruct after Self-Debug compared to no Self-Debug applied.
  • ...and 19 more figures
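
The pipeline in the Figure 1 caption (sample candidates, optionally filter them on trial unit tests, then score them with execution-based MBR and return the top one) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: candidates are plain Python callables standing in for generated programs, and the names `passes_trial_tests`, `run_on_inputs`, and `select_candidate` are hypothetical. The utility used is exact agreement of execution outputs, which is one common choice for execution-based MBR; the paper's exact scoring functions are not shown in this excerpt.

```python
from collections import Counter
from typing import Any, Callable, Sequence, Tuple

Candidate = Callable[[Any], Any]  # stand-in for a generated program


def passes_trial_tests(fn: Candidate, trial_tests: Sequence[Tuple[Any, Any]]) -> bool:
    """Return True if the candidate reproduces every (input, expected) pair."""
    for x, expected in trial_tests:
        try:
            if fn(x) != expected:
                return False
        except Exception:
            return False
    return True


def run_on_inputs(fn: Candidate, inputs: Sequence[Any]) -> Tuple[str, ...]:
    """Execute the candidate on each input and record a comparable signature."""
    outputs = []
    for x in inputs:
        try:
            outputs.append(repr(fn(x)))
        except Exception as exc:
            outputs.append(f"error:{type(exc).__name__}")
    return tuple(outputs)


def select_candidate(
    candidates: Sequence[Candidate],
    trial_tests: Sequence[Tuple[Any, Any]],
    mbr_inputs: Sequence[Any],
) -> Candidate:
    """Filter on trial tests, then pick the candidate most others agree with."""
    survivors = [c for c in candidates if passes_trial_tests(c, trial_tests)]
    pool = survivors or list(candidates)  # fall back if nothing passes
    signatures = [run_on_inputs(c, mbr_inputs) for c in pool]
    # With a 0/1 exact-match utility, the expected utility of a candidate is
    # proportional to how many candidates share its execution signature.
    counts = Counter(signatures)
    best = max(range(len(pool)), key=lambda i: counts[signatures[i]])
    return pool[best]


# Toy example: four "generated" programs for the task "double the input".
candidates = [lambda x: x * 2, lambda x: x + x, lambda x: x ** 2, lambda x: x + 1]
chosen = select_candidate(candidates, trial_tests=[(2, 4)], mbr_inputs=[0, 1, 3])
```

In this toy run the trial test `(2, 4)` removes `x + 1`, but it cannot separate `x ** 2` from the two correct programs; the agreement step does, because the two doubling programs share the same outputs on the extra inputs. A real setup would execute generated source code in a sandbox with timeouts rather than call in-process lambdas.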