DOCE: Finding the Sweet Spot for Execution-Based Code Generation

Haau-Sing Li; Patrick Fernandes; Iryna Gurevych; André F. T. Martins

TL;DR

This paper proposes Decoding Objectives for Code Execution (DOCE), a comprehensive framework with candidate generation, $n$-best reranking, minimum Bayes risk (MBR) decoding, and self-debugging as its core components. The findings highlight the importance of execution-based methods and the gap between execution-based and execution-free methods.

Abstract

Recently, a diverse set of decoding and reranking procedures has been shown to be effective for LLM-based code generation. However, a comprehensive framework that links and experimentally compares these methods is missing. We address this by proposing Decoding Objectives for Code Execution (DOCE), a comprehensive framework that includes candidate generation, $n$-best reranking, minimum Bayes risk (MBR) decoding, and self-debugging as the core components. We then study the contributions of these components through execution-based evaluation metrics. Our findings highlight the importance of execution-based methods and the gap between execution-based and execution-free methods. Furthermore, we assess the impact of filtering based on trial unit tests, a simple and effective strategy that has often been overlooked in prior work. We also propose self-debugging on multiple candidates, obtaining state-of-the-art performance on reranking for code generation. We expect our framework to provide a solid guideline for future research on code generation.
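To make the pipeline in the abstract concrete, the following is a minimal sketch of filtering on trial unit tests followed by execution-based MBR selection. It is an illustration under stated assumptions, not the authors' implementation: the function names (`run`, `passes_trial_tests`, `mbr_exec`), the 0/1 exact-match utility over execution outputs, the fallback to the unfiltered pool, and the unsandboxed `exec` call are all hypothetical choices made for brevity.

```python
from typing import Any, List, Sequence, Tuple


def run(src: str, entry: str, arg: Any) -> Tuple[bool, Any]:
    """Execute candidate source and call entry(arg); return (ok, output).

    Assumption: plain exec() with no sandbox or timeout, for illustration only.
    """
    env: dict = {}
    try:
        exec(src, env)
        return True, env[entry](arg)
    except Exception:
        return False, None


def passes_trial_tests(src: str, entry: str,
                       trial_tests: Sequence[Tuple[Any, Any]]) -> bool:
    """True if the candidate returns the expected output on every trial test."""
    for arg, expected in trial_tests:
        ok, out = run(src, entry, arg)
        if not ok or out != expected:
            return False
    return True


def agree(a: List[Tuple[bool, Any]], b: List[Tuple[bool, Any]]) -> int:
    """Hypothetical 0/1 utility: 1 only if both runs succeed and match on all inputs."""
    return int(all(ok_a and ok_b and out_a == out_b
                   for (ok_a, out_a), (ok_b, out_b) in zip(a, b)))


def mbr_exec(candidates: Sequence[str], entry: str, inputs: Sequence[Any],
             trial_tests: Sequence[Tuple[Any, Any]] = ()) -> str:
    """Filter on trial tests, then return the candidate with highest total agreement."""
    pool = [c for c in candidates if passes_trial_tests(c, entry, trial_tests)]
    if not pool:
        # Assumption: if nothing passes the filter, fall back to all candidates.
        pool = list(candidates)
    outputs = [[run(c, entry, x) for x in inputs] for c in pool]
    scores = [sum(agree(mine, other) for other in outputs) for mine in outputs]
    best = max(range(len(pool)), key=scores.__getitem__)  # first maximum wins ties
    return pool[best]


# Toy candidates for a "double the input" task.
double_a = "def solve(x):\n    return x * 2\n"
double_b = "def solve(x):\n    return x + x\n"
square = "def solve(x):\n    return x ** 2\n"
plus_one = "def solve(x):\n    return x + 1\n"
candidates = [square, double_a, double_b, plus_one]

chosen = mbr_exec(candidates, "solve", inputs=[1, 2, 3])
chosen_filtered = mbr_exec(candidates, "solve", inputs=[1, 2, 3],
                           trial_tests=[(3, 9)])
```

Without trial tests, the two doubling candidates agree with each other on every input and therefore outscore the rest. With the single trial test `(3, 9)`, only the squaring candidate survives the filter, which shows how filtering can override the majority behaviour among samples.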

Paper Structure (45 sections, 7 equations, 24 figures, 6 tables)

Introduction
Candidate Generation, Self-Debugging, and Reranking
Candidate generation
Reranking
n-Best Reranking.
Likelihood-based Feature.
Execution-based Feature.
Scores from External Models.
MBR Decoding
Execution-based Metrics.
External Models for MBR.
Self-Debugging
DOCE Framework
Self-Debugging with Single Selected Candidate from Reranking.
Self-Debugging on All Candidates before Reranking.
...and 30 more sections

Figures (24)

Figure 1: The Decoding Objectives for Code Execution (DOCE) framework. First, multiple candidates are generated through sampling. Each candidate is then assigned a score using an $n$-best reranker or MBR, and the candidate with the highest score is returned. Self-Debug can be applied to multiple candidates before scoring, as we propose, or to the highest-scoring candidate, as proposed by Chen et al. (2024).
Figure 2: Performance of reranking and oracle performance over different numbers of generated candidates using CodeLlama-7B-Instruct with temperature 1.6 for HumanEval+ and MBPP+, and 1.2 for LiveCodeBench. Results are averaged across at least 2 runs for LiveCodeBench and 4 runs for the rest.
Figure 3: Performance of reranking and oracle over sampling temperatures using CodeLlama-7B-Instruct with 50 generated candidates over 4 runs.
Figure 4: Performance of MBR-Exec with fewer unit tests.
Figure 5: Improvement in Pass@k of CodeLlama-7B-Instruct after Self-Debug compared to no Self-Debug applied.
...and 19 more figures
