CYCLE: Learning to Self-Refine the Code Generation

Yangruibo Ding; Marcus J. Min; Gail Kaiser; Baishakhi Ray

TL;DR

This paper proposes the CYCLE framework, which learns to self-refine faulty generations according to available feedback, such as the execution results reported by test suites. It shows that CYCLE maintains, and sometimes improves, the quality of one-time code generation while significantly improving the self-refinement capability of code LMs.

Abstract

Pre-trained code language models have achieved promising performance in code generation and improved the programming efficiency of human developers. However, their self-refinement capability is typically overlooked by existing evaluations of code LMs, which focus only on the accuracy of the one-time prediction. When code LMs fail to implement the correct program, developers find it hard to debug and fix the faulty prediction, since they did not write it themselves. Unfortunately, our study reveals that code LMs cannot efficiently self-refine their faulty generations either. In this paper, we propose the CYCLE framework, which learns to self-refine faulty generations according to available feedback, such as the execution results reported by test suites. We evaluate CYCLE on three popular code generation benchmarks: HumanEval, MBPP, and APPS. The results reveal that CYCLE maintains, and sometimes improves, the quality of one-time code generation, while significantly improving the self-refinement capability of code LMs. We implement four variants of CYCLE with 350M, 1B, 2B, and 3B parameters, and the experiments show that CYCLE consistently boosts code generation performance, by up to 63.5%, across benchmarks and model sizes. We also notice that CYCLE outperforms code LMs that have 3$\times$ more parameters in self-refinement.
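To make the idea of refining against execution feedback concrete, here is a minimal sketch of a generate, test, refine loop. It is not the authors' implementation: the helper names (`run_tests`, `build_prompt`, `self_refine`), the prompt layout, and the toy `toy_generate` stand-in for a code LM are all assumptions made for illustration.

```python
from typing import Callable, Optional, Tuple


def run_tests(code: str, tests: str) -> Tuple[bool, str]:
    """Run candidate code and then its tests; return (passed, feedback).

    Illustration only: exec() on untrusted model output is unsafe outside
    a sandbox.
    """
    namespace: dict = {}
    try:
        exec(code, namespace)
        exec(tests, namespace)
    except Exception as exc:
        return False, f"{type(exc).__name__}: {exc}"
    return True, "all tests passed"


def build_prompt(problem: str, faulty_code: Optional[str],
                 feedback: Optional[str]) -> str:
    """Join problem, previous attempt, and feedback (hypothetical layout)."""
    if faulty_code is None:
        return problem
    return (f"{problem}\n# Previous attempt:\n{faulty_code}\n"
            f"# Execution feedback:\n{feedback}\n")


def self_refine(generate: Callable[[str], str], problem: str, tests: str,
                max_rounds: int = 3) -> Tuple[str, bool, int]:
    """Generate once, then refine up to max_rounds times using test feedback."""
    code: Optional[str] = None
    feedback: Optional[str] = None
    passed = False
    for round_idx in range(max_rounds + 1):
        code = generate(build_prompt(problem, code, feedback))
        passed, feedback = run_tests(code, tests)
        if passed:
            return code, True, round_idx
    return code, passed, max_rounds


def toy_generate(prompt: str) -> str:
    """Stand-in for a code LM: wrong at first, correct once it sees feedback."""
    if "Execution feedback" in prompt:
        return "def add(a, b):\n    return a + b\n"
    return "def add(a, b):\n    return a - b\n"


code, passed, rounds = self_refine(
    toy_generate, "Write add(a, b).", "assert add(2, 3) == 5")
```

In this toy run the first attempt fails its test and the second attempt, conditioned on the failure message, passes. A real setup would replace `toy_generate` with a model call and run the tests in an isolated process.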

Paper Structure (43 sections, 2 equations, 6 figures, 5 tables)

Introduction
Limitations of code LMs in the exploration mode
Our Approach
Results
Novelty and Contributions.
Overview
Phase-I: Data Preparation for Self-Refinement
Phase-II: Learning to Refine the Faulty Generation
Phase-III: Self-Refinement as Iterative Programming
Approach
Phase-I: Data Preparation
Fine-tune Code LMs with Semantically Correct Code
Prompt Code LMs to Distill the Weaknesses
Phase-II: Learning to Refine the Faulty Generation
Aggregate the Problem Description, Faulty Generation, and Execution Results as a Joint Prior Condition
...and 28 more sections

Figures (6)

Figure 1: Motivation Example. We prompt GPT-3.5 and Cycle to implement a program according to a problem description from the HumanEval programming benchmark (Task No. 106). While both failed to pass the test suite in the acceleration mode, Cycle successfully refined its own generation referring to the execution feedback. In contrast, GPT-3.5 could not self-refine effectively.
Figure 2: Overview of Cycle.
Figure 3: Template to aggregate the problem description, faulty generation, and the execution feedback.
Figure 4: Performance improvement with self-refinement. The blue curve represents the "vanilla code LM baseline", using CodeGen-350M. The orange curve represents the "code LMs fine-tuned with the correct code". The green curve represents Cycle-350M.
Figure 5: Cycle's self-refinement is orthogonal to the top-k generation. Conceptually, top-k generation explores in breadth, while self-refinement improves a specific generation in depth. Empirically, when compared to nucleus sampling (with the temperature of 0.2, 0.6, 0.8) and beam search, self-refinement optimizes the generated code towards execution guidance, while top-k produces more diverse programs that pass a similar amount but complementary test cases.
...and 1 more figures
