Exploring Data-Efficient Adaptation of Large Language Models for Code Generation

Xue Jiang, Yihong Dong, Zhiyuan Fan, Zhi Jin, Wenpin Jiao, Ge Li

TL;DR

The paper addresses the challenge of adapting large language models (LLMs) for code generation when training data is scarce. It introduces DEED, a data-efficient adaptation pipeline built on error-driven learning: it collects the model's erroneous outputs, automatically revises that code with Self-Revise, fine-tunes the model on the revised examples, and repeats the cycle. Across five public benchmarks and multiple LLMs, DEED consistently outperforms mainstream adaptation techniques, achieving substantial improvements in Pass@k metrics and remaining robust to variations in model size and dataset. The approach reduces data requirements and increases learning efficiency, offering a practical path to domain-specific code generation where large labeled datasets are unavailable.
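Pass@k is reported throughout but not defined on this page. As a point of reference, the sketch below implements the unbiased pass@k estimator that is commonly used with code generation benchmarks. Whether the paper computes Pass@k in exactly this way is an assumption, and the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of samples generated for the problem
    c: number of those samples that pass all tests
    k: number of samples the metric is allowed to draw
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k samples are wrong, every draw of k contains a correct one.
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k drawn samples are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) pairs per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k = 1 the estimate reduces to the fraction of correct samples, c / n, which is why Pass@1 is often read as single-attempt accuracy.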

Abstract

Although Large Language Models (LLMs) have made significant progress in code generation, they still struggle with code generation tasks in specific scenarios. These scenarios usually require adapting LLMs to specific needs, but the limited training data available in practice leads to poor code generation performance. How to effectively adapt LLMs to new scenarios with little training data is therefore a major challenge for current code generation. In this paper, we propose a novel adaptation approach named DEED, which stands for Data-Efficient adaptation with Error-Driven learning for code generation. DEED treats the errors made by LLMs as learning opportunities, using error revision to overcome the models' own shortcomings and thus learn efficiently. Specifically, DEED involves identifying erroneous code generated by LLMs, employing Self-Revise for code revision, optimizing the model with the revised code, and iterating this process for continuous improvement. Experimental results show that, compared to other mainstream fine-tuning approaches, DEED achieves superior performance with little training data, showing an average relative improvement of 46.2% in Pass@1 on multiple code generation benchmarks. We also validate the effectiveness of Self-Revise, which generates revised code that optimizes the model more efficiently than the code samples from datasets. Moreover, DEED consistently demonstrates strong performance across various LLMs, underscoring its applicability.
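The four steps named in the abstract (identify erroneous code, revise it, optimize on the revisions, iterate) can be sketched as a simple loop. This is a minimal illustration, not the authors' implementation: the callables `generate`, `passes_tests`, `self_revise`, and `fine_tune` are hypothetical placeholders, and checking revised code against tests before training on it is an assumption.

```python
from typing import Any, Callable, List, Tuple

Example = Tuple[str, str]  # (requirement, code)


def deed_loop(
    model: Any,
    requirements: List[str],
    generate: Callable[[Any, str], str],          # hypothetical: model -> code
    passes_tests: Callable[[str, str], bool],     # hypothetical: test oracle
    self_revise: Callable[[Any, str, str], str],  # hypothetical: code revision
    fine_tune: Callable[[Any, List[Example]], Any],
    iterations: int = 2,
) -> Any:
    """Error-driven adaptation loop, sketched under the assumptions above."""
    for _ in range(iterations):
        revised: List[Example] = []
        for req in requirements:
            code = generate(model, req)
            if passes_tests(req, code):
                continue  # only erroneous outputs are collected
            fixed = self_revise(model, req, code)
            if passes_tests(req, fixed):  # assumption: keep verified revisions only
                revised.append((req, fixed))
        if not revised:
            break  # nothing left to learn from in this round
        model = fine_tune(model, revised)
    return model


# Toy stand-ins so the loop runs end to end: the "model" is a lookup table.
toy_model: dict = {}
adapted = deed_loop(
    toy_model,
    ["add", "sub"],
    generate=lambda m, r: m.get(r, "wrong"),
    passes_tests=lambda r, c: c == "ok:" + r,
    self_revise=lambda m, r, c: "ok:" + r,
    fine_tune=lambda m, ex: {**m, **dict(ex)},
)
```

In the toy run, the first iteration collects both failures and "fine-tunes" on their revisions; the second finds no errors and stops early. In a real setting each callable would wrap an LLM call, a test harness, and a training step.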

Paper Structure

This paper contains 26 sections, 12 equations, 5 figures, 7 tables, 1 algorithm.

Figures (5)

  • Figure 1: The performance of direct generation, fine-tuning, and our proposed DEED on the MBPP dataset with limited data. The numbers on the bars indicate the amount of training data used by each method.
  • Figure 2: An overview of the proposed DEED and its differences from traditional fine-tuning methods.
  • Figure 3: Illustration of automatic code revision.
  • Figure 4: Cases for two settings of Self-Revise, where "-" and "+" respectively indicate lines of code before and after revision.
  • Figure 5: Performance analysis with varying sizes of training data on the MBPP dataset.