
InverseCoder: Self-improving Instruction-Tuned Code LLMs with Inverse-Instruct

Yutong Wu, Di Huang, Wenxuan Shi, Wei Wang, Lingzhe Gao, Shihao Liu, Ziyuan Nan, Kaizhao Yuan, Rui Zhang, Xishan Zhang, Zidong Du, Qi Guo, Yewen Pu, Dawei Yin, Xing Hu, Yunji Chen

TL;DR

InverseCoder introduces Inverse-Instruct, a self-improvement pipeline that generates additional instruction data by translating code snippets into natural language, then uses self-evaluation to filter the generated instructions before adding them to the training set. By combining code summarization with pseudo-probability scoring, all within an open-source framework, it achieves near-state-of-the-art performance on multiple benchmarks with 6.7B-parameter models. The approach reduces reliance on expensive closed-source data and shows consistent improvements on Python, multilingual, and data-science tasks. Together with extensive ablations and scaling analyses, the work highlights both the viability and the limits of self-generated instruction data for open-source code LLMs, pointing toward more cost-effective model enhancement.

Abstract

Recent advancements in open-source code large language models (LLMs) have been driven by fine-tuning on data generated by powerful closed-source LLMs, which is expensive to obtain. This paper explores whether it is possible to use a fine-tuned open-source model to generate additional data to augment its instruction-tuning dataset. We make two observations: (1) A code snippet can serve as the response to different instructions. (2) Instruction-tuned code LLMs perform better at translating code into instructions than the reverse. Based on these observations, we propose Inverse-Instruct, a data augmentation technique that uses a fine-tuned LLM to generate additional instructions for the code responses in its own training dataset. The additional instruction-response pairs are added to the original dataset, and a stronger code LLM can be obtained by fine-tuning on the augmented dataset. We empirically validate Inverse-Instruct on a range of open-source code models (e.g., CodeLlama-Python and DeepSeek-Coder) and benchmarks (e.g., HumanEval(+), MBPP(+), DS-1000 and MultiPL-E), showing it consistently improves the base models.
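
The augmentation loop described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `inverse_instruct`, `summarize`, and `score` are hypothetical names, and the two callables stand in for calls to the fine-tuned model (one proposing candidate instructions for a code snippet, the other rating how well an instruction matches it). Keeping only the single best-scoring candidate per snippet is also an assumption made for simplicity.

```python
from typing import Callable, Dict, List

Pair = Dict[str, str]  # {"instruction": ..., "response": ...}


def inverse_instruct(
    dataset: List[Pair],
    summarize: Callable[[str], List[str]],  # hypothetical: code -> candidate instructions
    score: Callable[[str, str], float],     # hypothetical: (instruction, code) -> quality score
) -> List[Pair]:
    """Return the original pairs plus one new instruction per code response."""
    augmented = list(dataset)  # keep every original pair
    for pair in dataset:
        code = pair["response"]
        candidates = summarize(code)
        if not candidates:
            continue  # nothing generated for this snippet
        # Self-evaluation step: keep the candidate the scorer rates highest.
        best = max(candidates, key=lambda ins: score(ins, code))
        augmented.append({"instruction": best, "response": code})
    return augmented


# Toy stand-ins for the model calls (assumptions, for illustration only).
def toy_summarize(code: str) -> List[str]:
    return ["Write code.", "Write a function that adds two numbers."]


def toy_score(instruction: str, code: str) -> float:
    return float(len(instruction))  # pretend longer means more specific


data = [{"instruction": "Add two numbers.",
         "response": "def add(a, b): return a + b"}]
augmented_data = inverse_instruct(data, toy_summarize, toy_score)
```

In a real setting both callables would query the same fine-tuned model, and the augmented dataset would then be used for another round of fine-tuning.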

Paper Structure

This paper contains 60 sections, 2 equations, 13 figures, 17 tables, 1 algorithm.

Figures (13)

  • Figure 1: The overview of Inverse-Instruct. Inverse-Instruct utilizes the models' own capability in code summarization to generate an inverse instruction dataset which can further enhance the model's performance. Inverse-Instruct consists of three steps, including code preprocessing, code summarization, and self-evaluation & data selection.
  • Figure 2: Impact of data scaling. The dashed line represents HumanEval and the solid line represents HumanEval+. The legend entries "Original" and "Ours" denote the original models and the models improved by Inverse-Instruct, respectively.
  • Figure 3: The prompts of Inverse-Instruct for code summarization, self-evaluation, and instruction-tuning. For code summarization, we use a diverse set of initial verbs in the prompt prefixes to keep the generated instructions diverse. We first count how often each verb appears as the first verb of an instruction in the original dataset and choose the 5 most frequent verbs. We then ask ChatGPT for similar verbs to expand this pool of first verbs for the prompt prefixes.
  • Figure 4: An example response with multiple parts of code.
  • Figure 5: An example of a summarization mistake.
  • ...and 8 more figures
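
The verb-counting step mentioned in the Figure 3 caption can be approximated as follows. This is a rough sketch, not the authors' code: `top_first_words` is a hypothetical helper, and it treats the first whitespace-delimited word of each instruction as a proxy for its first verb, which is an assumption (no part-of-speech tagging is performed). The later expansion of the pool with similar verbs is not shown.

```python
from collections import Counter
from typing import Iterable, List


def top_first_words(instructions: Iterable[str], k: int = 5) -> List[str]:
    """Return the k most frequent leading words across the instructions."""
    counts: Counter = Counter()
    for text in instructions:
        words = text.strip().split()
        if not words:
            continue  # skip empty instructions
        # Assumption: the leading word stands in for the first verb.
        first = words[0].lower().strip(".,:;!?")
        if first:
            counts[first] += 1
    return [word for word, _ in counts.most_common(k)]


sample = [
    "Write a function that reverses a string.",
    "Write a class for a stack.",
    "Implement a queue.",
    "Write unit tests.",
    "Implement binary search.",
    "Create a dictionary of counts.",
]
top_two = top_first_words(sample, k=2)
```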