InverseCoder: Self-improving Instruction-Tuned Code LLMs with Inverse-Instruct
Yutong Wu, Di Huang, Wenxuan Shi, Wei Wang, Lingzhe Gao, Shihao Liu, Ziyuan Nan, Kaizhao Yuan, Rui Zhang, Xishan Zhang, Zidong Du, Qi Guo, Yewen Pu, Dawei Yin, Xing Hu, Yunji Chen
TL;DR
InverseCoder introduces Inverse-Instruct, a self-improvement pipeline that generates additional instruction data by translating code snippets into natural language, then uses self-evaluation to filter the generated instructions before adding them to the training set. By combining code summarization with pseudo-probability scoring, using only open-source models, it reaches near state-of-the-art performance on multiple benchmarks with 6.7B-parameter models. The approach reduces reliance on expensive closed-source data and shows consistent improvements on Python, multilingual, and data-science tasks. Together with extensive ablations and scaling analyses, the work shows both the viability and the limits of self-generated instruction data for open-source code LLMs, pointing toward more cost-effective model enhancement.
Abstract
Recent advancements in open-source code large language models (LLMs) have been driven by fine-tuning on data generated by powerful closed-source LLMs, which is expensive to obtain. This paper explores whether a fine-tuned open-source model can generate additional data to augment its own instruction-tuning dataset. We make two observations: (1) A code snippet can serve as the response to different instructions. (2) Instruction-tuned code LLMs perform better at translating code into instructions than the reverse. Based on these observations, we propose Inverse-Instruct, a data augmentation technique that uses a fine-tuned LLM to generate additional instructions for the code responses in its own training dataset. The additional instruction-response pairs are added to the original dataset, and a stronger code LLM can be obtained by fine-tuning on the augmented dataset. We empirically validate Inverse-Instruct on a range of open-source code models (e.g., CodeLlama-Python and DeepSeek-Coder) and benchmarks (e.g., HumanEval(+), MBPP(+), DS-1000, and MultiPL-E), showing that it consistently improves the base models.
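The augmentation loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `summarize` and `score` are hypothetical callables standing in for the fine-tuned LLM's code-to-instruction generation and its self-evaluation, and the `threshold` filter is an assumed way of discarding weak candidates.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (instruction, code response)


def inverse_instruct(
    dataset: List[Pair],
    summarize: Callable[[str], List[str]],  # hypothetical: code -> candidate instructions
    score: Callable[[str, str], float],     # hypothetical: self-evaluation of (instruction, code)
    threshold: float = 0.5,                 # assumed cut-off, not taken from the paper
) -> List[Pair]:
    """Return the original pairs plus self-generated (instruction, code) pairs."""
    new_pairs: List[Pair] = []
    for _, code in dataset:
        # Translate the code response back into several candidate instructions.
        candidates = summarize(code)
        if not candidates:
            continue
        # Keep only the candidate the model itself rates highest for this code.
        best = max(candidates, key=lambda inst: score(inst, code))
        if score(best, code) >= threshold:
            new_pairs.append((best, code))
    # The augmented dataset is the original data plus the new pairs.
    return dataset + new_pairs


# Toy stand-ins so the sketch runs without a model.
def toy_summarize(code: str) -> List[str]:
    name = code.split("(")[0].replace("def ", "").strip()
    return [f"Write a Python function named {name}.", "Do something."]


def toy_score(instruction: str, code: str) -> float:
    name = code.split("(")[0].replace("def ", "").strip()
    return 1.0 if name in instruction else 0.0


seed = [("Add two numbers.", "def add(a, b):\n    return a + b")]
augmented = inverse_instruct(seed, toy_summarize, toy_score)
```

In a real setting both callables would be backed by the same fine-tuned model, and the augmented dataset would then be used for a further round of fine-tuning.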
