How Do Your Code LLMs Perform? Empowering Code Instruction Tuning with High-Quality Data

Yejie Wang, Keqing He, Dayuan Fu, Zhuoma Gongque, Heyang Xu, Yanxu Chen, Zhexu Wang, Yujia Fu, Guanting Dong, Muxi Diao, Jingang Wang, Mengdi Zhang, Xunliang Cai, Weiran Xu

TL;DR

This paper tackles data quality in code instruction tuning, showing that data leakage inflates HumanEval performance while results on other benchmarks such as LiveCodeBench lag behind. It proposes a data pruning framework built on three dimensions (instruction complexity, response quality, and instruction diversity), implemented with a Complexity Scorer, a Unit Test Model, and Diversity-based Sampling. The authors train XCoder, a family of LLaMA3-based models, on the selected data and report state-of-the-art or competitive results on LiveCodeBench and HumanEval with substantially less data, along with a detailed analysis of data-source characteristics. The work offers practical guidance for constructing high-quality code instruction data and highlights the value of targeted data selection over sheer scale.

Abstract

Recently, there has been growing interest in studying how to construct better code instruction tuning data. However, we observe that code models trained with these datasets exhibit high performance on HumanEval but perform worse on other benchmarks such as LiveCodeBench. Upon further investigation, we find that many datasets suffer from severe data leakage. After cleaning up most of the leaked data, some well-known high-quality datasets perform poorly. This discovery reveals a new challenge: identifying which datasets genuinely qualify as high-quality code instruction data. To address this, we propose an efficient code data pruning strategy for selecting good samples. Our approach is based on three dimensions: instruction complexity, response quality, and instruction diversity. Based on our selected data, we present XCoder, a family of models finetuned from LLaMA3. Our experiments show that XCoder achieves new state-of-the-art performance using less training data, which verifies the effectiveness of our data strategy. Moreover, we perform a comprehensive analysis of the data composition and find that existing code datasets have different characteristics according to their construction methods, which provides new insights for future code LLMs. Our models and dataset are released at https://github.com/banksy23/XCoder
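
To make the three selection dimensions concrete, here is a minimal Python sketch of how such a pruning step could be organized. It is not the authors' implementation: the `Sample` fields, the product `complexity * quality` used for ranking, the cosine-distance threshold `tau`, and the toy scores and embeddings are all assumptions made for illustration. In practice the scores would come from something like a complexity scorer and a unit-test-based quality check, and the embeddings from an instruction encoder.

```python
import math
from dataclasses import dataclass


@dataclass
class Sample:
    instruction: str
    complexity: float   # hypothetical score from a complexity scorer
    quality: float      # hypothetical score, e.g. share of unit tests passed
    embedding: tuple    # hypothetical instruction embedding


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def select(pool, budget, tau=0.1):
    """Greedy selection: rank by complexity * quality (an assumed
    combination), then keep a sample only if it is at least `tau`
    away from every sample already kept."""
    ranked = sorted(pool, key=lambda s: s.complexity * s.quality, reverse=True)
    selected = []
    for sample in ranked:
        if len(selected) >= budget:
            break
        if all(cosine_distance(sample.embedding, kept.embedding) >= tau
               for kept in selected):
            selected.append(sample)
    return selected


# Toy pool with made-up scores and 2-d embeddings.
pool = [
    Sample("reverse a string", 1.0, 0.9, (1.0, 0.0)),
    Sample("reverse a string in place", 1.2, 0.9, (0.99, 0.05)),
    Sample("implement an LRU cache", 3.0, 0.8, (0.0, 1.0)),
    Sample("implement a trie", 2.5, 0.2, (0.5, 0.5)),
]
picked = select(pool, budget=3, tau=0.1)
print([s.instruction for s in picked])
```

In this toy pool, "reverse a string" is skipped because its embedding is nearly identical to the higher-ranked "reverse a string in place", which is the effect a diversity constraint is meant to have. Ranking first and filtering second keeps the step simple, at the cost of a pairwise comparison against everything already selected.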

Paper Structure (37 sections, 9 figures, 9 tables, 1 algorithm)

Figures (9)

  • Figure 1: The left panel compares performance on different benchmarks; the right panel shows how results change after data decontamination. Magicoder Evol-Instruct and Code-Feedback may have data leakage on HumanEval.
  • Figure 2: Illustration of our data selection approach.
  • Figure 3: Comparison of the performance of XCoder and other mainstream models on LiveCodeBench. Results for other models are sourced from the LiveCodeBench Leaderboard (LiveCode22). For XCoder, we keep the same settings as the other models: a temperature of 0.2 and 10 sampled solutions per question. The full names of GPT-4, Claude-3, Gemini Pro 1.5, GPT-3.5-Turbo, CQ-7B-Chat and MagicoderS-CL-7B are GPT-4o-2024-05-13, GPT-4-Turbo-2024-04-09, Claude-3-opus, Gemini Pro 1.5-May, GPT-3.5-Turbo-0125, CodeQwen15-7B-chat and MagicoderS-CodeLLaMA-7B. We also compare the models' performance on HumanEval. The complete results can be found in the appendix.
  • Figure 4: Comparison of the accuracy of Unit Test Models trained on different sizes when generating test cases. We additionally evaluate the ability of GPT-4 to generate test cases.
  • Figure 5: The contribution ratio of different data sources to XCoder, with (a) representing the source of the 160K samples with the highest complexity, (b) representing the 160K samples with the highest quality, and (c) and (d) reflecting which dataset has better diversity.
  • ...and 4 more figures