Data-efficient LLM Fine-tuning for Code Generation

Weijie Lv, Xuan Xia, Sheng-Jun Huang

TL;DR

Open-source code LLMs lag behind closed models in code generation tasks. This work tackles data efficiency by combining quality-focused data selection with a context-aware tokenization strategy. The selection pipeline balances data complexity against distribution-preserving sampling: it partitions the data with K-Means clustering and ranks instances within each cluster by an Instruction Following Difficulty score. The "dynamic pack" tokenization makes fuller use of the context window and minimizes padding tokens, which reduces computational resource usage. Across OSS-Instruct and Evol-Instruct, with DeepSeek-Coder-Base-6.7B and CodeLlama-Python-7B, training on 30–40% of the data can match or exceed full-dataset performance while reducing training time and peak GPU memory.

Abstract

Large language models (LLMs) have demonstrated significant potential in code generation tasks. However, there remains a performance gap between open-source and closed-source models. To address this gap, existing approaches typically generate large amounts of synthetic data for fine-tuning, which often leads to inefficient training. In this work, we propose a data selection strategy in order to improve the effectiveness and efficiency of training for code-based LLMs. By prioritizing data complexity and ensuring that the sampled subset aligns with the distribution of the original dataset, our sampling strategy effectively selects high-quality data. Additionally, we optimize the tokenization process through a "dynamic pack" technique, which minimizes padding tokens and reduces computational resource consumption. Experimental results show that when training on 40% of the OSS-Instruct dataset, the DeepSeek-Coder-Base-6.7B model achieves an average performance of 66.9%, surpassing the 66.1% performance with the full dataset. Moreover, training time is reduced from 47 minutes to 34 minutes, and the peak GPU memory decreases from 61.47 GB to 42.72 GB during a single epoch. Similar improvements are observed with the CodeLlama-Python-7B model on the Evol-Instruct dataset. By optimizing both data selection and tokenization, our approach not only improves model performance but also improves training efficiency.
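The abstract describes "dynamic pack" only as a way to minimize padding tokens, so the following is a rough sketch of one way such packing could work, not the authors' implementation. It assumes a greedy first-fit-decreasing strategy that groups tokenized samples so that each group fits within a maximum context length. The function names `pack_sequences` and `padding_tokens` and the example lengths are hypothetical.

```python
from typing import List


def pack_sequences(lengths: List[int], max_len: int) -> List[List[int]]:
    """Group sample indices so each group's total length fits in max_len.

    Assumption: greedy first-fit-decreasing bin packing. The paper's
    "dynamic pack" may differ in its details.
    """
    # Visit samples from longest to shortest.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    groups: List[List[int]] = []
    free: List[int] = []  # free[b] = unused tokens left in groups[b]
    for i in order:
        n = min(lengths[i], max_len)  # over-long samples count as truncated
        for b, room in enumerate(free):
            if n <= room:
                groups[b].append(i)
                free[b] -= n
                break
        else:
            # No existing group has room, so open a new one.
            groups.append([i])
            free.append(max_len - n)
    return groups


def padding_tokens(groups: List[List[int]], lengths: List[int], max_len: int) -> int:
    """Count padding tokens needed to fill every group up to max_len."""
    return sum(
        max_len - sum(min(lengths[i], max_len) for i in group)
        for group in groups
    )


# Hypothetical token counts for six samples and a 1024-token context.
example_lengths = [700, 300, 512, 200, 900, 100]
packed = pack_sequences(example_lengths, max_len=1024)
unpacked = [[i] for i in range(len(example_lengths))]
print(packed, padding_tokens(packed, example_lengths, 1024))
print(padding_tokens(unpacked, example_lengths, 1024))
```

In this toy case the six samples fit into three packed rows instead of six padded rows, which illustrates why fewer padding tokens can translate into less compute and memory per epoch. A real training loop would also need to handle attention masking across the samples packed into one row, which this sketch omits.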

Paper Structure

This paper contains 19 sections, 4 equations, 2 figures, and 3 tables.

Figures (2)

  • Figure 1: The overview of our proposed data selection strategy, including three steps. Step 1: Partitioning the synthetic dataset into multiple clusters. Step 2: Computing the Instruction Following Difficulty score by comparing the model's perplexity with and without instructions. Step 3: Sampling the top m% instances from each re-ranked cluster to form a high-complexity sub-dataset that preserves data consistency. Finally, the selected data is used for fine-tuning open-source code LLMs.
  • Figure 2: Impact of sampling rates on model performance. The results demonstrate that model performance peaks at sampling rates between 30% and 40%, after which it begins to decline. "Average Full Data" denotes the model's average performance on the HumanEval and MBPP benchmarks when trained on the full data. "Average" reflects the model's average performance on these two benchmarks at different sampling rates.
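The three steps in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code. The cluster IDs are assumed to come from a K-Means run over instruction embeddings (not shown here). The Instruction Following Difficulty score is assumed to be the ratio of the model's perplexity on the answer with the instruction to its perplexity without it. The names `ifd_score` and `select_top_per_cluster` and all example values are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence


def ifd_score(loss_with_instruction: float, loss_without_instruction: float) -> float:
    """Assumed IFD form: PPL(answer | instruction) / PPL(answer).

    Each perplexity is exp(mean token-level cross-entropy loss on the answer).
    A higher ratio means the instruction helps less, i.e. a harder sample.
    """
    return math.exp(loss_with_instruction) / math.exp(loss_without_instruction)


def select_top_per_cluster(
    cluster_ids: Sequence[int], scores: Sequence[float], rate: float
) -> List[int]:
    """Keep the top `rate` fraction of samples, by score, within each cluster.

    Sampling per cluster rather than globally keeps the subset's cluster
    proportions close to those of the full dataset.
    """
    by_cluster: Dict[int, List[int]] = defaultdict(list)
    for idx, cluster in enumerate(cluster_ids):
        by_cluster[cluster].append(idx)

    selected: List[int] = []
    for cluster in sorted(by_cluster):
        # Re-rank the cluster from highest to lowest score.
        ranked = sorted(by_cluster[cluster], key=lambda i: scores[i], reverse=True)
        keep = max(1, math.ceil(rate * len(ranked)))  # at least one per cluster
        selected.extend(ranked[:keep])
    return sorted(selected)


# Hypothetical inputs: six samples in two clusters, with precomputed scores.
example_clusters = [0, 0, 0, 0, 1, 1]
example_scores = [0.9, 0.2, 0.7, 0.4, 0.5, 0.8]
print(select_top_per_cluster(example_clusters, example_scores, rate=0.5))
```

With `rate` set between 0.3 and 0.4, this corresponds to the sampling range in which Figure 2 reports peak performance.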