Code Less, Align More: Efficient LLM Fine-tuning for Code Generation with Data Pruning

Yun-Da Tsai, Mingjie Liu, Haoxing Ren

TL;DR

This paper tackles the high data requirements of code-focused LLM fine-tuning by introducing a scalable data pruning pipeline that uses embedding-based dimension reduction, clustering, and pruning metrics to select a representative subset. It shows that large synthetic code datasets contain substantial redundancy, with 10% of data retaining most benchmark performance and even 1% yielding strong gains over a base model on several tasks. The main contributions include a thorough experimental evaluation across clustering algorithms and pruning metrics, ablation studies on PCA and embedding inputs, and concrete guidance on achieving data-efficient code generation. The results suggest that carefully pruned data can reduce compute costs while preserving or enhancing code generation quality, enabling more accessible and scalable fine-tuning of code LLMs.

Abstract

Recent work targeting large language models (LLMs) for code generation has demonstrated that increasing the amount of training data through synthetic code generation often leads to exceptional performance. In this paper we explore data pruning methods aimed at enhancing the efficiency of model training specifically for code LLMs. We present techniques that integrate various clustering and pruning metrics to selectively reduce training data without compromising the accuracy and functionality of the generated code. We observe significant redundancies in synthetic training data generation, where our experiments demonstrate that benchmark performance can be largely preserved by training on only 10% of the data. Moreover, we observe consistent improvements in benchmark results through moderate pruning of the training data. Our experiments show that these pruning strategies not only reduce the computational resources needed but also enhance the overall quality of code generation.
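The pipeline described above (embed, reduce dimension, cluster, then prune within clusters) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the embeddings are random placeholders, a plain k-means stands in for the clustering step (the paper also evaluates HDBSCAN), and greedy farthest-point selection stands in for a diversity-style pruning metric. The function names `pca_reduce`, `kmeans`, `farthest_point_select`, and `prune` are hypothetical and are not taken from the paper.

```python
import numpy as np


def pca_reduce(X, n_components):
    """Project the rows of X onto the top principal components (via SVD)."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:n_components].T


def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; an assumed stand-in for the clustering step."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return labels


def farthest_point_select(X, idx, n_keep):
    """Greedily keep points of one cluster that lie far from those already kept."""
    pts = X[idx]
    # Start from the point closest to the cluster centroid.
    first = int(np.linalg.norm(pts - pts.mean(axis=0), axis=1).argmin())
    kept = [first]
    min_dist = np.linalg.norm(pts - pts[first], axis=1)
    min_dist[first] = -1.0  # mark as taken so it is never picked again
    while len(kept) < n_keep:
        nxt = int(min_dist.argmax())
        kept.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(pts - pts[nxt], axis=1))
        min_dist[nxt] = -1.0
    return idx[kept]


def prune(embeddings, keep_ratio, n_components=8, n_clusters=4):
    """Return sorted indices of the samples retained after pruning."""
    Z = pca_reduce(embeddings, n_components)
    labels = kmeans(Z, n_clusters)
    selected = []
    for j in np.unique(labels):
        idx = np.flatnonzero(labels == j)
        n_keep = max(1, round(keep_ratio * len(idx)))
        selected.append(farthest_point_select(Z, idx, n_keep))
    return np.sort(np.concatenate(selected))


# Placeholder data: 200 fake "instruction embeddings" of dimension 32.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 32))
kept = prune(embeddings, keep_ratio=0.1)
print(len(kept), "of", len(embeddings), "samples kept")
```

Pruning per cluster rather than over the whole dataset keeps every cluster represented in the retained subset, which is the intuition behind selecting a representative sample. In practice the placeholder array would be replaced by embeddings of the real instruction-following data.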

Paper Structure

This paper contains 28 sections, 3 equations, 8 figures, 3 tables, 1 algorithm.

Figures (8)

  • Figure 1: Overview of efficient data pruning for fine-tuning LLMs with large-scale datasets. First, we encode the instruction-following data into embeddings and reduce the dimension of the feature representation. Second, we apply clustering to identify and group similar data samples. Finally, we apply pruning metrics to further reduce the data size.
  • Figure 2: Performance comparison of the HDBSCAN-diversity and nocluster-random methods across different benchmarks. Our strategy outperforms the baseline across different datasets by a large margin. We also maintain better or equivalent performance compared to the full dataset even at 10% of its size on MBPP. The $pass@1$ metric is plotted against varying compression ratios, demonstrating the robustness and effectiveness of the method. HumanEval presents larger variance across experiments, possibly because it contains fewer problems.
  • Figure 3: Comparison of performance under extreme data pruning conditions on the MBPP and HumanEval benchmarks. The $pass@1$ score on MBPP shows that even with just 1% of the data, our method achieves nearly equivalent performance to the full dataset, with a 4.1% improvement over the base model. On the HumanEval benchmark, while the performance with 1% of the data degrades compared to full-dataset training, it still achieves a 17.0% improvement over the base model.
  • Figure 4: $pass@1$ on the MBPP benchmark across different clustering algorithms and varied compression ratios of the training dataset. HDBSCAN demonstrates strong robustness, maintaining higher $pass@1$ scores than the full dataset at a compression ratio of 90%.
  • Figure 5: Comparison of different pruning metrics using the HDBSCAN clustering algorithm. The diversity metric has a marginal advantage, but its benefit may be limited and dependent on the clustering algorithm.
  • ...and 3 more figures
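
The captions above report $pass@1$. As a reference point, the sketch below computes the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n samples are generated per problem and c of them pass the tests. Whether the paper uses this estimator or a single greedy sample per problem is not stated in this section, so treat this as an assumption; the helper names and the example counts are hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem: n samples, c of them correct."""
    if n - c < k:
        # Every size-k subset of the samples contains at least one correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k):
    """Average pass@k over (n, c) pairs, one pair per benchmark problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Made-up counts for three problems, ten samples each.
results = [(10, 3), (10, 0), (10, 10)]
score = benchmark_pass_at_k(results, k=1)
print(f"pass@1 = {score:.3f}")
```

For k = 1 the estimator reduces to the fraction of correct samples, c / n, averaged over problems.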