Code Less, Align More: Efficient LLM Fine-tuning for Code Generation with Data Pruning
Yun-Da Tsai, Mingjie Liu, Haoxing Ren
TL;DR
This paper tackles the high data requirements of code-focused LLM fine-tuning by introducing a scalable data pruning pipeline that uses embedding-based dimension reduction, clustering, and pruning metrics to select a representative subset. It shows that large synthetic code datasets contain substantial redundancy, with 10% of data retaining most benchmark performance and even 1% yielding strong gains over a base model on several tasks. The main contributions include a thorough experimental evaluation across clustering algorithms and pruning metrics, ablation studies on PCA and embedding inputs, and concrete guidance on achieving data-efficient code generation. The results suggest that carefully pruned data can reduce compute costs while preserving or enhancing code generation quality, enabling more accessible and scalable fine-tuning of code LLMs.
Abstract
Recent work targeting large language models (LLMs) for code generation has demonstrated that increasing the amount of training data through synthetic code generation often leads to exceptional performance. In this paper, we explore data pruning methods aimed at making model training more efficient, specifically for code LLMs. We present techniques that combine various clustering algorithms and pruning metrics to selectively reduce training data without compromising the accuracy and functionality of the generated code. We observe significant redundancy in synthetically generated training data: our experiments show that benchmark performance can be largely preserved by training on only 10% of the data. Moreover, we observe consistent improvements in benchmark results under moderate pruning of the training data. Our experiments show that these pruning strategies not only reduce the computational resources needed but also enhance the overall quality of code generation.
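The pipeline summarized above (embed, reduce dimensionality, cluster, then prune within clusters) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`pca_reduce`, `kmeans`, `prune`), the random stand-in embeddings, the cluster count, and the centroid-distance pruning metric are all assumptions chosen for illustration. The paper evaluates several clustering algorithms and pruning metrics that are not reproduced here.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project rows of X onto the top principal components (computed via SVD)."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:n_components].T

def assign(X, centers):
    """Return the index of the nearest center for every row of X."""
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign(X, centers)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return assign(X, centers), centers

def prune(embeddings, keep_ratio, k=8, n_components=2, seed=0):
    """Return sorted indices of retained samples.

    Assumed (hypothetical) pruning metric: within each cluster, keep the
    samples closest to the centroid, in proportion to the cluster size.
    """
    Z = pca_reduce(embeddings, n_components)
    labels, centers = kmeans(Z, k, seed=seed)
    keep = []
    for j in range(k):
        members = np.flatnonzero(labels == j)
        if len(members) == 0:
            continue
        n_keep = max(1, round(keep_ratio * len(members)))
        d = np.linalg.norm(Z[members] - centers[j], axis=1)
        keep.extend(members[np.argsort(d)[:n_keep]].tolist())
    return sorted(keep)

# Random vectors stand in for embeddings of instruction/code samples.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 16))
kept = prune(embeddings, keep_ratio=0.1)
print(len(kept), "of", len(embeddings), "samples retained")
```

Selecting per cluster rather than globally keeps every region of the embedding space represented in the pruned subset, which is the intuition behind removing redundant samples while preserving coverage. In practice the embeddings would come from a text or code embedding model, and the retained indices would select the fine-tuning subset.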
